The article takes a while to get to the point: the NIH database has different class labels that, to a practicing radiologist, are visually identical, yet several ML algorithms seem to distinguish them, and the author is puzzled by that. Still, it is a very nice read for a layperson like me because it gives good background, and it is very well written. Looking forward to reading the other articles on that blog.
As far as I understand, the fact that the conditions are visually identical also means that the quality of training data is in question.
In case anyone runs into this old thread via search: the author has a follow-up post that reaches that conclusion, namely that the ChestXray14 dataset should be treated with skepticism.