1. 2

    Also relevant here is the NIST TREC CAR track, which provided a comprehensive structured parse of Wikipedia. The MediaWiki parser is available, as are the tools used to process the extraction, although, being internal tools, their documentation could be better.

    Full disclosure: I have previously served as an organizer of TREC CAR and was responsible for much of the data pipeline.

    1. 2

      Thanks, that’s super-interesting, and I haven’t seen it previously!

      (As a side note, I went down the road of MediaWiki markup parsing for some time, too, but at some point decided it wouldn’t work. The reasons are in WikipediaQL’s README.)

    1. 4

      I’m not familiar with the development of the Linux kernel, but shouldn’t all commits be reviewed by a human contributor before entering the source tree? I mean, if the culprits from UMN hadn’t published that paper, would these invalid commits have gone unnoticed for good? In that case, any malicious user could sign up for an email account and inject garbage or even a backdoor into the kernel, which sounds like a big problem in the review process.

      1. 17

        I’m a former kernel hacker. Some malicious commits were found by human review. Humans are not perfect at finding bugs.

        As I understand it, the vast majority of kernel memory bugs are found by automated testing techniques. This isn’t going to change as long as the kernel is written mostly without automatic memory safety.
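
        To make that concrete, here is a minimal userspace sketch (not real kernel code; the struct and function names are invented) of the kind of use-after-free that is easy to overlook when reading a diff, but that a runtime sanitizer (KASAN in the kernel, AddressSanitizer here) reports the moment the freed memory is touched:

        ```c
        /* Hedged illustrative sketch: userspace stand-in, invented names.
         * The bug below compiles cleanly and looks plausible in a diff,
         * but AddressSanitizer flags it at runtime as a heap-use-after-free.
         * Build: gcc -fsanitize=address uaf.c && ./a.out
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        struct session {
            char *buf;
        };

        static struct session *session_new(const char *msg)
        {
            struct session *s = malloc(sizeof(*s));
            s->buf = strdup(msg);
            return s;
        }

        static void session_teardown(struct session *s)
        {
            free(s->buf);
            free(s);
        }

        static int handle_error(struct session *s)
        {
            session_teardown(s);
            /* Bug: s and s->buf were freed one line above; this read is a
             * use-after-free, yet the line looks harmless in isolation. */
            return printf("error in session: %s\n", s->buf);
        }

        int main(void)
        {
            return handle_error(session_new("boom")) < 0;
        }
        ```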

        1. 6

          Thanks for the input, but I was not talking about detecting bugs in kernel code written in good faith. What surprises me is that the kernel maintainers seem to assume every patch to be helpful, and merge them without going through much human review. The result was that dozens of low-effort garbage patches were easily sneaked into the kernel (until the paper’s acceptance at a conference drew attention). Software engineers typically don’t trust user input, and a component as fundamental as the kernel deserves even more caution, so the kernel community’s review process sounds a little sloppy to me :/

          1. 5

            > the kernel maintainers seem to assume every patch to be helpful, and merge them without going through much human review.

            You seem to be assuming up front the conclusion you want to draw.

            1. 1

              If the patches were carefully reviewed by some human on submission, why didn’t the reviewer reject them? Well, maybe there are some ad hoc human reviews, just not effective enough. These bogus commits went unnoticed until the publication of the paper, so it’s not like the kernel community is able to reject useless/harmful contributions by itself.

              1. 3

                > If the patches were carefully reviewed by some human on submission, why didn’t the reviewer reject them?

                Because there are any number of factors that could explain why a bug might not be caught, especially in a language with an infamous reputation for making memory-safety bugs easy to write and hard to catch. Treating a single factor as the only possible explanation is not good practice.
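
                For instance, here is a hedged sketch (invented names, plain malloc()/free() as stand-ins for the kernel allocators) of how a one-line “leak fix” in an error path can silently introduce a double free when the callee already frees its argument on failure, a contract that is invisible in the diff a reviewer actually sees:

                ```c
                /* Hedged sketch, not real kernel code: the names and the
                 * dev_register() contract are invented to illustrate why
                 * such bugs are hard to catch in review. */
                #include <stdlib.h>

                struct dev {
                    int id;
                };

                /* Contract (documented far from any call site): on failure,
                 * dev_register() frees the device it was handed. */
                static int dev_register(struct dev *d)
                {
                    if (d->id < 0) {
                        free(d);
                        return -1;
                    }
                    return 0;
                }

                int probe(int id)
                {
                    struct dev *d = malloc(sizeof(*d));
                    if (!d)
                        return -1;
                    d->id = id;
                    if (dev_register(d)) {
                        /* The added cleanup looks like it fixes a leak, but d
                         * was already freed inside dev_register(): double free. */
                        free(d);
                        return -1;
                    }
                    return 0;
                }
                ```

                Nothing on the changed line is wrong in isolation; the defect exists only because of the callee’s behavior, which is exactly why automated tooling complements, rather than replaces, human review.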

                1. 1

                  As others have said, review is not easy. This is especially true when done under time pressure, as is essentially always the case in FOSS development. Pre-merge code review is a first line of defense against bad commits; no one expects it to catch everything.