I’ve been meaning to revisit approaches to mass-parsing of data from Wikipedia. Back in 2015, I was able to put together a map of airline routes weighted by number of passengers. https://tech.marksblogg.com/popular-airline-passenger-routes.html
There is a fair amount of dispersed information on Wikipedia that is formatted uniformly enough that a basic set of parsing rules can take you far. Python was a huge help for this, as you’re forever trying to hack together more robust rules. The simple syntax for juggling lists and manipulating strings makes the whole task much more approachable.
I remember using Wikipedia as training material in my bachelor’s thesis, a decade or so ago. I just needed a BoW (bag of words), so I parsed the raw XML files, extracted the plain text, and thought everything was fine. It was fed into a few different LSA-like (latent semantic analysis) models. LSA is pretty cool, but it isn’t AI, so it’s probably out of favor these days. :-)
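For anyone curious what that kind of extraction looks like, here is a minimal sketch in Python. It is not the code from the thesis: the sample dump, the regex rules, and the helper names (strip_markup, bag_of_words) are all made up for illustration, and the markup stripping is deliberately crude (nested templates, tables, and references are not handled).

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for a dump file (hypothetical content); a real dump is far
# larger and its elements carry an XML namespace.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>Example</title>
    <revision><text>'''Example''' is a [[small city|city]] in {{country}}. The city is small.</text></revision>
  </page>
</mediawiki>"""


def strip_markup(wikitext):
    """Crude markup removal; nested templates and tables are not handled."""
    text = re.sub(r"\{\{[^{}]*\}\}", " ", wikitext)                 # drop simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # keep link labels
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quotes
    return text


def bag_of_words(xml_stream):
    """Stream through the dump and count lowercase ASCII words per <text>."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # ignore any XML namespace prefix
        if tag == "text" and elem.text:
            words = re.findall(r"[a-z]+", strip_markup(elem.text).lower())
            counts.update(words)
        if tag == "page":
            elem.clear()                    # free memory for processed pages
    return counts


bow = bag_of_words(io.BytesIO(SAMPLE_DUMP.encode("utf-8")))
```

Streaming with iterparse rather than loading the whole tree is the main design choice here, since a full dump will not fit comfortably in memory.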
The number of articles covering small American cities is so large that it skews the entire corpus, at least for my use case. I had to remove them.
Also relevant here is the NIST TREC CAR track, which provided a comprehensive structured parse of Wikipedia. The MediaWiki parser is available. The tools used to process the extraction are also available, although, being internal tools, their documentation could be better.
Full disclosure: I have previously served as an organizer of TREC CAR and was responsible for much of the data pipeline.
Thanks, that’s super-interesting, and I haven’t seen it previously!
(As an aside, for some time I went down the road of MediaWiki markup parsing, too, but at some point decided it wouldn’t work. The reasons are in WikipediaQL’s README.)
Indeed the markup parsing was quite a challenge. However, on the whole I’m pretty happy with how it turned out. It’s by no means perfect (essentially an impossible goal, given the difficulty of the problem) but it’s good enough to make sense of most pages that can be rendered.
Because we constructed the parser as a PEG parser, we could rely liberally on backtracking. I don’t think this problem would have been tractable otherwise. That being said, it’s not particularly fast. IIRC, it takes around half a day to do a full extraction from enwiki on a fairly large (e.g. 28-core) machine.
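To illustrate why backtracking helps with this kind of markup, here is a toy PEG-style sketch in Python. It is not the TREC CAR parser: the combinators (literal, seq, choice, many, until, tagged) and the three-rule grammar are hypothetical, written only to show ordered choice falling back when a construct such as an unclosed link fails to match.

```python
def literal(s):
    """Match the exact string s at the current position."""
    def parse(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parse


def seq(*parsers):
    """Match all parsers in order; fail as a whole if any one fails."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None          # caller still holds its own pos: backtrack
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def choice(*parsers):
    """PEG ordered choice: the first alternative that succeeds wins."""
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)    # every alternative restarts at the same pos
            if result is not None:
                return result
        return None
    return parse


def many(parser):
    """Apply parser zero or more times, stopping at the first failure."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None or result[1] == pos:
                return values, pos
            value, pos = result
            values.append(value)
    return parse


def until(marker):
    """Consume at least one character up to, but not including, marker."""
    def parse(text, pos):
        end = text.find(marker, pos)
        if end <= pos:               # marker missing (-1) or nothing to consume
            return None
        return text[pos:end], end
    return parse


def tagged(name, parser, index):
    """Label the index-th value of a successful seq result with name."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        values, new_pos = result
        return (name, values[index]), new_pos
    return parse


def plain(text, pos):
    """Fallback rule: any single character is plain text."""
    return (("text", text[pos]), pos + 1) if pos < len(text) else None


bold = tagged("bold", seq(literal("'''"), until("'''"), literal("'''")), 1)
link = tagged("link", seq(literal("[["), until("]]"), literal("]]")), 1)
inline = many(choice(bold, link, plain))

nodes, end_pos = inline("'''a''' [[b", 0)
```

In this sketch the unclosed "[[b" first starts matching as a link, fails when no "]]" is found, and is then re-read from the same position as plain characters. That fallback is what lets a parser keep going on malformed pages, and it is also a hint at why heavy backtracking costs time.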