In a similar vein, I wrote something in golang to grab news articles and dump them to stdout (based on the original arc90 readability):
Disclaimers: this is about the first golang I ever wrote, part of a bigger project, totally undocumented and unidiomatic, and it will probably make your computer explode if you run it.
I’m puzzled. It says it’s a reimplementation of readability.js, but it uses libxml for parsing? Uh, no. That won’t get you good HTML parsing. There ought to be compliant, real-world-capable HTML5 parsers out there, no?
Python has html5lib, Rust has html5ever. Both are super well tested.
As someone who doesn’t know much about HTML parsing libraries, but has used Nokogiri (libxml2 based) for some basic scraping scripts, what’s wrong with libxml2?
[Comment removed by author]
Right, but Nokogiri (libxml2-based) has an HTML mode, which has parsed HTML5 just fine for me in the past.
Sure. Depends on the web page. If you want to be web compatible though, I’d personally shy away from it.
Can you elaborate on what “real world capable” and “web compatible” actually mean?
libxml2 in HTML mode is the most competent HTML tag-soup parser I’ve ever used.
There is gumbo, which is packaged for most distributions (gumbo-dev, etc).
You can do HTML with libxml: http://xmlsoft.org/html/libxml-HTMLparser.html
HTML4 though. 😕
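For anyone following along who wants to see why a strict XML parser and an HTML parser behave so differently on tag soup, here is a minimal sketch. It uses only Python's standard library (`xml.etree.ElementTree` and `html.parser`) as stand-ins, so it says nothing about how libxml2's HTML mode, html5lib, html5ever, or gumbo handle the same input. The sample markup and the helper names (`SOUP`, `TagCollector`, `xml_accepts`, `html_events`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unclosed <p>, void <br>, unclosed <b>.
SOUP = "<p>first<br><p>second <b>bold</p>"


class TagCollector(HTMLParser):
    """Records the token events the lenient stdlib HTML tokenizer emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def xml_accepts(markup):
    """Return True if a strict XML parser accepts the markup as well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def html_events(markup):
    """Tokenize markup with the lenient HTML parser and return its events."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    print("XML parser accepts it:", xml_accepts(SOUP))
    for event in html_events(SOUP):
        print(event)
```

Note that `html.parser` only tokenizes; it does not build a tree or apply the HTML5 error-recovery rules (such as implicitly closing the first `<p>`). That tree-construction step is the part the HTML5 spec pins down, and it is where parsers can disagree with browsers.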