I've extracted a lot of data from webpages and HTML files. I didn't know this tool existed, so a few years ago I wrote my own that does essentially the same thing.
Coincidentally, jq was also my main source of inspiration.
One-liner to CSS-select stuff from the URL in the clipboard:
echo "data:text/html;base64,$(curl "$(xclip -o -sel clip)" | pup "$(rofi -dmenu)" | base64 -w0)" | xargs $BROWSER
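The one-liner above works by wrapping the selected HTML in a base64 `data:` URL that the browser can open directly. A minimal Python sketch of just that wrapping step follows; the function name `html_data_url` and the sample fragment are made up for illustration and are not part of the original command.

```python
import base64


def html_data_url(fragment: str) -> str:
    """Wrap an HTML fragment in a base64 data: URL (hypothetical helper)."""
    # Same idea as `base64 -w0` in the shell version: one unbroken base64 string.
    encoded = base64.b64encode(fragment.encode("utf-8")).decode("ascii")
    return "data:text/html;base64," + encoded


if __name__ == "__main__":
    # Sample fragment standing in for whatever the CSS selector matched.
    print(html_data_url("<p>hi</p>"))
```

Base64 is used here because it sidesteps escaping quotes, spaces, and `#` characters in the URL.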
I used to do web scraping at the command line… But I've found it is pretty nice to do it in Chrome dev tools. The feature where you hover over an element and it jumps to the HTML is killer.
Then you see the CSS class or ID, and you write the CSS selector in the JS console, and it pretty prints the DOM elements. It’s much easier to debug the scraping logic this way.
And then I dynamically create an HTML form and post it to a web server :)
Example here: https://github.com/oilshell/hashdiv/blob/master/stories.js
which I used to write this blog post: http://www.oilshell.org/blog/2021/01/comments-parsing.html (and there will be several more like this, hence the tool)
Ever since I made soupault, I’ve been thinking about whether a “hed” (HTML + sed) tool suitable for one-off HTML manipulation jobs is a logical possibility and whether anyone wants it. Your extraction UI looks good. Manipulation UI usable in one-liners is an open question.
Hypothetically, to rename all <blink> elements:
cat index.html | pup 'blink | forEach (setTagName "span" | addClass "blink")'
I’ve always thought the idea of filtering the page content based on CSS selectors was quite brilliant.
Tools like this definitely come in handy. I was recently curious how my Twitter follower count changed over time, so I made this script that scrapes my Nitter profile page (I chose to scrape Nitter instead of Twitter because the markup was actually sane). I used the scraper tool to do the HTML parsing and querying. It appears to be similar to the tool linked here, but implemented in Rust.
Nice tool, but at this point why not use XPath? It's more powerful and has been standardized, so there are libraries available for most languages out there, and there's an implementation right in the browser (document.evaluate()).
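As a rough illustration of the "more powerful" point, here is a small Python sketch that picks links by the text of a sibling heading, something standard CSS selectors have no text-matching equivalent for. It uses the standard library's ElementTree, which implements only a subset of XPath and needs well-formed markup; the sample page and the function name `links_under_heading` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for this illustration.
PAGE = """
<html><body>
  <div class="section"><h2>Stories</h2><a href="/s/1">First</a></div>
  <div class="section"><h2>Comments</h2><a href="/c/1">Second</a><a href="/c/2">Third</a></div>
</body></html>
"""


def links_under_heading(markup: str, heading: str) -> list:
    """Return hrefs of <a> children of the <div> whose <h2> has the given text."""
    root = ET.fromstring(markup.strip())
    # [h2='...'] keeps only <div> elements that have an <h2> child with that text;
    # the trailing /a then selects the links directly inside those <div>s.
    # Note: the heading is interpolated naively, so it must not contain quotes.
    query = f".//div[h2='{heading}']/a"
    return [a.get("href") for a in root.findall(query)]


if __name__ == "__main__":
    print(links_under_heading(PAGE, "Comments"))
```

Real-world HTML is rarely well-formed XML, so in practice this would need a forgiving HTML parser in front of it; the query itself is the part being illustrated.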