1. 63
  1.  

  2. 11

    I wrote this some years ago to scratch my own itch: https://github.com/plainas/tq

    It has been invaluable. I have used it hundreds of times to fetch information from websites and HTML documents.

    1. 7

      I can’t count how many times I’ve done it using BeautifulSoup in a Python REPL. I never bothered to generalize it and wrap it in a command-line interface, though. Great idea.
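
For readers without BeautifulSoup at hand, the same kind of one-off extraction can be sketched with only the Python standard library. This is a rough illustration, not how tq or htmlq work internally: the `TagCollector` class, the `select` helper, and the sample markup are all hypothetical, and matching is limited to a tag name plus an optional class rather than full CSS selectors.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect an attribute value, or the inner text, of matching elements.

    Hypothetical helper: matches on tag name and an optional class only.
    """

    def __init__(self, want_tag, want_class=None, want_attr=None):
        super().__init__()
        self.want_tag = want_tag
        self.want_class = want_class
        self.want_attr = want_attr
        self.results = []
        self._buffer = None  # list of text chunks while inside a match

    def handle_starttag(self, tag, attrs):
        # Ignore other tags, and ignore nested matches while collecting text.
        if tag != self.want_tag or self._buffer is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self.want_class and self.want_class not in classes:
            return
        if self.want_attr:
            value = attrs.get(self.want_attr)
            if value is not None:
                self.results.append(value)
        else:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.want_tag and self._buffer is not None:
            self.results.append("".join(self._buffer).strip())
            self._buffer = None


def select(html, tag, css_class=None, attribute=None):
    """Return attribute values (if `attribute` is given) or text of matches."""
    parser = TagCollector(tag, css_class, attribute)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample markup, loosely shaped like a link listing.
SAMPLE = """
<table><tr>
  <td class="title"><a class="story" href="https://example.com/a">First</a></td>
  <td class="title"><a class="story" href="https://example.com/b">Second <b>post</b></a></td>
  <td><a href="/login">login</a></td>
</tr></table>
"""

if __name__ == "__main__":
    print(select(SAMPLE, "a", css_class="story", attribute="href"))
    print(select(SAMPLE, "a", css_class="story"))
```

Wrapping `select` in `argparse` and reading from standard input would be the remaining step toward a command-line tool; a real selector engine handles far more than this tag-and-class match.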

      1. 2

        I made one of these before as well: https://github.com/osener/wring. It has some extra bells and whistles such as scripting, screenshots and an optional browser backend (unfortunately puppeteer didn’t exist at the time).

        This sort of thing is pretty useful for statusbars such as i3bar, polybar or xmobar

      2. 10
        1. 4

          This one seems to have more expressive selector syntax than htmlq.

          Edit: Also, pup’s “json{}” output formatter means that if pup isn’t sufficiently expressive for what I need, I can probably pipe its output to jq and get what I want.

          1. 12

            htmlq uses Servo’s CSS selector engine underneath, so I think everything on pup’s readme should work with htmlq. For example:

            curl -s https://news.ycombinator.com/ | htmlq 'table table tr:nth-last-of-type(n+2) td.title a'
            

            If you do this instead you can just get the links themselves rather than the entire element:

            curl -s https://news.ycombinator.com/ | htmlq 'table table tr:nth-last-of-type(n+2) td.title a' --attribute href
            

            Full disclosure: I wrote htmlq, and I’m envious of pup’s README.

            1. 1

              Hah, nice! Thank you.

        2. 9

          This is nice. Though, you could do some of the same things with XPath. For example, the first two examples could be done using xmllint:

          $ curl -s https://www.rust-lang.org/ | xmllint --html --xpath "//*[@id='get-help']" -
          $ curl -s https://www.rust-lang.org/ | xmllint --html --xpath "//@href" -
          

          Unfortunately, xmllint doesn’t support HTML5, and complains about the <header> and <main> tags in the above example.
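
The two XPath queries above can also be approximated in Python's standard library, as a sketch under some assumptions. `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed XML, so it is no better than xmllint on real-world HTML5. It also cannot select attribute nodes such as `//@href`, so the sketch selects the elements carrying the attribute and reads it afterwards. The sample document and the helper names `find_by_id` and `all_hrefs` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample; real pages usually need an HTML-aware parser.
DOC = """<html><body>
<div id="get-help"><a href="/help">Get help</a></div>
<p><a href="/learn">Learn</a></p>
</body></html>"""


def find_by_id(root, element_id):
    """Rough equivalent of the XPath //*[@id='...']."""
    return root.findall(f".//*[@id='{element_id}']")


def all_hrefs(root):
    """Rough equivalent of //@href: find elements with href, then read it."""
    return [el.get("href") for el in root.iterfind(".//*[@href]")]


if __name__ == "__main__":
    root = ET.fromstring(DOC)
    for node in find_by_id(root, "get-help"):
        print(ET.tostring(node, encoding="unicode"))
    print(all_hrefs(root))
```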

          1. 5

            XPath is really powerful, but it’s also really hard to grok (because it was designed for a much more powerful use case). I wonder if something that used the jQuery selector format wouldn’t be more appreciated :)

              1. 2

                Hxselect is super handy

                1. 2

                  This tool is great, and already in Debian. Interesting that you need to give it XML syntax, but it comes with tools hxclean and hxnormalize -x that solve that for you.

                  1. 1

                    Wow! Fantastic, thanks!

              2. 4

                Also: https://github.com/coderobe/hq

                And for others looking for something a bit different, xml2 is a package that converts XML/HTML structures to a flat directory structure, and vice versa.

                1. 3

                  I made something slightly similar:

                  https://github.com/charmparticle/xpe

                  It lets you execute XPath 2 queries against stuff, including HTML.

                  1. 3

                    This might be one of the coolest things I’ve seen in a while.

                    1. 2

                      Nice! Could be useful in many scenarios, e.g. for exporting data from a WordPress site. A while ago I did a little project doing just that; the goal was to move articles from WordPress to Hugo. I didn’t like the ready-made solutions I found, so I ended up writing a custom thingy in Bash instead. htmlq could be very useful for stuff like that.

                      1. 1

                        Very cool. I’ve needed to use much more complicated tools just to find some information in web pages.