1. 10
  1.  

  2. 5

    This post was made roughly a month ago.

    After experimenting with the various options for headless web crawling:

    • Thanks to its command-line flags, Headless Chrome is probably the best choice for extremely basic web crawling: use cases where you only need to render the page with JS and don’t need any custom interaction with the page itself (e.g., just execute the webpage’s built-in JS and then get the resulting DOM as HTML).

    • Despite no longer being maintained, PhantomJS is a very solid setup for smooth/minimalist web crawling with fairly light scripting/JS interaction. In contrast to Headless Chrome, it’s easy to interact with webpages without bringing in external libraries like Selenium, e.g., to load a webpage, press a button, and then get the resulting HTML.

    • Headless Firefox is solidly adequate; nothing in particular stood out while using it. PhantomJS feels lighter-weight and quicker to get up and running for small-to-medium-sized scripts, and the Headless Chrome CLI was superior for small/quick one-liners. If you’re writing a medium-to-large web crawling tool, those considerations are likely to matter less, and in such cases Headless Firefox probably works just as well as the others.
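As a rough sketch of the "render with JS, then grab the DOM" case from the first bullet: Chrome's `--headless` and `--dump-dom` flags print the rendered DOM to stdout, so a thin wrapper is enough. The binary name `google-chrome` and the helper names `dump_dom_command` and `rendered_html` are assumptions for illustration; the executable is named differently on some platforms.

```ruby
require 'open3'

# Build the argv for a headless Chrome run that prints the rendered DOM.
# ASSUMPTION: 'google-chrome' is on PATH; pass chrome_bin: to override.
def dump_dom_command(url, chrome_bin: 'google-chrome')
  [chrome_bin, '--headless', '--dump-dom', url]
end

# Run Chrome and return the rendered HTML, raising if Chrome exits non-zero.
def rendered_html(url, chrome_bin: 'google-chrome')
  html, status = Open3.capture2(*dump_dom_command(url, chrome_bin: chrome_bin))
  raise "headless Chrome exited with #{status.exitstatus}" unless status.success?
  html
end

# Example (needs Chrome installed):
#   puts rendered_html('https://example.com')
```

Passing the command as an argv array rather than a single string avoids shell quoting problems with URLs that contain `&` or `?`.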

    1. 6

      I do a fair bit of low-volume, high-fidelity scraping.

      I wouldn’t recommend PhantomJS any more, because if the page you’re scraping happens to use any recent JS features (e.g., the kind flagged as ‘does not work on IE10’), it’s not going to work on PhantomJS either.

      You probably already have Chrome or Firefox installed on your computer. The Selenium bindings are not that heavyweight, and using a real browser makes a huge difference in the amount of time you spend faffing about with obscure script errors.

      Grab the Selenium bindings for your favorite language (there’s a ton of options) - I’m partial to the Capybara DSL for Ruby - and go to town.

      For instance, my kids’ childcare posts photos of them through an app that (among other things) sets up your view based on session state, which you change by making GET requests. The images are only displayed in a JS carousel. This Ruby/Capybara script lets me archive them for myself:

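      require 'shellwords' # for Shellwords.escape below

      # Assumes the Capybara DSL is already loaded with a real-browser driver,
      # and that sanitize_filename is a helper defined elsewhere.
      visit 'https://web.myxplor.com'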
      
      fill_in 'email_address', with: '<REDACTED>'
      fill_in 'password', with: '<REDACTED>'
      click_button 'Login'
      
      # Select which kid to get pictures of, via a GET request. Yuck.
      visit "https://web.myxplor.com/parent/child_timeline/<REDACTED>"
      
      # Go to the page with the pictures
      visit "https://web.myxplor.com/observations"
      
      # Load all the pictures
      while page.has_button?('Load More')
          click_button 'Load More'
          sleep 4 # If it's stupid but it works, it isn't stupid...
      end
      
      all("a.fancyboxNew").each do |link|
          outfile = sanitize_filename(link['title']) + '.jpg'
      
          # Don't re-download files
          next if File.exist?(outfile)
          `wget -O #{Shellwords.escape outfile} #{Shellwords.escape link['href']}`
          $?.success? || warn("Failed to fetch #{link['href']}")
      end
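The script calls `sanitize_filename` without showing it. A minimal stand-in might look like the sketch below; this is a guess at what such a helper could do, not the version the script actually used, and the `'untitled'` fallback is an arbitrary choice.

```ruby
# Hypothetical stand-in for the sanitize_filename helper used above.
# Collapses anything outside a conservative character set into underscores
# and drops leading dots/underscores so the result cannot be a hidden file
# or a relative path such as "../something".
def sanitize_filename(title)
  name = title.to_s.strip.gsub(/[^A-Za-z0-9._-]+/, '_')
  name = name.sub(/\A[._]+/, '')
  name.empty? ? 'untitled' : name
end
```

With this in place, `sanitize_filename(link['title']) + '.jpg'` yields a name that is safe to pass to `File.exist?` and to `wget -O`.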
      
      
      1. 2

        The use cases that I’ve been using PhantomJS for consist primarily of:

        • One-off scripts (i.e., that aren’t intended to be maintained beyond like three days into the future);
        • That are under 50 lines; and
        • That are generally being used as part of Bash/shell scripts where input is piped into other shell utilities

        Even with the flood of comments from the previous thread noting PhantomJS’ limitations, it’s still the cleanest and easiest setup for the use case and scope of those scripts.

        Selenium is easy enough. And there’s an argument to be made that, if recommendations were being made to a novice programmer, it might be better to use that kind of setup even for the use cases mentioned above, since they might be unfamiliar with the limitations of using a library that is no longer being maintained.

        Regardless, PhantomJS has continued to be the most efficient setup for a number of lightweight use cases, particularly where the main concern is setup time, rather than task-complexity or long-term maintainability.