1. 24

  2. 23

    I love these kinds of blub studies, where someone shares their techniques for improving a specific category of task. One scraping trick he didn’t talk about: cache the results. Save every scraped page to a separate file or database field or whatever, and then run the processing on that. That way if something fails, you don’t have to hit the site again, and you can try a bunch of different processing ideas more quickly.
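
    A minimal sketch of that idea, assuming one file per page keyed by a hash of the URL. The names `CACHE_DIR`, `cache_path`, and `fetch_cached` are made up for illustration, and the plain `urlopen` call stands in for whatever HTTP client you actually use:

    ```python
    import hashlib
    from pathlib import Path
    from urllib.request import urlopen

    CACHE_DIR = Path("scrape_cache")  # hypothetical location for raw pages


    def cache_path(url, cache_dir=CACHE_DIR):
        """Map a URL to a stable file name inside the cache directory."""
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return Path(cache_dir) / f"{digest}.html"


    def fetch_cached(url, cache_dir=CACHE_DIR, fetch=None):
        """Return the raw page body, hitting the network only on a cache miss.

        `fetch` is an optional callable (url -> bytes) so the downloader can
        be swapped out; by default a plain urlopen is used.
        """
        path = cache_path(url, cache_dir)
        if path.exists():
            return path.read_bytes()
        body = fetch(url) if fetch is not None else urlopen(url).read()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)
        return body
    ```

    Processing code then only ever calls `fetch_cached(url)`, so a parser bug costs a re-run over local files rather than another round of requests to the site.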

    1. 9

      To add to that: if you’re maintaining a cache, checking whether the page has changed since your last scrape comes basically for free, and if it hasn’t, you can skip the rest of the pipeline for that page.
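
      One way this could look, assuming you keep a digest of each page body next to the cached copy. `has_changed` and the `seen` dict are hypothetical; in practice the digests would live in a file or database column rather than in memory:

      ```python
      import hashlib


      def has_changed(url, body, seen):
          """Return True if `body` differs from what was last recorded for `url`.

          `seen` maps URL -> hex digest of the last body we processed and is
          updated in place whenever new content shows up.
          """
          digest = hashlib.sha256(body).hexdigest()
          if seen.get(url) == digest:
              return False  # identical bytes: skip the rest of the pipeline
          seen[url] = digest
          return True
      ```

      This compares raw bytes, so pages that embed timestamps or session tokens will always look changed unless you hash a normalized version instead.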

    2. 3

      One of the things I like to do is help people get their data into systems they control and can manipulate (or pay someone to manipulate) for analysis. That hasn’t been my primary job throughout my career, but it is one that I enjoy so much when the opportunity presents itself that I have even been known to do it pro bono occasionally.

      The times I’ve done it lately, I have found that housing my scraper inside a Django application affords significant creature comforts that I really enjoy. I wrote a stupid little cookiejar class that sits on top of the Django ORM, has a context manager, and works with requests.

      With shell_plus, it’s a lot of fun to drive my scrapers from a REPL and iterate on one stage of the pipeline at a time, since I’ve broken it into parts.

      And there are the obvious comforts like the free UI the admin panel gives you and the low boilerplate command line wrappers from the management commands.

      Like hwayne said, caching the results of your remote fetches is extremely useful. Before I know what I want to cache, I use a caching proxy. That makes it so much easier to iterate on my processing without being rude to my source. Once I know, I keep a couple of namespaces in my database for different stages of the import. One holds the raw upstream data, and one holds tables of unprocessed data where every field is a string. Having that second stage there lets me make Postgres do a lot of work that I used to write code for. And it gives me a second spot to replay imports without re-paying for the scraping.
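
      A rough sketch of the all-strings staging idea. To keep it self-contained it uses an in-memory SQLite database as a stand-in for Postgres, and the table names, columns, and sample rows are all invented for illustration:

      ```python
      import sqlite3

      conn = sqlite3.connect(":memory:")  # stand-in for a real Postgres database

      conn.executescript(
          """
          -- Staging: every field is text, exactly as it was scraped.
          CREATE TABLE staging_listing (name TEXT, price TEXT, scraped_at TEXT);
          -- Typed table that the rest of the application reads from.
          CREATE TABLE listing (
              name TEXT NOT NULL,
              price_cents INTEGER NOT NULL,
              scraped_on TEXT NOT NULL
          );
          """
      )

      # Hypothetical scraped rows, including one with a missing price.
      rows = [
          ("Widget", " 12.50 ", "2024-01-05T10:00:00"),
          ("Gadget", "3", "2024-01-06T09:30:00"),
          ("Broken", "", "2024-01-06T09:31:00"),
      ]
      conn.executemany("INSERT INTO staging_listing VALUES (?, ?, ?)", rows)

      # Trimming, casting, and filtering happen in SQL, not in Python.
      conn.execute(
          """
          INSERT INTO listing (name, price_cents, scraped_on)
          SELECT name,
                 CAST(ROUND(CAST(TRIM(price) AS REAL) * 100) AS INTEGER),
                 DATE(scraped_at)
          FROM staging_listing
          WHERE TRIM(price) <> ''
          """
      )

      result = conn.execute(
          "SELECT name, price_cents, scraped_on FROM listing ORDER BY name"
      ).fetchall()
      ```

      Because the staging rows stick around, changing a cleaning rule means rerunning the `INSERT ... SELECT`, not the scrape. In Postgres the two stages could live in separate schemas, which is one reading of "namespaces" here.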

      1. 2

        There’s an article with a much broader scope that I really enjoy, Building data liberation infrastructure by @karlicoss, which discusses the pros and cons of many methods (and gives reasons, with examples, against many different kinds of APIs).