1. 10

  2. 4

    Web scraping means extracting data from websites: fetching pages, selecting the relevant parts, and presenting them in a readable or parsable format.
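    The "select the relevant parts" step can be sketched with nothing but the standard library; here's a minimal example that pulls the `<title>` out of an HTML document (the page content is made up for illustration):

    ```python
    from html.parser import HTMLParser

    class TitleExtractor(HTMLParser):
        """Collect the text inside the <title> tag of an HTML document."""

        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    # Hypothetical fetched page; in practice this string would come from
    # an HTTP response body.
    html = "<html><head><title>Example Page</title></head><body>...</body></html>"
    parser = TitleExtractor()
    parser.feed(html)
    print(parser.title)  # Example Page
    ```

    Real scrapers usually reach for heavier parsers, but the shape is the same: fetch, parse, select, emit.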

    Web scraping is also often against the TOS of sites with valuable data. Be careful if you profit from it, and don’t be surprised if you find your IPs blacklisted or throttled.

    General advice on scraping: if there is an API or licensing route provided, use it! Respect robots.txt – I am still waiting for the day a scraping tutorial mentions something like Python’s stdlib urllib.robotparser, which makes that trivial. Make sure your use of the data falls under fair use of any copyrighted content. Finally, crawl at a non-disruptive rate, so that you neither breach a contract (in the form of terms of use) nor commit a crime (as defined in the Computer Fraud and Abuse Act).
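    For the curious, a minimal sketch of urllib.robotparser in action; the robots.txt content and the user-agent name are made up for illustration:

    ```python
    from urllib import robotparser

    # Hypothetical robots.txt; against a live site you would instead call
    # rp.set_url("https://example.com/robots.txt") followed by rp.read().
    robots_txt = """\
    User-agent: *
    Disallow: /private/
    Crawl-delay: 10
    """

    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())

    print(rp.can_fetch("MyScraperBot", "https://example.com/private/secret.html"))  # False
    print(rp.can_fetch("MyScraperBot", "https://example.com/public/page.html"))     # True
    print(rp.crawl_delay("MyScraperBot"))  # 10
    ```

    One `can_fetch` check before each request, plus honoring `crawl_delay`, covers most of the politeness a tutorial ever asks for.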

    Some will insist that legal agreements aren’t usually given with sufficient notice to hold up in court, but clickwraps do, and some browsewraps will (they are evaluated individually in court). If your scraper uses credentials, you are in more difficult legal territory as well (an automated tool masquerading as a user), and you have almost certainly gone through a clickwrap TOS agreement. You should also be careful to avoid hot news misappropriation, wherein you scrape data and quickly provide it, or a competing service, for free or at a lower cost, thereby taking value from the original source.

    There’s also trespass to chattels and many other legal ways to be hoist by your own petard – even if you never profit from the scraping or strictly use it for metadata and analytics! Several cases every year refine our definitions of what’s allowed, but the tide has turned in favor of the content owners since Facebook v. Power in 2009, rightfully making many startups gun-shy when it comes to scraping.

    Further reading:

    1. 2

      Thanks for your insights on the legal perspective! I have never tried to monetize scraped content, and IANAL, so I did not cover the legal side of scraping. What do you think about the LinkedIn lawsuit in August? https://www.infoq.com/news/2017/08/linkedin-ruling-scraping

      I agree that official APIs should be used; that is why I mentioned it in the introduction. I also agree with you that services and their users should be respected. It is not okay to disrupt a service and keep users from retrieving the information it provides.

      With that in mind, I don’t think it is forbidden to scrape pages that are disallowed in robots.txt, if the information you are looking for is on those pages. AFAIK, robots.txt is only for crawlers and spiders, so that they do not end up on irrelevant pages. (See: http://www.robotstxt.org/ and https://en.wikipedia.org/wiki/Robots.txt) But if you are just scraping a specific page for well-defined information, there is nothing wrong with ignoring it, as long as you are not disrupting the service.
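      Whatever stance you take on robots.txt, "not disrupting the service" is easy to enforce in code. A minimal throttling sketch – the class name and delay values here are made up for illustration:

      ```python
      import time

      class PoliteFetcher:
          """Enforce a minimum delay between consecutive fetches."""

          def __init__(self, min_delay=10.0):
              self.min_delay = min_delay
              self._last = None

          def wait(self):
              # Sleep just long enough to honor the delay, then record the time.
              if self._last is not None:
                  elapsed = time.monotonic() - self._last
                  if elapsed < self.min_delay:
                      time.sleep(self.min_delay - elapsed)
              self._last = time.monotonic()

      # Short delay only so the demo runs quickly; against a real site you
      # would use something like the Crawl-delay from robots.txt.
      fetcher = PoliteFetcher(min_delay=0.5)
      for url in ["https://example.com/a", "https://example.com/b"]:
          fetcher.wait()  # blocks so requests are at least min_delay apart
          # ... fetch url here ...
      ```

      Put the `wait()` call in front of every request and you get a hard upper bound on your request rate, regardless of how the rest of the scraper is structured.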

    2. 3

      Web scraping is gross but incredibly useful. I’ve put together a cookbook of sorts for scraping with node. Maybe it’ll be helpful to folks just starting out.