1. 4

    Web scraping means extracting data from websites: fetching pages, selecting the relevant parts, and presenting them in a readable or parsable format.
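A minimal sketch of that idea using only the stdlib: the hardcoded HTML snippet stands in for a fetched page, and the parser pulls out just the link texts (the names and markup here are illustrative, not from any real site).

```python
from html.parser import HTMLParser

# "Scraping" = fetch HTML, then select the relevant parts.
# Here a hardcoded snippet stands in for an HTTP response.
HTML = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'

class LinkTextExtractor(HTMLParser):
    """Collects the text inside every <a> tag."""
    def __init__(self):
        super().__init__()
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links.append(data)

parser = LinkTextExtractor()
parser.feed(HTML)
print(parser.links)  # → ['First', 'Second']
```

In practice you would fetch the HTML with urllib.request or a third-party HTTP client instead of hardcoding it.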

    Web scraping is also often against the TOS of sites with valuable data. Be careful if you profit from it, and don't be surprised if you find your IPs blacklisted or throttled.

    General advice on scraping: if there is an API or licensing route, use it! Respect robots.txt – I am still waiting for the day a scraping tutorial mentions Python's stdlib urllib.robotparser, which makes honoring it trivial! Make sure your use of the data falls under fair use if the content is copyrighted. Finally, crawl at a non-disruptive rate, so you neither breach a contract (in the form of terms of use) nor commit a crime (as defined in the Computer Fraud and Abuse Act).
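Since urllib.robotparser came up: here is how trivial it is. Normally you would call set_url() and read() to fetch the live robots.txt; this sketch parses a hardcoded one so it runs offline (the user agent and rules are made up).

```python
from urllib.robotparser import RobotFileParser

# Parse a hardcoded robots.txt; for a real site you would instead do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/private/page"))  # False
print(rp.crawl_delay("MyScraper/1.0"))  # 10 -- also helps with the "non-disruptive rate" point
```

Checking can_fetch() before every request, and sleeping for crawl_delay() between requests, covers two of the points above in a few lines.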

    Some will insist legal agreements aren't usually given with sufficient notice to hold up in court, but clickwraps do, and some browsewraps will (they are evaluated individually in court). If your scraper uses credentials, you are in even more difficult legal territory (an automated tool masquerading as a user) and have almost certainly gone through a clickwrap TOS agreement. You should also be careful to avoid hot news misappropriation, wherein you scrape data and quickly provide it (or a competing service) for free or at a lower cost, thereby taking value from the original source.

    There’s also trespass to chattels and many other legal ways to be hoist with your own petard – even if you never profit from the scraping or use it strictly for metadata and analytics! Several cases every year refine our definitions of what’s allowed, but the tide has turned in favor of the content owners since the Facebook v. Power case in 2009, rightfully making many startups gun-shy when it comes to scraping.

    Further reading:

    1. 2

      Thanks for your insights on the legal perspective! I have never tried to monetize scraped content, and IANAL, so I did not mention the legal side of scraping. What do you think about the LinkedIn lawsuit in August? https://www.infoq.com/news/2017/08/linkedin-ruling-scraping

      I agree that official APIs should be used; that is why I mentioned it in the introduction. I also agree that services and their users should be respected. It is not okay to disrupt a service and keep users from retrieving the information it provides.

      With that in mind, I don’t think it is forbidden to scrape pages that are disallowed in robots.txt if the information you are looking for is on those pages. AFAIK, robots.txt is meant for crawlers and spiders, so that they do not end up on irrelevant pages. (See: http://www.robotstxt.org/ and https://en.wikipedia.org/wiki/Robots.txt) But if you are just scraping a specific page for well-defined information, there is nothing wrong with ignoring it, as long as you are not disrupting the service.

    1. 1

      Do you run your own, OP?

      1. 1

        Yes, on my local machine. :)

        1. 2

          I don’t get it. According to the searx home page:

          Users are neither tracked nor profiled.

          Since you’re probably the sole user of your local machine, aren’t all the aggregated search engines able to track you?

          1. 1

            No, because all session info is stripped from the requests. Only the IP address is visible, but it is not possible to profile users based on IPs. If someone wants to hide their IP, they can configure a proxy for searx or use it over Tor. Also, searx does not store session info and does not profile its users.
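An illustration of what "session info is stripped from the requests" could look like (this is not searx's actual code; the header list is an assumption): a metasearch proxy drops identifying headers before forwarding a query, so the upstream engine only ever sees the proxy's request.

```python
# Hypothetical sketch, NOT searx's real implementation: drop headers
# that could identify or track the user before forwarding upstream.
SESSION_HEADERS = {"cookie", "authorization", "referer", "user-agent", "x-forwarded-for"}

def strip_session_info(headers: dict) -> dict:
    """Return a copy of the headers without session-identifying ones."""
    return {k: v for k, v in headers.items() if k.lower() not in SESSION_HEADERS}

incoming = {
    "Cookie": "sid=abc123",
    "User-Agent": "Mozilla/5.0 ...",
    "Accept": "text/html",
}
print(strip_session_info(incoming))  # → {'Accept': 'text/html'}
```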

            1. 2

              it is not possible to profile users based on IPs

              Why not? I don’t think it would be difficult (especially for the likes of Google) to build profiles based on IP, or to differentiate searches from shared IPs, or to make best guesses (based on similar search terms using the same ISP for example) at who someone is if their dynamic IP changes.

              If it isn’t possible, and maintaining session info is the only way Google can profile you, why not just use a browser addon to delete any cookies and rewrite the URLs to remove the unique identifiers?
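The "rewrite the URLs" part of that addon idea can be sketched in a few lines of stdlib Python: strip known tracking parameters from a URL's query string (the parameter list here is illustrative, not exhaustive).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative (not complete) set of common tracking/identifier params.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}

def clean_url(url: str) -> str:
    """Drop tracking parameters from a URL, keeping the rest intact."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(clean_url("https://example.com/search?q=searx&utm_source=news&gclid=XYZ"))
# → https://example.com/search?q=searx
```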

              If someone wants to hide his/her IP, he/she can configure proxy for searx or use it over Tor.

              Using Tor could help, but how many searx setups are configured to do this? You wouldn’t want to be the only one hitting certain APIs over Tor, as then any request to those APIs coming from a Tor exit is almost guaranteed to be you.