I propose a reverse search engine on robots.txt. First read the robots file and ONLY store data on blacklisted listings.
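That first step is easy to sketch. Here's a minimal version in plain Python, assuming a hand-rolled parser (`parse_disallowed` is a name I made up, not any standard API) that keeps only the Disallow rules, grouped by user-agent:

```python
def parse_disallowed(robots_txt: str) -> dict[str, list[str]]:
    """Map each user-agent in a robots.txt to the paths it is told not to crawl.

    Hypothetical helper for illustration; handles only the common
    User-agent / Disallow grouping, not the full spec.
    """
    rules: dict[str, list[str]] = {}
    agents: list[str] = []   # user-agents of the group being read
    in_group = False         # have we seen directives for `agents` yet?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_group:
                # A User-agent line after directives starts a new group.
                agents = []
                in_group = False
            agents.append(value)
            rules.setdefault(value, [])
        else:
            in_group = True
            if field == "disallow" and value:
                # An empty Disallow means "allow everything", so skip it.
                for agent in agents:
                    rules.setdefault(agent, []).append(value)
    return rules
```

Point that at any fetched robots.txt and you get exactly the "blacklist-only" data the reverse index would store, e.g. `parse_disallowed("User-agent: *\nDisallow: /private/")` gives `{"*": ["/private/"]}`.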
I was talking about this with my workmates the other day. I think you could crawl archive.org and possibly google cache to find “we couldn’t archive this because of robots.txt”.
Meh. I have entries in robots because they are redundant views and mostly a waste of time. Scrape it if you like, but that just means you’re more likely to miss important content. Whatevs. It’s your archive.
I had a similar thought. robots.txt is fine if it is there to improve the robots' work, rather than to hide something from the rest of the world.
I think that’s how most people treat it. Not as a cloaking device but as “hey Googlebot, here’s me explicitly telling you these URLs are low-value scutwork that isn’t worth your time and would add 30k trash pages to my index”. Google essentially rewards you for doing that, since having a large volume of “low value” pages in your index lowers your ranking.
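For instance, a site might ship a robots.txt along these lines to keep that kind of scutwork out of the index (the paths here are made up for illustration):

```
User-agent: *
Disallow: /search?    # faceted-search result pages (hypothetical path)
Disallow: /print/     # print-friendly duplicates of articles (hypothetical path)
Disallow: /tag/       # near-empty tag listing pages (hypothetical path)

Sitemap: https://example.com/sitemap.xml
```

Note the distinction: a `Disallow:` line with no value allows everything, while a bare `Disallow: /` blocks the whole site.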
One site I can think of whose bad usage of robots.txt caused some interesting historical stuff to be lost is the old kuro5hin.org (a community that one might roughly describe, at least during its good years, with the analogy kuro5hin:Slashdot::lobsters:HN). It had a robots.txt that, if I recall correctly, explicitly allowed Googlebot but disallowed everyone else. As a result it’s not in the Internet Archive’s Wayback Machine (which respects robots.txt), which has caused some fairly influential early-2000s essays to disappear from the internet.
I think it is sad that archive.org and, to a lesser extent, google cache etc. aren’t actually archiving the whole internet. It obviously isn’t that useful to archive utility sort of pages rather than actual content, but many sites “abuse” robots.txt because they are worried about copyright, worldwide access to their data, backups they don’t control, etc., which, rightly or wrongly, means that archive.org isn’t close to a perfect archive.
Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically. A dumb spider or crawler will hit every URL linked, and if a site allows users to activate a link that causes resource hogging or otherwise deletes/adds data, then a ROBOTS.TXT exclusion makes perfect sense while you fix your broken and idiotic configuration.
Another reason comes to mind: sites with a lot of procedurally generated URLs that don’t use many resources and are read-only. SCM web interfaces for example. robots.txt can suggest that spiders don’t waste their time and clutter their index by crawling them.
Of course, Archive Team sees robots.txt as an affront, so they’re not going to recognise that someone might be doing the spider a favour by using it. If they want to crawl, say, every conceivable permutation of a URL in the linux kernel’s cgit, more power to them.
I always figured it’d just be nice to let the botnet know where the non-content pages, like the login page, are.
But what if I want to search for the login page? I’ve actually searched for “site name login” for sites that didn’t have obvious enough login buttons.
Enough mistakes have been made at that point that changing your robots.txt is simply scratching the surface!
I hate generalization, and this is a perfect example of it. Our public source code management instance uses robots.txt to prevent bots from hammering large repos like Git or CPython. There are still many good uses of robots.txt, and saying a blanket NO to it is not a good change.