While this is interesting, a large part of it is due to the publication of the URL by WordPress and the author. I uploaded a new blog template to my site that isn't linked anywhere. As one might expect, the only visitors are people I directly gave the link to.
Also of interest: I gave the link to one person via Google Hangouts, and Google appears not to have indexed it through that link. (There are only 3 IPs in my access log: myself, my phone, and one other that I assume is my friend, because I know he read the content.)
The most aggressive OVH (and AWS, etc.) bots seem to be the work of people relentlessly crawling HN. Links that don’t make the front page don’t get pummeled nearly as much. Appearing on other sites (Twitter, reddit, ye old lobsters) has a lesser effect.
I’ve been speculating about the precise nature of the entrepreneurial enterprise that requires scraping every HN link hundreds of times, but have so far come up empty. I imagine it’s somebody repackaging and re-wrapping content, but it seems too aggressive. Or they’re just morons.
I’ve been speculating about the precise nature of the entrepreneurial enterprise that requires scraping every HN link hundreds of times, but have so far come up empty.
This made me think of someone trying, very poorly, to generate their own RSS feed.
Then maybe their tiny RSS reader (or whatever) consumes it.
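For what it's worth, that pipeline would be trivial to build badly: scrape some links, emit a minimal RSS 2.0 feed, point your reader at it. A sketch of the feed-generation half, using only Python's stdlib (the item data is whatever the hypothetical scraper collected):

```python
import xml.etree.ElementTree as ET

def build_rss(feed_title, feed_link, items):
    """Build a minimal RSS 2.0 feed from (title, url) pairs scraped elsewhere."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    for item_title, item_url in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_url
    return ET.tostring(rss, encoding="unicode")

feed = build_rss("My scraped feed", "http://example.com/feed",
                 [("Some HN link", "http://example.com/post")])
```

Note that nothing here deduplicates or rate-limits the scraping side, which is exactly how you end up hammering every target URL hundreds of times.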
Or they’re just morons.
This probably has a higher likelihood of being true than my above hypothesis.
Doesn’t WordPress auto-generate the sitemap and robots.txt file when you publish new content?
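It does serve a virtual robots.txt by default when no physical file exists (and recent versions also expose a core sitemap at /wp-sitemap.xml). The exact contents vary by version and plugins, but the default response looks roughly like:

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
```

So simply publishing a post on a WordPress site gives crawlers machine-readable discovery hints, which is consistent with the parent's observation that an unlinked, unpublished template gets no traffic.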
This is probably partly due to what I perceive as a general move away from RSS/Atom feeds as a source of metadata. Instead, feeds are seen as merely one of many ways of finding out about new content, and systems rely on the actual URLs and their metadata to present that content – a move driven by social media outlets that consume URLs primarily as a direct input, like Facebook and Twitter. E.g. Apple’s new News app in iOS 9 uses Open Graph data in addition to feed data, and I was also part of introducing a similar change into Bloglovin when working there.
There are shared services like http://embed.ly/ that make it possible to share the metadata results between many consumers – but for bigger services the flexibility and cost-effectiveness of building one’s own is probably more beneficial.
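As a sketch of what consuming such a shared service looks like – assuming Embed.ly's public oEmbed endpoint, with a hypothetical API key – a client just builds a GET request for the page URL and reads back JSON metadata:

```python
from urllib.parse import urlencode

# Embed.ly's oEmbed endpoint (per their public API docs); the key is hypothetical.
EMBEDLY_OEMBED = "https://api.embed.ly/1/oembed"

def oembed_request_url(page_url, api_key):
    """Build the GET request URL asking Embed.ly for a page's metadata."""
    return EMBEDLY_OEMBED + "?" + urlencode({"key": api_key, "url": page_url})

request_url = oembed_request_url("http://example.com/post", "MY_API_KEY")
```

Fetching that URL returns oEmbed-style JSON (title, thumbnail URL, etc.), so many small consumers share one crawler instead of each re-scraping the page.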
I’ve used both approaches at startups I’ve worked at – at Flattr we used Embed.ly, and at Bloglovin I created the metadataparser that was mentioned in the article. Both times the initial and most apparent use case was to get improved images – as the web, especially after Pinterest gained popularity and smartphones became common, has become increasingly reliant on good images in the design of its UIs, and Open Graph metadata within the pages themselves is the way to go to find a good image representation of a URL.
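The core of that kind of extraction is small. A minimal sketch using only Python's stdlib html.parser (class and function names are mine for illustration, not Bloglovin's actual parser):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collect Open Graph <meta property="og:..." content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property", "")
        if prop.startswith("og:") and "content" in attrs:
            # First occurrence wins, matching how most consumers behave.
            self.og.setdefault(prop, attrs["content"])

def og_image(html):
    """Return the og:image URL for a page, or None if absent."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.og.get("og:image")
```

The hard part isn't this parsing step – it's the post-processing: picking the best image when several are declared, resolving relative URLs, and filtering out tracking pixels and placeholder graphics.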
Unless something like the IndieWeb gains huge traction, though, I don’t think the number of bots will explode the way the article author believes. For most small services and free/open source tools it will likely make more sense to use something like Embed.ly rather than to spend time extracting and analyzing all of the metadata oneself. To get good, high-quality results one doesn’t just have to extract the data – one also has to post-process it and filter it against certain rules, something I believe e.g. Embed.ly does, and probably does better than most will have time to do themselves.