The whole advertising ecosystem is what killed the semantic (read: programmable) web. Not only that, but with middle-malware like Cloudflare, CSS-injected content, and an over-abundance of JavaScript, we stray further and further from the programmable web.
This flawed logic of “public data, but we want to choose who gets to see it” makes absolutely no sense and is incompatible with the whole idea of a public web.
I remember how easy it used to be to write up a web crawler that would notify me about some event on the web. Now you can’t even curl many modern pages, let alone scrape them, without investing hours of work reverse-engineering JavaScript-made “AJAX” requests and getting past the “Cloudflare and co” web-cancer.
Full disclosure - I work full time developing various web crawlers, and I do enjoy the challenge of this cat-and-mouse game. What bothers me, however, is the free-software and average-computer-user side of it - the programmable web is dying at a quick pace, and there’s no longer room for personal web-scraping scripts or web automation.
My personal scrapers are all done in headless Chrome these days for this exact reason. I don’t need them to run fast for personal stuff, and my target sites are always going to work in Chrome.
Yup, the Puppeteer community is really growing in that regard!
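For a personal monitor the whole thing is usually just a handful of lines. A rough sketch of what I mean (the URL and the selector below are made up - you’d swap in whatever page you’re actually watching):

```typescript
// Rough sketch of a personal "tell me when this changes" scraper using Puppeteer.
// The URL and the ".price" selector are placeholders, not a real target.
import puppeteer from "puppeteer";

let lastSeen = "";

async function check(): Promise<void> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // Let the site run its own JavaScript before reading the rendered content.
    await page.goto("https://example.com/some-page", { waitUntil: "networkidle2" });
    const text = await page.$eval(".price", (el) => el.textContent?.trim() ?? "");

    if (text !== lastSeen) {
      console.log(`Changed: "${lastSeen}" -> "${text}"`);
      lastSeen = text;
    }
  } finally {
    await browser.close();
  }
}

check().catch(console.error);
```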
However, the problem is that sharing such scripts is quite complicated: it means bundling a whole browser and a bunch of JavaScript hacks with your program, and there’s still a lot of uncertainty about how the content will be rendered given so many variables.
While blocking crawlers outright is an impossible feat, it is very easy to make something difficult to maintain, and for public free projects that is a big issue, as maintainers don’t grow on trees just yet.
All this unnecessary complexity makes me quite sad and leaves me wishing for some sort of web revolution.
You seem to especially dislike Cloudflare. Is this antipathy for them specifically? Or for all CDN-type services? Or something else?
My only stake in this is that I like the open web. I’m not affiliated with them in any way. I have also been frustrated by barriers to programming the web recently, but would not have implicated Cloudflare in any of my frustrations, so I’m curious.
I do have a thing against Cloudflare - well, against their “anti-bot” products.
They do more than CDN: they also offer “bot protection”, which really just requires JavaScript and does browser fingerprinting at whatever level the site admins configure. This broke many user scripts and monitors. It spawned a lot of solutions and fixes, but the problem is that this introduces absurd amounts of maintenance and packaging overhead - suddenly, sharing free software becomes so difficult and time-consuming. And it doesn’t really stop any bots [1].
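To give a concrete idea of the breakage: a monitor that used to be a single plain HTTP request now gets handed an interstitial JavaScript challenge instead of the page, and about all it can do is notice that it’s been blocked. A rough sketch using Node’s built-in fetch (the URL is hypothetical, and the status codes and challenge markers are approximate - they vary by site configuration and change over time):

```typescript
// Rough sketch of what a plain HTTP monitor runs into on a challenge-protected site.
// The URL is hypothetical and the challenge markers below are approximate.
async function fetchOrDetectChallenge(url: string): Promise<string> {
  const res = await fetch(url, { headers: { "User-Agent": "my-little-monitor/1.0" } });
  const body = await res.text();

  const looksLikeChallenge =
    (res.status === 403 || res.status === 503) &&
    (body.includes("Checking your browser") || body.includes("challenge-form"));

  if (looksLikeChallenge) {
    // No content without executing their JavaScript in a real(ish) browser.
    throw new Error(`Blocked by an anti-bot challenge (HTTP ${res.status})`);
  }
  return body;
}

fetchOrDetectChallenge("https://example.com/").then(
  (html) => console.log(`${html.length} bytes of real content`),
  (err) => console.error(err.message),
);
```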
I actually called out a Cloudflare panel at a conference not too long ago, and their response was that “it’s up to site administrators” - when they are clearly a for-profit company that is upselling and pushing this feature as hard as they can.
The majority of websites clearly don’t need this obtrusive bot protection, but in a world of corporate managers nobody cares - it’s a “plug it in just in case” sort of mentality.
Personally, I’ve stopped maintaining many small free scripts and monitors because it was just too much work. Suddenly your user’s IP geolocation matters, they need an up-to-date Node.js runtime or the newest Chromium browser… Is someone curling your page 5 times a day really an issue?
I wish a big antitrust hammer would drop down on this. “We want our data public and scraped by Google and co, embedded in Twitter and Facebook, but anyone else can fuck right off!” - disgusting.
1 - https://github.com/Anorov/cloudflare-scrape/
I was completely missing the bot protection angle. When I thought of them, I was only thinking of the CDN stuff they do and of their DNS. And while I find the latter unappealing personally, it is very easy not to use it, so I was kind of scratching my head trying to figure out why they’d aggravate someone so much.
I guess I’ve been driven to richer scraping stacks (usually headless Chrome recently, sadly) by so many common web frameworks’ anti-CSRF protections that I hadn’t even noticed their role in the whole thing.
Cloudflare specifically will hide anything resembling an email address (false positives more often than not) and require you to do a Google reCAPTCHA if you, for some odd reason, look at them funny while trying to access a site.
Thanks. That’s exactly the connection I was not making.
I remember all the semantic web hype from when I was a baby programmer new at a professional job. I rolled my eyes then, and I was mostly right. Mostly.
Stuff like knowledge graphs, hashtags, etc. that are used a lot today did come out of that though, so it wasn’t a total loss of good people’s time.
I think that the semweb stuff is the poster child for what happens when you come up with a good idea, make it too complicated, and then freak out when simpler approaches come along and in so doing doom yourself to irrelevance.
I was just looking at some IndieWeb stuff, Falcon to be specific, and the learning curve is huge for what are relatively simple things in practice. That page and some others I found demand a lot more effort than they probably should. It’s a noble effort, but it’s very hard to get people on board if they don’t have a lot of energy to apply.
Agree that some users won’t want to get stuck into it - they’re likely also the users who won’t be writing raw HTML for their sites.
So what we’re doing for them is getting Microformats2 support directly into themes for WordPress, Jekyll, Hugo, etc., so anyone using them can benefit without necessarily doing any work!
(originally posted at https://www.jvt.me/mf2/2020/01/an68d/ and hand syndicated)
What simpler approaches came along?
knowledge graphs, hashtags, etc
Oh, I thought there was some other standard I wasn’t aware of.
http://microformats.org/ — embed structured data in normal HTML with minimal fuss.
In some ways, HTML is the simpler SGML.
RSS is a way simpler RDF.
As shared in a separate comment in the thread, there’s the Microformats2 specification (see https://microformats.io), which reduces the duplication seen in some of the other Semantic Web formats.
You can see an example of a parsing result at http://php.microformats.io/?url=https%3A%2F%2Fwww.jvt.me%2Fmf2%2F2020%2F01%2F2mylg%2F - it produces a standardised structure for the resulting JSON, which makes interconnectivity much simpler.
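If you want to poke at the parsing locally, there are mf2 parsers for most languages; here’s a rough sketch assuming the microformats-parser npm package (the sample h-entry is made up) showing how plain HTML with a few class names turns into that standard JSON shape:

```typescript
// Rough sketch: parse a tiny Microformats2 h-entry into the standard mf2 JSON.
// Assumes the "microformats-parser" npm package; other languages have equivalents.
import { mf2 } from "microformats-parser";

const html = `
  <article class="h-entry">
    <h1 class="p-name">What killed the semantic web?</h1>
    <a class="u-url" href="/posts/semantic-web">permalink</a>
    <time class="dt-published" datetime="2020-01-15">15 Jan 2020</time>
    <div class="e-content"><p>Plain HTML plus a few class names.</p></div>
  </article>
`;

const parsed = mf2(html, { baseUrl: "https://example.com" });

// parsed.items[0] is roughly:
// { type: ["h-entry"],
//   properties: { name: [...], url: [...], published: [...], content: [...] } }
console.log(JSON.stringify(parsed.items[0], null, 2));
```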
Us folks in the IndieWeb (https://indieweb.org) have been using it for some time with great benefit, but it’s always great to hear others’ reactions too!
(originally posted at https://www.jvt.me/mf2/2020/01/tw4ug/ and hand syndicated)
Yeah that checks out.
It was eaten by machine learning.
The semantic web was all about adding meta-information to the information on the Web so that it would be usable by machines. That turned out to be a lot of work, and then AI started making a comeback, and it occurred to people that it would be much easier to just make the machines learn to interpret the information instead of us interpreting it for them. So all the effort started to go there instead of to the semantic web.
TFA is not wrong. It correctly lists the many reasons why the semantic web never really got off the ground, and it is right to point out that it was always an idealistic dream. But it completely misses that it was the rise of machine learning that killed the dream.
Eh, I’d disagree with you here. ML’s presence on the web is absurdly non-existent. Even Google’s crawlers are filled with regex patterns rather than ML code.
I don’t think ML is involved in this at all, except maybe for web scraping, but that only came about because the programmable web was never adopted and, if anything, the web became less machine-accessible than it was in 2002.
I’m of the impression that nothing happened for the simple reason that the only players who could have made it happen had a commercial incentive not to do so.