1. 33
  1.  

  2. 21

    The problem with the Semantic Web was never finding the right format, or the right schema, or the right ontology language. It was always the human production of metadata, which is bound to be inaccurate, insufficient, subjective, and shoddy, when not outright lies.

    1. 6

      Yet, we have to create structured data all the time. Shouldn’t we at least try to make it easier to create high quality structured data?

      1. 3

        This is probably a doomed approach. During the standardization of HTML5 it was already known, and set down as a principle, that invisible metadata won’t be reliable. People mostly create content for other people, which means that content for machines can easily be left outdated, invalid, or intentionally spammy. As long as it doesn’t directly impact human consumers, there’s no incentive to care about the accuracy of the metadata.

        Machine-learning models, especially ones like GPT-3, have managed to demonstrate a pretty good understanding of the world based on unstructured plain text. That’s as close as we ever got to the databases of the Semantic Web, but it has been built from random garbage instead of curated ontologies.

        1. 3

          I am not sure who started spreading this rumour that the Semantic Web is just about metadata, or that we need people to do billions of hours of manual data entry to “build” the SemWeb. At the very basic level, let’s consider JSON versus JSON-LD:

          1. JSON is a format based on a document model. When combining N JSON responses from N microservices, you face the problem of merging N trees, for which CS doesn’t have a single “proper” method. Merging N graphs, by contrast, is a trivial pair of operations: a union of the N node sets and a union of the N edge sets. JSON-LD is a standardized format based on a graph data (or, rather, information) model (though marshalled to a document data format, in a standard way, of course). If someone is building a “knowledge graph” or a “data mesh” (a mesh is a graph where most nodes have a similar degree), I don’t see how they can claim to do that without using some graph data format.
          2. The core building block of SemWeb is RDF, and the two core building blocks of RDF are: a graph data model (see point above) and the use of URIs as identifiers. Not u64’s, not UUIDs. This includes keys. Instead of saying "ssn_no": 123, you should say "https://irs.gov/ns/core#ssn": 123 (as in XML, you can simply use a prefix like “irs:ssn”), and now every microservice in your system – and any microservice from another company – has unambiguous knowledge of what the value represents. I honestly don’t see how this is less reliable than what we have today.

          And finally: garbage in – garbage out.
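
          The two points above can be sketched in a few lines of Python (a toy illustration; the irs.gov namespace and the sample triples are just the example identifiers from this thread, not a real vocabulary):

```python
# Merging N JSON trees is ambiguous; merging N RDF-style graphs is a
# set union of triples. Prefix expansion is roughly what a JSON-LD
# @context does for compact IRIs.

CONTEXT = {"irs": "https://irs.gov/ns/core#"}  # prefix -> namespace

def expand(term, context=CONTEXT):
    """Expand a prefixed name like 'irs:ssn' to a full URI."""
    prefix, _, local = term.partition(":")
    return context[prefix] + local if prefix in context else term

# Each microservice response as a set of (subject, predicate, object) triples.
service_a = {("urn:example:alice", expand("irs:ssn"), 123)}
service_b = {("urn:example:alice", "https://schema.org/name", "Alice")}

# Merging is just set union -- no tree-merge ambiguity.
merged = service_a | service_b
```

          Any number of such responses merge the same way, and the expanded keys stay unambiguous across services.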

          1. 4

            AFAIK the Semantic Web was supposed to be more about the Web, about connecting and aggregating all of human knowledge from arbitrary sources anywhere on the Web, than about being a more verbose flavor of a JSON API.

            But even for plain precise data exchange, it’s not that useful. Your ssn example, with a simple clean datatype to merge on, is the happy path of the most trivial scenario. It’s such a minor problem that it’s easier not to get RDF involved at all.

            When merging data where accuracy matters, the real hard problems are mostly around data quality and deep mismatches between fields (e.g. maybe someone fudged the ssn field to also include passport numbers to internationalize the field). Names and addresses are painful to merge for deeper structural and cultural problems than having uniquely identified field names. Whenever I worked with data, field identifiers were the least of my problems. I’ve worked with data where a “country” field contained “airport” due to contractual issues. I’ve worked with data where latitude and longitude were off, depending on the year they were measured. Merging data from different APIs still requires bespoke integrations due to different authentication systems and API designs, and these will remain fragmented for various technical and business reasons. RDF is a complex solution to a trivial issue that is only a tiny fraction of the actual problems around data aggregation.

            And large scale ML models have shown you can have garbage in — useful data out.

            1. 2

              It is true that there were many “blue sky” visions for the Semantic Web. SemWeb experts understood that many old and often conflicting ideas were hindering adoption and proposed “Linked Data” as a subset of the Semantic Web that would mostly focus on using a graph data model, well-curated vocabularies of terms (https://irs.gov/ns/core# in my example, with all its classes and properties, would be considered a vocabulary), and maybe REST. Sadly, it didn’t go far either (the “5-star linked open data” effort was quite well received by librarians, bioinformatics experts, and government officials interested in publishing open data – but that was not enough). I have my own opinion, and it’s basically that we should have pushed for Linked REST APIs rather than Linked Open Data (the word “open” further scared countless enterprise people).

              I sometimes go back to TimBL’s 1994 CERN talk and the SciAm 2001 paper he co-authored in order to “reset” some preconceived notions about SemWeb. In general, I tend to tell developers and decision-makers, where we introduce RDF, that there were many ideas underpinning SemWeb, many conflicting visions for it, and many half-baked technologies that became largely outdated (one could say that almost none of the tools or SDKs are “cloud-native”), and that we should adopt the wise underpinning ideas (graph data model, URIs everywhere), forget about most “visions”, and be open to rewriting or repurposing software tools. Another thing I tend to say is that while many of the SemWeb ideas don’t quite work at Web scale, they can totally work at intranet scale, over 1000 services or so, especially with good governance. From the 1994 talk, I see that TimBL wanted to make the Web a good place for machine-to-machine (M2M) communication, as good as the Web was for people; I didn’t see a specific call to rewrite the human segment of the internet for machine consumption. It would be silly to ask for every tweet to be rewritten as Description Logic definitions (not to mention that I am quite certain most of them would represent logical falsehoods individually). I also see in that talk that Tim was talking about real-world things being connected on the Web. I think we have achieved both of those visions via REST/microservices on one hand, and IoT on the other (how satisfactory the solutions are is another question altogether).

              Now, to the most scathing piece of your criticism – the accuracy of the information and other problems. I wholeheartedly agree with you. However, I see RDF (without anything else that SemWeb brings) as a stepping stone to solving those problems. For example, we generate documentation for those vocabularies, and on the intranet we’ve seen limited success enforcing rules (across all services reusing the common properties) along the lines of “If you use property X, the value MUST conform to these rules. If you have a property with values that don’t fit the rules, you MUST either convert the value first or create a new property”, which helped us spot systems where new properties similar to existing ones had to be created. At least for me, an RDF vocabulary is the first step toward bringing a bit of clarity, making it easier to deal with the issues you’ve outlined. Again, I totally agree that RDF or SemWeb is not a solution to all problems, just a start.

              And for ML, I agree that the achievements have been quite impressive, but I would rather describe it as “lots of useful data, hidden among even more garbage (noise), lots of hard work ==> mostly useful data out in most circumstances”.

              1. 3

                Circling back to the first comment, you can have a very nice spec with “If you use property X, the value MUST conform to these rules”, but making the “MUST” happen is the hardest part, not the spec itself. It’s going to be <input type=text> somewhere. What you’ll get there is inevitably going to depend on who’s using that input box. It may be end users, who won’t understand or care about the requirements (and the business decision will always favor taking poor data over causing friction for customers). It may be some employee who will need to get work done involving some data not covered by the spec, so they’ll stuff it with intentionally-noncompliant data (that’s always easier than getting software or specs changed).

                1. 2

                  What we practically do is define SHACL shapes on classes and properties (shapes are similar to OpenAPI Schemas, but can be reused across services because they are resources with a URI, just like everything else) and check the data against those shapes when you do a POST or PUT on a REST endpoint. If the shape check fails, we return 400 (though 422 is also possible, to denote that the request is made correctly and in the right format but the problem is in the contents). There are many people trying to automatically convert those shapes into user-facing forms, but we think too much flexibility is needed on the frontend to make it happen. Judging by the state of practice, I think GraphQL Federation is running into the same problem of reusing schemas across APIs.

                  Also, SHACL has a mode of not rejecting non-conformant data but instead producing reports of shape mismatch, which is useful if the data comes from a place we have limited control over. Of course, if the employee from your example just writes a small Python script to produce bad data, there is nothing to stop them (which is why I am against what is known in SemWeb as “data dumps”) as you pointed out.
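
                  As a toy illustration (not real SHACL – just the shape-check-on-write idea, with a hypothetical shape for the irs:ssn property from earlier in the thread):

```python
# Minimal sketch: validate an incoming resource against a reusable
# "shape" and let the violation report drive a 400/422-style response.
# The shape dict stands in for a real SHACL shape identified by a URI.

SSN_SHAPE = {  # hypothetical shape for the irs:ssn property
    "path": "https://irs.gov/ns/core#ssn",
    "datatype": int,
    "min_inclusive": 0,
}

def check_shape(resource, shape):
    """Return a list of violation reports (empty means conformant)."""
    reports = []
    value = resource.get(shape["path"])
    if not isinstance(value, shape["datatype"]):
        reports.append(f"{shape['path']}: expected {shape['datatype'].__name__}")
    elif value < shape["min_inclusive"]:
        reports.append(f"{shape['path']}: below {shape['min_inclusive']}")
    return reports

def handle_post(resource):
    """422 with a report on shape mismatch, 201 otherwise."""
    reports = check_shape(resource, SSN_SHAPE)
    return (422, reports) if reports else (201, [])
```

                  A real implementation would load the shape from its URI and run a SHACL engine; the point is only that the check happens on the write path and the violation report drives the response.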

            2. 1

              Here’s my stupid question. Okay, so identifiers are URIs, not u64s or UUIDs. Except, “urn:uuid:00000000-0000-0000-0000-000000000000” is a URI. Instead of that, you have “https://irs.gov/ns/core#ssn”. Great, that seems more substantial. But now your URI scheme is http, which means that this resource should have some relationship to the 9 HTTP methods. What is that supposed to be? What are you supposed to get if you GET “https://irs.gov/ns/core#ssn”? If you POST to it? Are there standard answers to these questions or are they still getting worked out?

              1. 3

                Very interesting question! Initially, only URLs were allowed in RDF 1.0, AFAIK. Later, in RDF 1.1 (the current version), the scope was widened to accept URIs. There are two kinds of URIs: URLs and URNs. URIs only identify resources uniquely, while URLs also specify how to locate them. In general, we consider resources identified by a URL dereferenceable (meaning that the URL contains enough information to retrieve the resource, and that the URL hosts the representation of that exact resource, not some unrelated resource like an HTML landing page), while resources identified by other URIs are not. I say “in general” because one may write a dereferencing service for the urn:isbn: scheme, while some http: URLs may return 404s or NXDOMAIN errors. Indeed, there is this specific uuid scheme of URNs, which I see as a (sometimes useful) escape hatch. But I would generally use URNs only as a last resort.

                For working with the resources themselves, the common method is GET to dereference, and OPTIONS (cf. LDP) to discover what methods the server supports and how to invoke them. But in general, you are supposed to follow the HTTP spec as much as possible, thus SemWeb servers must in general support at least the GET and HEAD methods (cf. the spec). The LDP mentioned before is the basis of the Solid project promoted by TimBL. Sadly, there is no single set of rules, and alternatives to LDP exist, which fragments the landscape quite a bit.

                1. 2

                  The scheme is meaningless. The URL doesn’t even have to exist. It’s just a long string. URI-based namespacing is a completely separate concept from Fielding’s REST.

                  1. 1

                    That seems like a missed opportunity. If the point of the URL is just to be a long, distinct string, then what is the difference between that and a UUID? Like my gut expectation would be that if you go to the URL that names the predicate, then that URL should return some way of validating the triple.

                    1. 1

                      It’s a typical overthinking of the early 2000’s W3C. A URI can be ✨ANYTHING✨, so you don’t want to limit the infinite unfettered extensibility to some dull HTTP API that may not be the best thing 100 years into the future. If you wanted to validate it, then someone could publish an RDF triple with a URI describing another URI that tells you how to validate this URI.

          2. 5

            Who cares. Reality is lossy as hell, and the assumption or claim that the Semantic Web was any different is of course silly, but what we missed out on is a super portable data format that integrates for free and does not require an overly complicated API to get what you want/need.

            Some people loved to focus on some grand unifying academic whatever thing, and they IMO ruined it for everybody (both critics and enthusiasts – I say that as someone who loved the whole first-order logic stuff of OWL etc.), but the programming community completely overlooked the practical potential and opted for harder-to-use, “we’re more in control” types of tools. Yes, I am still mourning.

            1. 2

              I have very clear memories of the early 2000s and the surprisingly fast switch from “information architecture” and associated top-down formal processes like Semantic Web technologies, to bottom-up crowdsourced options like tagging as part of “Web 2.0”, to trying to get semantics back with microformats, to people just largely giving up.

              It’s also somewhat amusing to think back on how the RDF-based variants of the RSS spec may have been the literal peak of adoption of Semantic Web formats.

              1. 1

                Yah. Tragedy of the commons is /the/ problem and if you solve the incentives then the rest follows. Still. Using the right format, logic, etc or at least having an idea how to evolve it … that stuff is also important.

              2. 6

                One of the big things missing, IMO, of the Semantic Web is the notion of context. A fact, any fact, is going to be true in some contexts, and false in other contexts. The classic example of this is something like “parallel lines do not intersect” which is true in flat space (2D or 3D), but not (for example) on the surface of a sphere.

                The knowledge bases I worked with (briefly) had encoded facts like “Tim Cook is the CEO of Apple”. But of course, that is only true for a certain context of time. Before that it was Steve Jobs, from 1997 to 2011. But the dumps generated from Wikipedia metadata didn’t really have that with much consistency, nor any means to add and maintain context easily.

                Context in general is needed all over the place for reasoning:

                • fictional contexts, such as Star Trek or 18th century novels
                • hypothetical contexts, such as: What if I didn’t go to the store yesterday, would we have run out of milk? … and … What if the user doesn’t have a laptop, only a phone, is the website still usable?
                • time, place, social group, etc.

                A bare fact is only useful insofar as you know what contexts it applies to.

                I don’t know a good way to represent this; a graph may not be ideal. Many facts share a context, too. In 2003, Steve Jobs was an employee at Apple. Ditto for Jony Ive in 2003. And in 2004, etc.
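
                One crude way to sketch the temporal slice of this (a toy model, not a proposed RDF extension) is to qualify each fact with a validity interval:

```python
# Temporally qualified facts: each assertion carries the interval in
# which it holds, so "CEO of Apple" can be answered for a given year.

facts = [
    # (subject, predicate, object, valid_from, valid_to)  -- years
    ("Steve Jobs", "ceo-of", "Apple", 1997, 2011),
    ("Tim Cook", "ceo-of", "Apple", 2011, None),  # None = still true
]

def query(predicate, obj, year):
    """Who stood in this relation to obj during the given year?"""
    return [s for (s, p, o, t0, t1) in facts
            if p == predicate and o == obj
            and t0 <= year and (t1 is None or year < t1)]
```

                Fictional and hypothetical contexts are harder, because the qualifier is no longer a simple interval you can compare against.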

                How do our own brains organize these facts and the contexts they go with?

                1. 2

                  There was some work on context. SPARQL lets you say what context you want to query, for example. I forget what it’s called there…

                  n3 also had quoting. It was kind of a mess, though.

                  You’ve seen Guha’s thesis and cyc, yes? Fun stuff.

                  But as noted above, machine learning (GPT-3) has pretty much eclipsed symbolic reasoning on the one side. On the other, we have the stuff that’s no longer considered AI: type systems and SQL.

                  1. 2

                    I’ve read a bit about Cyc and OpenCyc, yes. I haven’t read the book Building Large Knowledge-Based Systems by Lenat and Guha though.

                    I haven’t given up on the idea of probabilistic symbolic reasoning, but I realized I’m in the minority here.

                    I still imagine a system where, for example, you receive a document (such as news article) and it is translated into a symbolic representation of the facts asserted in the document. Using existing knowledge it can assign probabilistic truth values to the statements within the article. And furthermore be able to precisely explain the reasoning behind everything, because it could all be traced back to prior facts and assertions in the database.

                    I can’t help but think such a system ought to be more precise in reasoning as well as needing fewer compute resources. And being able to scale to much larger and more complicated reasoning tasks.
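
                    A minimal sketch of that idea (assuming naive independence between supporting assertions, which a real system would not): each assertion stores a probability and the prior assertions it was derived from, so any conclusion can be explained by walking back through its support.

```python
# Toy traceable-reasoning store: assertions carry a probability and the
# prior assertions they were derived from.

assertions = {}  # id -> (statement, probability, supports)

def assert_fact(aid, statement, prob, supports=()):
    # Naive independence assumption: derived probability is the product
    # of the supports' probabilities times the rule's own confidence.
    for s in supports:
        prob *= assertions[s][1]
    assertions[aid] = (statement, prob, tuple(supports))

def explain(aid, depth=0):
    """Trace a conclusion back to its supporting assertions."""
    stmt, prob, supports = assertions[aid]
    lines = [f"{'  ' * depth}{stmt} (p={prob:.2f})"]
    for s in supports:
        lines.extend(explain(s, depth + 1))
    return lines

assert_fact("f1", "article says X happened", 0.9)
assert_fact("f2", "source is usually reliable", 0.8)
assert_fact("c1", "X probably happened", 1.0, supports=("f1", "f2"))
```

                    The explanation is exactly the provenance chain, which is the part large ML models cannot give you.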

                2. 5

                  Hiya! Your friendly neighborhood graph database author here, here to fill in a couple comments on what I’ve experienced in the semantic web world over the years.

                  First off, really well done article. “This is why OWL is such a bear” made me laugh a hearty D&D laugh. To some points though:

                  • I agree that “not having a Unique name assumption is a mistake” – very strongly. It’s why I generally recommend that everything in the world starts with a random UUID (forcing a unique name) and we only then start to explicitly say what, in the devil, this entity is; we can add URIs, other synonymous UUIDs, etc, but we do so explicitly. This has a large operational cost, however. If I have two datasets UUID-A is-a person; UUID-A born-on 1961-08-04 and UUID-B is-a POTUS; UUID-B name "Barack Obama" and then I add the link UUID-A sameas UUID-B – querying between equivalent IDs to find, say, birthdays of presidents, requires either a complete inference step on every traversal (to find sets of equivalent IDs) or a potentially unbounded reindexing operation on the addition of sameas links. You can see why this gets hairy on the DB side.
                  • I disagree that the Open World Assumption is a ‘serious mistake’ – I’m more neutral on it. (I agree that having no uniqueness and an open world assumption together is bad.) I see it as kind of a “what sort of query do you want?” question – it’s the two sides of the incompleteness theorem, and in some cases you’d want one or the other for your reasoning. Either my reply is consistent, along with inferences, but it may be incomplete – or my response is complete (total over the data I have), but existing data takes precedence, and so I may be slightly inconsistent. I’ve always imagined that, with a really good inference engine, I could flip this assumption at will at query time (and it would infer things somewhat differently as a result, particularly when it comes to universal/existential quantifiers) – but in general, I still think the default state of that switch should probably lean toward the OWA.
                  • OWL is a mess. It’s well-intentioned – you want to be able to write inference rules in the schema data itself, I think we’d all agree – but the semantics of OWL (so, meta-semantics) are not at all friendly or intuitive. A stripped down OWL like you’d suggest (your subclass example, simply complaining, is a good one) would probably be an open opportunity. :)
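
                  The operational cost described in the first bullet shows up even in a toy version of the sameas bookkeeping – a union-find over the IDs, with every query having to gather facts from the whole equivalence class (identifiers as in the bullet’s example):

```python
# Union-find over IDs so that querying one ID sees facts attached to
# any equivalent ID. This is the per-traversal inference step.

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def sameas(a, b):
    parent[find(a)] = find(b)

facts = {
    "UUID-A": {"is-a": "person", "born-on": "1961-08-04"},
    "UUID-B": {"is-a": "POTUS", "name": "Barack Obama"},
}

def merged_view(entity):
    """All facts attached to any ID equivalent to `entity`."""
    root = find(entity)
    view = {}
    for uid, props in facts.items():
        if find(uid) == root:
            view.update(props)
    return view

sameas("UUID-A", "UUID-B")
```

                  The merged_view scan is the inference step on every traversal; avoiding it means reindexing facts whenever a sameas link lands, which is the unbounded reindexing cost.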

                  But OWL being a mess is farther down the iceberg. RDF itself is a mess. You started talking about it from a format perspective, but there’s a big content gap too between RDF and JSON. If I could write an RDF 2.0 spec tomorrow, I’d take the 1.1 spec, evaluate if there’s anything I’d particularly like to change or rip out (probably a few things around object value typing and defining ‘subgraphs’ – more on that in a minute) and, in the name of sheer pragmatism, add only the following things:

                  1. An index field. I think JSON got right the minimal set of types you need to get by – numbers (hopefully real ints and floats as opposed to JS “numbers”, though), bools, strings, maps (objects), slices (arrays) and… yeah, that’s about it. Of all these, the most painful in RDF is the array. To have an ordered set of objects, i.e., a list, in RDF, the items act as a linked list, with next and prev links. This is hella inefficient and just weird for most use cases, and simply having an optional index field to set an ordering for a given predicate would be amazing. If it were:
                  my-grocery-list has-item [1] milk
                  my-grocery-list has-item [0] apples
                  my-grocery-list has-item [2] bread
                  

                  instead of

                  my-grocery-list has-item milk
                  my-grocery-list has-item apples
                  my-grocery-list has-item bread
                  apples next-item milk
                  milk next-item bread
                  

                  I’d be a happy camper
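
                  A small sketch of what the two encodings cost a consumer (the index-field form is the hypothetical RDF 2.0 feature above, not anything in RDF 1.1):

```python
# Reconstructing order from next-item links versus sorting by an
# explicit index field, using the grocery-list example.

# Linked-list encoding (roughly how RDF collections behave)
items = {"apples", "milk", "bread"}
next_item = {"apples": "milk", "milk": "bread"}

def from_linked_list(items, next_item):
    """Find the head (no incoming link), then chase next pointers."""
    head = (items - set(next_item.values())).pop()
    order, cur = [], head
    while cur is not None:
        order.append(cur)
        cur = next_item.get(cur)
    return order

# Hypothetical index-field encoding: one (item, index) pair per triple
indexed = [("milk", 1), ("apples", 0), ("bread", 2)]

def from_index(indexed):
    return [item for item, _ in sorted(indexed, key=lambda t: t[1])]
```

                  The linked-list decode has to find the head and chase pointers; the index form is a plain sort.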

                  2. Arbitrary KV pairs on links. Neo4j folks love to talk about being a “property graph” and they have half of a point. You probably don’t want arbitrary KV pairs on your nodes – this provides two ways of doing the same thing, and one is less powerful. Sure, I could have a node /en/barack_obama{name: "Barack Obama"} but why not just make that a normal, everyday name edge pointing to a literal string, i.e. /en/barack_obama name "Barack Obama"? It’s kinda what we’re already building for.

                  KVs on edges however, I can get behind. Fun fact, it’s harder than you’d think to represent an arbitrary filesystem in RDF, if you take into account hard links (without adding loads of indirection). Try it:

                  mkdir foo
                  cd foo
                  touch bar
                  ln bar baz
                  

                  This graph is, in naive RDF:

                  inode-for-foo dirent inode-for-bar
                  inode-for-foo dirent inode-for-bar
                  

                  Twice! Sure, I could use a blank node thing:

                  inode-for-foo has-dirent _A
                  inode-for-foo has-dirent _B
                  _A name "bar"
                  _A inode inode-for-bar
                  _B name "baz"
                  _B inode inode-for-bar
                  

                  And that would work, but now all my links are two-hop has-dirent->inode instead of the direct contains – anyway, imagine the world where I could annotate those links and get the best of both:

                  inode-for-foo dirent {name: "bar"} inode-for-bar
                  inode-for-foo dirent {name: "baz"} inode-for-bar
                  

                  This helps many other scenarios (for what we called “CVTs” back in the day) like marriages and such.
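
                  A sketch of the annotated-edge version (quads of subject, predicate, object, attrs – hypothetical syntax, not RDF):

```python
# Storing each link as (subject, predicate, object, attrs) keeps the
# two hard links to the same inode distinct without blank-node
# indirection.

edges = [
    ("inode-for-foo", "dirent", "inode-for-bar", {"name": "bar"}),
    ("inode-for-foo", "dirent", "inode-for-bar", {"name": "baz"}),
]

def names_of(target, edges):
    """All directory-entry names pointing at an inode."""
    return sorted(a["name"] for (_, p, o, a) in edges
                  if p == "dirent" and o == target)
```

                  The links stay direct, one hop, and the annotation lives on the edge where it belongs.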

                  3. Optionally, canonical IDs for links themselves – this is an alternative to the arbitrary KV pairs on links that can also have interesting properties. Sure, there’s RDFS and RDF* and similar proposals – and actually adopting one of those might do the trick. It’s a discussion to see whether it’s better or worse to be able to annotate links with links as opposed to just adding KV fields – I’m kinda six-of-one-half-dozen-of-the-other about it, but I can see value both ways. The question comes down to: are the annotations inherent to the identity of the link (like in the dirent case), or are they something that, in kind of an open-world way, can be added later? (Like, perhaps, I know that this marriage link between two people occurred, but I don’t know when – if I find out when they got married later, do I rewrite the link entirely or annotate the link?)

                  — Those would help a heck of a lot.

                  I’ve got more to say on context: ansible-rs makes a very good observation in this comment but I’ll let that simmer a second.

                  1. 3

                    On our podcast, The Technium, we covered the Semantic Web as a retro-future episode [0]. It was a neat trip back to the early 2000s. It wasn’t a bad idea, per se, but it depended on humans doing-the-right-thing for markup and on the assumption that classifying things is easy. Turns out neither is true. In addition, the complexity of the spec really didn’t help those that wanted to adopt its practices. However, there are bits and pieces of good ideas in there, and some of it lives on in the web today. You just have to dig a little to see them. Metadata on websites for fb/twitter/google cards, RDF triples for database storage in Datomic, and knowledge-base-powered searches all come to mind.

                    [0] https://youtu.be/bjn5jSemPws

                    1. 3

                      That this doesn’t mention the hell that was/is RDF/XML troubles me deeply

                      1. 3

                        The first problem with RDF/XML is that people who understand RDF will tell scared newcomers that it’s just XML. Guilty of that myself on projects where developers were demanding JSON at any cost. You are setting yourself up for a world of hurt if you try to work with an RDF/XML graph as an XML tree. RDF/XML is a graph first and XML distant second. Same goes for treating JSON-LD graph as a JSON document (tree).

                      2. 2

                        Is there somewhere I can read this that’s not GitHub?

                        1. 3

                          Vim? Just clone it.

                          1. 1

                            I’m reading this on my phone. Oh well.

                          2. 1

                            It’s a peeve of mine too. The line length is way too long, Markdown isn’t very well suited for blogging (too limiting), and seeing all the distracting Microsoft GitHub UI elements doesn’t help a reader – which is further compounded by Firefox’s Reader Mode not being available because it’s not a blog (but should be).

                            1. 1

                              Reader Mode only works on blogs?! This is a weird limitation (and how does it know it’s a blog??)

                              I use the EasyReader add-on for Chrome, works ok for this.

                              1. 6

                                https://github.com/mozilla/readability/blob/master/Readability-readerable.js

                                This is what Firefox uses to determine if a page is Reader Mode ready. It looks like it wants some HTML5 semantic elements so it can be fairly sure the page is an article and not an e-commerce site, web app, etc. Microsoft’s GitHub is a web-app GUI for a Git forge – so of course it shouldn’t be considered an article. Getting your message out there as the first priority makes a lot of sense, but folks should consider rendering their content on a blog if that’s what the content really is (e.g. our linked article is not source code).

                                Chrome

                                No. No to Google. No to Blink hegemony.

                              2. 1

                                Reader Mode in Firefox, at least on Linux Desktop, is available on GitHub, though not on Lobste.rs, for example. A super bad hack for the line length is resizing the window (and praying that a particular website doesn’t switch to mobile mode).

                                1. 1

                                  Reader Mode in Firefox at least on Linux Desktop

                                  Interesting. Can confirm, however, Firefox on Android is a sad trombone. News sites and blogs almost always work though.

                            2. 2

                              I’m an amateur coder with training in classical lit, and whenever I get the desire to work on a hobby project of coming up with a better digital editions of classical texts, I find that so much of the XML information that comes up on search engines died out around the late 2000s. This article helps me understand why.

                              1. 2

                                While I must admit I had an initial knee jerk reaction to the use of the word “dead” in the title, this is an excellent article!

                                So much of what I’ve read about semantic web is so full of ultra heady abstractions I can’t even get my arms around what it IS much less why we should want it.

                                I really appreciate the way you walk through some of the existing attempts at a solution and the problems that cause them to fall short of the mark.

                                Also particularly enjoyed this quote:

                                I’m not sure precisely of the right solution to this problem, but I think it genuinely has to change. Innovation in industry is rare and hard since deep tech is so high risk. And innovation in academia is as likely to be pie in the sky as the next big thing, because nobody has the capacity to work through the differences between the two.

                                IMO this is a much bigger problem for the modern world of technology than Semantic web! As much as we innovate in some areas there are also vast swaths of utter stagnation where nothing much changes ever because of this very issue. Broad based change across difficult problem spaces requires engineering hours which are, in a modern capitalist society largely driven by dollars (Yes Virginia, even open source IMO!).

                                I look forward to reading all the referenced articles I shoved into Pocket! Great work!

                                1. 1

                                  So, the hypothetical “new” semantic web should have the unique name assumption, the closed-world assumption, a consistent logic, nice DDL/DML, constraints, views, types, … Great! So, help me Codd.

                                  1. 1

                                    My problem with the SemWeb is all of the above + the lack of executable networks, mostly as a bench test of the formats, query methods and methodology involved.

                                    What’s stopping us from encoding and executing small programs in RDF? What about intelligent agents, encoded and executed as an RDF subgraph, crawling the larger “internet” with their own intentions or ones driven by their dispatching users? (There are a few examples of RDF-based VMs, but that’s another conversation).

                                    From General Magic and Telescript to RDF and its pile of infrastructure, I don’t think anybody really fixated on what it means to put a bunch of computers together and have them share and mine knowledge. You had that in silos, like Google and Facebook, but never generally.

                                    1. 1

                                      The way I am approaching it is to make a name system using reproducible builds, the data you exchange can be executed in the named (i.e. referenced) context. The name system is decentralized so the peer needs to pick the most likely to be true definition of a name … which means that everything is contextual and shared understanding is emergent. The complicated part is building a currency system on that shared understanding so that the most likely to be true definition is also the most valuable one but I think this is emergent from avoiding a central currency in the first place. In the end, I think the ‘web’ part of ‘semantic web’ gets people confused… it’s just layers of code and data.