As a markup language, its main problem is that it is not friendly for human editing. It’s not robust against common typos, and it’s verbose enough that they are easy to make. That’s less annoying if you use an XML editor, but I’ve never found a good one.
The main complaints about XML have nothing to do with this, though; they are about its complexity. XML entities alone are a nightmare to parse safely. XML namespaces are nice in principle (being able to say ‘now I am switching to another XML dialect until the closing tag’ is great), but the renaming magic that they allow is terrifying.
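To make the entity complaint concrete, here’s a rough sketch (Python with lxml; the “greeting” entity and the tiny document are invented) of how a document gets to define its own substitutions that the parser is expected to expand:

from lxml import etree

# A document can declare its own entities in an internal DTD, and a
# conforming parser is expected to expand them wherever they appear.
doc = b'<!DOCTYPE d [<!ENTITY greeting "hello">]><d>&greeting;</d>'
print(etree.fromstring(doc).text)  # lxml expands internal entities by default

# Which is why hardened parsers grow knobs like
# etree.XMLParser(resolve_entities=False), and why libraries like defusedxml exist.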
The big problem is that it tries to be a rich generic document language, a text markup language, and a structured object model all at once and when these goals invariably come into conflict it adds complexity to address the differences. This complexity, in turn, means that almost nothing can parse XML. I wrote a parser for the tiny subset that XMPP needs but a complete parser is huge and complicated. This means that interoperability between things that actually use more than a tiny subset of XML features is more or less nonexistent.
Isn’t it? Every opening tag has a matching closing tag. If you forget a </p>, your linter can tell you right away. And (devil’s advocate) the rules are simple, unlike CommonMark.
We know from prior experience that humans and xml are not compatible. There are just too many “xml” standards where humans would write stuff and it would be invalid, even things like rss that were ostensibly xml from day 1.
The problem is that xml by humans means requiring a human to be as flawless as an xml parser, which is honestly setting unreasonable expectations of both :)
And as a structured object model it is poor because it can’t include bytestrings, which are important for a lot of use cases. (Except by base64 or similar embeddings, which inflate the data and take time to parse.)
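For a sense of the inflation, a quick standard-library sketch (the 3000-byte buffer is an arbitrary example):

import base64, os

# base64 turns every 3 raw bytes into 4 ASCII characters,
# i.e. roughly 33% bigger before you even start parsing it.
raw = os.urandom(3000)
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))  # 3000 4000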
It feels like some slightly stricter subset of HTML5 would work OK as a “modern XML”?
Ironically HTML5 has the opposite problem – any typo is valid, and the browser guesses an often wrong interpretation of your document. I’m thinking about what happens when you mess up the balancing of <table> <tr> <td> – you often get something weird.
CSS selectors are also a pretty nice complement to HTML that are widely used and understood, and I’m not sure what the XML equivalent is, but it’s probably not better.
XML has XPath, which is harder to use than query selectors, IMO.
Which also comes with the “oh, this document uses namespaces… I guess I’ll google how to handle those again”. But apart from that, at least xpath has more features like partially matching the text inside a node.
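To make the comparison concrete, a rough sketch in Python with lxml (plus the cssselect package for the CSS path); the markup and the Atom example below are invented:

from lxml import etree, html

page = html.fromstring('<div id="intro"><p>hi <a href="/x">there</a></p></div>')
print(page.cssselect("#intro p a"))            # CSS selector
print(page.xpath('//div[@id="intro"]//p//a'))  # XPath equivalent

# XPath can also match on text content, which CSS selectors can't:
print(page.xpath('//a[contains(text(), "there")]/@href'))

# ...and once namespaces show up, XPath needs an explicit prefix map:
feed = etree.fromstring(
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>')
print(feed.xpath("//atom:title/text()",
                 namespaces={"atom": "http://www.w3.org/2005/Atom"}))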
Ironically HTML5 has the opposite problem – any typo is valid, and the browser guesses an often wrong interpretation of your document
So this is not true. A big part of the html5 process and the related ES3.1/5 process was fixing the horrors of the 90s and 00s. I recall us having to spend a lot of time working out how html was actually parsed and the DOM trees built to ensure we could actually define things. The end result of all of this is not necessarily the nicest language specification (in an academic grammar-purity sense), but there is no guessing involved. The parsing of everything is exactly specified today; the result of typos, mismatched tags, etc. is all well defined.
It was the same in JS: the “specifications” that the 90s and 00s provided were shitty. They were either incomplete or ambiguous, or represented what some spec writer wanted rather than what happened (presumably because they were all retroactively specified by people who knew what they wanted to be the case, and thought telling people that their existing content was now wrong was somehow a good idea).
Right that’s actually what I meant. As far as I know there are no syntax errors in HTML5, and every byte sequence is valid.
Except it’s probably hard for a human to predict what document many of those byte sequences produce, e.g. things like
<table> a <tr> b <td> c </td> d <table> </td> e <tr> f </td> g
and so forth
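If you’re curious what the well-defined result actually is for soup like that, here is one way to look — a sketch assuming the third-party html5lib package, which implements the HTML5 parsing algorithm:

import html5lib
import xml.etree.ElementTree as ET

doc = html5lib.parse(
    "<table> a <tr> b <td> c </td> d <table> </td> e <tr> f </td> g",
    namespaceHTMLElements=False,
)
# The tree is fully specified, but the stray text gets relocated in ways
# that are hard to predict by eye.
print(ET.tostring(doc, encoding="unicode"))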
I think that decision to define everything (I guess based on exhaustive testing of dominant implementations) was probably right given the state of the web …
But it does mean that it’s harder to write HTML by hand, because you don’t have any feedback about mistakes in the doc. In practice what people do is just fix the ones that “make it look weird”.
It’s not that loose in practice. There’s the w3c validator https://validator.w3.org/nu/#textarea and for some things it does complain a lot. For example for your fragment:
Error: Misplaced non-space characters inside a table.
Fatal Error: Cannot recover after last error. Any further errors will be ignored.
In practice, I believe validator use is low to nonexistent
I tried to use them a few years ago for my website, and they were in very poor shape
Not to mention that W3C itself is unfortunately a lagging indicator of WHATWG and HTML5 these days
I also really like XML! It has two more things that make it really good for text markup:
<a>lik<b>e thi</b>s</a>
I used both features to cut a week off the delivery time of learntla, in a way that I couldn’t have done with JSON, YAML, or even rST.
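For anyone wondering what that looks like from the parser’s side, here’s a small sketch of how mixed content comes out (Python with lxml, using the fragment above):

from lxml import etree

# Mixed content: text interleaved with inline elements. ElementTree-style
# APIs expose it as .text (before the first child) and .tail (after a child).
frag = etree.fromstring("<a>lik<b>e thi</b>s</a>")
print(frag.text)     # "lik"   (text before <b>)
print(frag[0].text)  # "e thi" (text inside <b>)
print(frag[0].tail)  # "s"     (text after </b>, still inside <a>)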
Could you have done this with HTML?
I think HTML has all the nice properties of XML, without a lot of weird stuff like XML namespaces
The downside of HTML is that any typo still makes a valid doc. The browser often guesses what you mean in a weird way (i.e. non-balanced tags)
It feels like HTML is too liberal, and XML is too strict, although I only really use HTML.
What libraries did you use?
The downside of HTML is that any typo still makes a valid doc. The browser often guesses what you mean in a weird way (i.e. non-balanced tags)
That’s a pretty big downside. Also, HTML is now defined by a “living” standard put forth by the browser cartel, AKA WHATWG. In other words, HTML is what the big browser makers say it is. That’s a poor foundation on which to build.
All of that quirkiness in HTML, like the optional closing tags and so forth, is a relic of bygone days when most HTML was written by hand. It made the format friendlier to human authors. I really don’t want to ever write XML or HTML by hand in the general case. Doing so is the equivalent of manually speaking SMTP to a mail server. Sometimes I do manually speak SMTP to mail servers for debugging and such, but for the most part, that’s my MUA’s job, not mine.
It does however have the benefit of matching what actually happens on the web. You can argue it’s bad, but the alternative is XML and xhtml, which have repeatedly failed due to their strictness requirements vs human editing.
Also, HTML is now defined by a “living” standard put forth by the browser cartel,
Well that’s nonsense. Anyone can make proposals and contribute to the various html specs, which pre html5/whatwg they could not.
In other words, HTML is what the big browser makers say it is.
I mean yes: if the browsers don’t implement something, then the fact that it’s in a “spec” is irrelevant. If the browsers do implement something, then you want it to be specified so you don’t recreate the pre-whatwg version of html, where the spec does not match reality and cannot be used to write a browser.
That’s a poor foundation on which to build.
I’m sorry, this is complete BS. Before the WHATWG and the “browser cabal”, what we had was a bunch of specifications that were incomplete, ambiguous, and oftentimes just outright incorrect. The entire reason WHATWG came to exist is because groups like W3C, ECMA, etc. would define what they wanted things to be, and not what they actually were. You want them to be “living” because what exactly do you think the alternative is? Again, we’ve had static “versioned” specs, and they were bad.
I get that many people have started their careers in the last decade or so, and I think that means they are unaware of what the pre-whatwg world was like, and the costs that came with it. Similarly there are strange beliefs about how these standards bodies actually operate, which can only come about from never actually interacting with them (and certainly without ever trying to participate in the pre-whatwg bodies, because you generally needed to be an invited expert or a paying corporate member, which seems less open than the apparently evil cabal).
HTML isn’t extensible. It’s an object model customized to displaying content in a browser. You could technically replace all markup usages of XML with HTML, but it gets ugly fast:
docx, pdf — uses XML to create a different object model that’s meant for being printed
translating a work into lots of languages
annotating music or lyrics
Markup isn’t necessarily for display, it’s adding data to text
What’s not extensible about HTML?
That was the idea behind microformats
https://developer.mozilla.org/en-US/docs/Web/HTML/microformats
I guess I’m mainly talking about the syntax of HTML vs. the syntax of XML.
I know XML has DOM and SAX APIs. Let’s leave out DOM altogether.
If you just consider a SAX API (event driven) then I don’t see a huge difference between XML and HTML, besides syntax (and all the optional elaborations that are only used by certain apps?)
I’m asking what libraries people use because maybe XML has a bunch of great libraries that make things easier than using HTML, but so far I am not aware of the benefit
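To make the event-driven comparison concrete, a sketch with the two standard-library parsers (the tiny documents below are made up):

import xml.sax
from html.parser import HTMLParser

class XmlEvents(xml.sax.ContentHandler):
    # SAX hands you start/end/text callbacks for XML.
    def startElement(self, name, attrs):
        print("start", name, dict(attrs.items()))
    def endElement(self, name):
        print("end", name)
    def characters(self, content):
        if content.strip():
            print("text", content.strip())

class HtmlEvents(HTMLParser):
    # html.parser hands you nearly identical callbacks for HTML.
    def handle_starttag(self, tag, attrs):
        print("start", tag, dict(attrs))
    def handle_endtag(self, tag):
        print("end", tag)
    def handle_data(self, data):
        if data.strip():
            print("text", data.strip())

xml.sax.parseString(b'<doc><p class="x">hi</p></doc>', XmlEvents())
HtmlEvents().feed('<div class="x">hi</div>')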
Isn’t HTML specifically about using a specific set of tags? Like if you are writing HTML with a bunch of custom tag names you’re actually writing XML and not HTML?
I think that was the theory with XHTML, but it never happened in practice. XHTML meant that HTML was just XML with specific tags.
But we don’t use XHTML anymore – we use HTML5, which is more compatible with HTML 4, HTML 3, etc.
All browsers have always ignored HTML tags they don’t understand, because otherwise HTML could never be upgraded … e.g. I just tested this and it just shows “foo”. Maybe it makes it behave like <span> or something.
Fun fact, if you put a hyphen in a tag name in html, it’s specifically not a spec tag, so you can create your own formatting around custom tags instead of classes
That’s not quite correct. All tags are explicitly valid; what changes is whether a tag has:
Additional non-display semantics (e.g. …)
Builtin non-default styling - e.g. … (lol!) - tags that predated the ability to apply styling/layout that isn’t explicitly built in to the browser
But fundamentally the tag name of an element is not important - the DOM APIs are generic and string-based (document.getElementsByTagName), and CSS’s sigil-free (so arguably “default”) identifier matches the tag name of an element.
The rise of JSX shows that actually XML-ish syntax is a pretty good way to describe a document. It’s just that people used XML for things that aren’t documents, and that was a mistake.
XML is precisely what it says on the tin: an extensible markup language.
🤯, everything makes so much more sense now.
And for what it is, there is truly no replacement. Every other markup language supports only a limited set of markup directives defined from the factory.
This bit is not entirely true. AsciiDoctor has “generalized container with attributes”:
https://docs.asciidoctor.org/asciidoc/latest/blocks/open-blocks/
And, my favorite thing to shill for these days, https://djot.net/, has this as an explicit core feature
Maybe the grass is greener, but after many years of hate I kind of miss the old XML. It’s a pain to find a widely, consistently supported format that’s text based with comments. XML has good schema languages and the namespace based extensibility is really very nice - something I miss elsewhere.
Maybe we just tried using it for too much…
Repeat of my orange site comment w/ a few edits:
The ~document chaos is real, but it’s terminological chaos from before computing that turns into trouble when you’re trying to mark it up.
Before computers, you could still reasonably use the term document to refer to things as different as an essay, poem, book manuscript, CV, patent, birth certificate, voter-registration form, and a court transcript.
It would be nice to cleave them apart, but that’ll take a lot of work since English has a dearth of good terms for discriminating between highly-structured and mostly-unstructured documents.
I struggled through writing something roughly about this last year: https://t-ravis.com/post/doc/what_color_is_your_markup/ (https://lobste.rs/s/tb84fm/what_color_is_your_markup)
In those terms, the markup depicted in the linked piece’s “document” is basically all “structural” markup–and I feel like this is the least interesting part of an unstructured document. It’s the same logic that translates a paper address-change form into structured data describing what’s in the fields–but it doesn’t yield any of the leverage that we get from the meaningful associations in the form-as-markup.
We can’t do much more with the structural markup of a free-form document than present it and perhaps translate it to other formats with similar idioms. Structure in this kind of document is somewhat capricious (if five different writers wrote exactly the same 20 pages of text, they might all still use different section/paragraph/sentence boundaries).
The bigger leverage in free-form documents comes from ontological markup that annotates what’s being written about. This is what’s going to enable you to bolt on an interesting extension that your readers can use to jump between every section on your site that discusses the same paper or author. Or enable you to automatically inject birth/death/release dates for people, films, and albums you refer to.
(I don’t mean to suggest the post precludes these–but the verbosity comparison feels much less fair without enough inline annotation to provide a similar level of utility as the star record.)
I sometimes think that XSL has been unfairly overlooked. It’s no fun to write, but the functionality is neat. If your data is already XML, you can use XSL to transform and format it for human readability. Sure, you can write another program to transform your data, but XSL is more descriptive. The XML file is still there if you need something machine-readable, but you’ve enhanced it.
Unfortunately the syntax is grim, the mental model is complicated, and it’s not something people need all that often.
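A minimal sketch of that idea (Python with lxml; the “stars” data and the stylesheet are invented for the example):

from lxml import etree

# The stylesheet describes the transformation declaratively; the data stays XML.
stylesheet = etree.XML(b"""\
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/stars">
    <ul>
      <xsl:for-each select="star">
        <li><xsl:value-of select="@name"/></li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>""")

data = etree.XML(b'<stars><star name="Vega"/><star name="Deneb"/></stars>')
transform = etree.XSLT(stylesheet)
print(str(transform(data)))  # an HTML <ul> listing the star names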
I used to argue that XML was awful and that the whole world should just switch to JSON. Then I used JSON for a bit and hated the lack of tooling for addressing it. Then jq came along and I liked it, but then when humans tried to edit it, I hated it again. Then YAML came along, and HOCON and RON and TOML and a host of others, all modest improvements upon or specializations of their predecessors.
But few meet the solid objective of XML: a document format, wherein the meat is more plentiful than the bones.
Don’t put a document in an object notation format. Structure metadata well inside the document and its overhead will be less frustrating.
I’m now ~6 years removed from working on a product that was XML, XSL, XPath, and friends from the ground up with its bases in the early 2000s C world from academia. It was Done Right if you ask me, a Search engine product that treated data as documents, not objects. Working with JSON from that stack was hard, but that was during a time of transition before tools that could convert JSON to XML with some kind of convention and vice versa were available.
And I’ve not touched XML since, really. I’ve almost redone my LaTeX resume in XML a few times now, but keep letting other things far more impactful to my life take precedence. JSON Resume is cool, but the mockups I’ve done in XML always felt better.
I’ve never fully understood how this argument gets made. To me, XML and JSON solve non-overlapping problems. I mean, if you see XML like this:
then that should be JSON for sure, but it should never have been XML even before JSON existed… not an xmlns in sight
This is the view that I came to hold a few years ago and that I hold now.
And your example is right on. Perfect example of something that should never have been XML.
Sorry, your history is incorrect.
XML was heavily pushed by IBM during the early 1990s as a way to replace CSV files, fixed width data files, and “just dump the record to disk” formats. It outperforms these in every measure but size. As the IBM model at the time was to provide consultants to a zoo of hardware and software, this allowed a mechanism to slowly wean these systems off legacy data formats. As is often the case, the usual suspects came out to push for XML as a contract negotiation system, a universal solution for some imagined problem, etc. For some time, IBM had some proprietary compression schemes lower in the network stack.
That said, XML has tricky corners and some line noise is valid XML.
It’s a markup language with a completely uniform syntax so that the alphabet of markup elements is customizable. And for what it is, there is truly no replacement. Every other markup language supports only a limited set of markup directives defined from the factory
I feel like stuff like Jinja, Scribble, ReST all fall into this category as well. I have a hard time imagining using those in cases where I’ve used XML though, because XML by default is spitting out structured output. It’s not that you can’t get an AST of sorts from these other tools, but…