1. 11
  1.  

  2. 9

    You say that XML should only be used for documents, but with the exception of HTML it is primarily used as a data serialization format. Therefore it is useful to compare XML with other data serialization formats.

    I don’t think Erik Naggum’s quote is very insightful. The attributes have some meaning to the computer, and that meaning can be conveyed equally well in JSON. XML doesn’t attach any special semantics to attributes, so there is no point in even making the “real meat” distinction in the first place; it’s a flimsy argument based on assigning undue importance to syntactic details. If XML guaranteed that some invariant was preserved under stripping attributes then maybe, but I have no idea what that would look like, and I’m not sure it would be useful.

    The only sense I can think of that a “document” is different than other forms of data serialization is that sometimes documents are human readable/writeable. In that case I think XML doesn’t compare favorably to Markdown, reST, TeX, etc.

    1. 8

      > You say that XML should only be used for documents, but with the exception of HTML it is primarily used as a data serialization format.

      Far from it. TEI is used throughout the digital humanities to store all sorts of texts — language corpora, transcriptions of old printed books and manuscripts, reference works, speech transcription, etc. And there’s also DocBook, xml2rfc, MathML, DITA, alongside word-processing formats like OOXML, OpenDocument, and IDML which save rich text to XML.

      (Also, HTML itself is not an XML application, though there continues to be an XML serialization defined which hardly anyone uses.)

      I don’t understand your second paragraph at all, I’m afraid. Naggum’s argument is that information like, say, a coded identifier for a particular section of a document (like HTML’s id attribute), the destination of a hyperlink (a@href, img@src), and other purely ‘internal’ info which is of interest only to the programmer processing the document and perhaps also the person authoring it, but not in its raw form to readers, is suitable for use in attributes. He doesn’t say that they’re meaningless, more that they’re only of ‘behind-the-scenes’ meaning which generally produces or enables some other behaviour which is meaningful to actual document readers.

      > The only sense I can think of that a “document” is different than other forms of data serialization is that sometimes documents are human readable/writeable. In that case I think XML doesn’t compare favorably to Markdown, reST, TeX, etc.

      Indeed in many cases it does not — see my point on how I wish SGML had survived. Markdown and reST however are not good for highly structured documents with more than the mere ‘generic’ document semantics HTML provides, and TeX can really only be processed by TeX. (Because of that, DSSSL could process SGML into TeX like XSLT can process XML into HTML.)

      1. 3

        > Markdown and reST however are not good for highly structured documents

        You can see this by the mess of syntax they turn into when people try to extend them to handle the features they don’t originally support. The only two “new-generation” markup languages I can think of that have a reasonably complete range of document-markup features are Pandoc’s much-extended version of Markdown, and the org-mode file format. And those have such an array of special-case syntax and magic sigils that they start to look like the ’90s version of Perl. Now, org-mode is at least rarely read or written directly; it’s written through a special Emacs mode that does most of the markup bookkeeping and hides it by default. But if you’re willing to build a special editor to read/write the document, XML is fine as a document format too, and a more stable format to store and parse (basically nothing can correctly parse the full range of Pandoc Markdown except Pandoc, and likewise for parsing org-mode with anything other than org-mode).

        1. 3

          Your counterpoint was like the smallest conceivable example. “Digital humanities” is a microscopic sliver of XML usage. Keep in mind that AJAX originally meant Asynchronous JavaScript and XML. When someone says “primarily”, they don’t mean “exclusively”; they mean primarily.

          1. 2

            I’m not sure I buy Naggum’s argument in any case. If a person is reading the raw XML then they can see the attributes just as well as they can see the tag data, so it doesn’t matter if something is in an attribute or a tag. On the other hand, if they’re reading the document after it’s been processed into another format, then it doesn’t really matter if the processor got the data from an attribute or a tag element.

            On top of that, there are thousands of formats where XML is used for things that aren’t documents, and in those cases the decision between using an attribute or using a tag pair is arbitrary. Take the popular GPX format as an example. Latitude and longitude are stored as attributes in the “trkpt” tag, while elevation and time are stored in “ele” and “time” tags inside of each “trkpt”.

            1. 1

              > (Also, HTML itself is not an XML application, though there continues to be an XML serialization defined which hardly anyone uses.)

              HTML5’s greatest mistake was not being XHTML. It’d make parsing much easier.

              1. 3

                More of HTML’s parsing hairiness came from having to pick one of several competing, underdocumented parsing algorithms to implement than from the grammar itself being complicated. HTML5 specifies the parsing algorithm completely and unambiguously, including what to do on malformed input. That reduces the net difficulty of parsing far more than the size and complexity of the HTML5 parsing algorithm increases it.

          2. 6

            I don’t understand the point of this argument. Like it or not, XML has been used as a serialization format from the very beginning, and it was strongly promoted as one for a very long time. For at least a decade the rule of thumb was that any new data serialization format that could benefit from being human readable should be in XML because there are lots of tools for working with it. That’s part of the reason why there are so many query and transformation tools like XPath, XSLT, XQuery, etc.

            Data serialization is a huge use of XML, and it’s silly to pretend it can’t be compared to another serialization format. For every “don’t do this” rule you can make, it’s not hard to find a dozen or more widely used real-life examples that break the rule.

            IMO, the common sense thing to do for new data formats is to use whatever the rest of your system is using. Why add a new dependency on a JSON reader if you have 10 things using XML already?

            1. 2

              > “If it looks like a document, use XML. If it looks like an object, use JSON. It’s that simple.”

              This is a bit of revisionism. I was an early tester of XML tech, coming from an HTML background. They were pushing two things simultaneously: a more flexible format for documents than HTML, with associated transformation tech such as XSL; and a data exchange format. The way they described documents, web stores, and so on was basically as a data exchange format for arbitrary types of data. It would be rendered using blah blah blah. Companies then started doing far more of the data-format usage than the document-and-renderer usage.

              Fast forward many years. It became the default tech in the space where JSON now competes. Both are in heavy usage in this space. So comparisons are warranted, given there are whole ecosystems applying each one to the same problems.

              1. 1

                I’m gonna start putting JSON inside of CDATA sections.

                1. 1

                  This made me discover CBOR, nice!
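
For readers following the GPX example raised in the thread (latitude and longitude as attributes of `trkpt`, elevation and time as `ele` and `time` child elements), here is a minimal Python sketch of what a consumer sees. It is an illustration under stated assumptions, not a reference GPX reader: the sample document, its coordinates and timestamps, and the function name `track_points` are all made up for this example, and the namespace is the GPX 1.1 one. It reads both kinds of field into one flat record and emits JSON, where the attribute-versus-element distinction no longer exists.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample track; the coordinates and timestamps are invented.
GPX_SAMPLE = """
<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <trkseg>
      <trkpt lat="47.6440" lon="-122.3260">
        <ele>4.5</ele>
        <time>2020-01-01T12:00:00Z</time>
      </trkpt>
      <trkpt lat="47.6450" lon="-122.3270">
        <ele>5.0</ele>
        <time>2020-01-01T12:00:10Z</time>
      </trkpt>
    </trkseg>
  </trk>
</gpx>
"""

# GPX 1.1 elements live in this namespace, so lookups must be qualified.
NS = {"gpx": "http://www.topografix.com/GPX/1/1"}


def track_points(gpx_text):
    """Return one flat dict per <trkpt> (hypothetical helper name)."""
    root = ET.fromstring(gpx_text.strip())
    points = []
    for trkpt in root.iterfind(".//gpx:trkpt", NS):
        points.append({
            # Stored as XML attributes on <trkpt>.
            "lat": float(trkpt.get("lat")),
            "lon": float(trkpt.get("lon")),
            # Stored as child elements of <trkpt>.
            "ele": float(trkpt.findtext("gpx:ele", namespaces=NS)),
            "time": trkpt.findtext("gpx:time", namespaces=NS),
        })
    return points


if __name__ == "__main__":
    # In the JSON output all four fields are plain keys of the same object.
    print(json.dumps(track_points(GPX_SAMPLE), indent=2))
```

The reading code has to know which fields are attributes (`get`) and which are elements (`findtext`), but the resulting records treat all four uniformly, which is the sense in which the choice is arbitrary for data formats like this one.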