1. 45
    1. 21

      Shouldn’t S-expressions say

      Cons:

      • Car Cdr

      ?

      1. 5

        I love it. Added.

      2. [Comment removed by author]

        1. 0

          whoosh

        2. 1

          @InPermutation wasn’t saying that a con of using S-expressions is having to write the function names car and cdr in your code. They were making a pun on cons (cell), which is defined as a car and a cdr.

    2. 9

      Since BARE is given a mention, I can say that it’s one alternative that’s easy to implement, if that’s a quality that you value (I wrote the Zig implementation.)

      1. 1

        What did BARE grow out of?

        1. 2

          It was a project started by Drew DeVault. It seems like a fresh take on the JSON alternatives mentioned in the article.

          https://drewdevault.com/2020/06/21/BARE-message-encoding.html

    3. 8

      EDN deserves a mention https://github.com/edn-format/edn. It’s pretty much only used by Clojure and adjacent projects (e.g. Datomic), but there are serialization libraries for a bunch of languages (https://github.com/edn-format/edn/wiki/Implementations)

      1. 2

        Author here. I just learned about EDN recently, it’s listed under “S-Expressions”. Definitely worth considering and I like what I’ve seen of it, it just needs to be… I dunno, robustified. Turned into an actual reference document rather than a description, ideally with accompanying test suite.

        1. 2

          Aha, I didn’t see that under S-expressions. Technically I don’t think EDN is actually an S-expression, because it’s not just lists and atoms.
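
          For a quick sense of the difference (an illustrative EDN value of my own, not from the thread), maps, sets, and tagged elements are all first-class:

            {:name "example" :tags #{:a :b} :created #inst "2021-01-01T00:00:00Z"}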

        2. 1

          I like what I see in EDN, but I also wonder if it’s more of a language-specific serialization format. Like Go has the “Gob” format. Python has Pickle.

          Those are tightly coupled to the language itself. In pickle’s case there are several different versions of it.

          EDN looks like it’s about halfway in between… I like the extensibility mechanism with tags, but until it’s implemented in another language, it feels like a Clojure-specific format to me.

          IMO this is the main compromise with serialization formats… You can trade off convenience in one language for interoperability in all languages. JSON is kind of impoverished but that means it works equally well in most languages :)

          The thing that annoys me about both JSON and EDN is that they inherit the browser’s and JVM’s reliance on 2-byte Unicode encodings. You have to use surrogate pairs instead of something like \u{123456}. And JSON can’t represent binary data.
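
          For example (my own illustration), escaping U+1F600 in a JSON string:

            "\ud83d\ude00"    (surrogate pair, required)
            "\u{1F600}"       (not legal JSON)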

          1. 5

            Well it’s implemented in JavaScript: https://github.com/shaunxcode/jsedn

            Go: https://github.com/go-edn/edn

            Rust: https://github.com/utkarshkukreti/edn.rs

            .NET: https://github.com/robertluo/Edn.Net

            Java: https://github.com/bpsm/edn-java

            More implementations here: https://github.com/edn-format/edn/wiki/Implementations

            Still would be nice to have a spec and test suite though.

    4. 7

      I wish more people knew about JSON5 — it’s super convenient for when you have to hand-write JSON, because it lets you omit quotes around object keys, use single quotes, add trailing commas, and add comments. I use it all the time when writing C++ unit tests for APIs that have to consume or produce JSON.
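
      For instance, a hand-written JSON5 document might look like this (an illustrative snippet of my own, not from the article):

        {
          // comments are fine
          name: 'widget',      // unquoted key, single-quoted string
          sizes: [1, 2, 3,],   // trailing comma
        }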

      Of all the binary alternatives I lean toward Flatbuffers. The zero-copy parsing is a huge performance win, and the closely-related Flexbuffers let you store JSON data the same way.

      1. 5

        I recently came across another JSON-with-benefits format, Rome JSON, but it doesn’t seem to be formally specified yet.

      2. 2

        Another convenient JSON variant is the superset defined by CUE (a small example follows the list):

        • C-style comments,
        • quotes may be omitted from field names without special characters,
        • commas at the end of fields are optional,
        • comma after last element in list is allowed,
        • outer curly braces are optional.
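
        A tiny example of my own combining these relaxations (assuming plain CUE data, not a schema):

          // no outer braces, no quotes around simple field names
          name: "example"
          port: 8080
          tags: ["a", "b",]
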
      3. 1

        JSON5 seems kinda dead — several of the repos on its GitHub are archived?

        1. 1

          I implemented my own JSON5-to-JSON translator in C++. It was simpler than writing a full parser and safer than hacking on the existing JSON parser I use. It adds overhead to parsing, but I don’t use JSON5 in any performance-critical context so I don’t mind.

    5. 4

      The article mentions Facebook as Thrift’s only notable user, but Twitter is certainly another one. Not quite sure if Pinterest still has Thrift-speaking services, but they did a few years ago.

      1. 4

        Airbnb also uses Thrift. Claiming no one outside of FB uses Thrift seems pretty uninformed…

        1. 2

          Thanks to you both, added!

      2. 1

        My company has legacy services in Thrift. The main cons vs. gRPC/Protobuffers are:

        • no streaming RPC; messages have to be read fully into memory on the server before processing can begin
        • no authentication built into the protocol, no good way to do access control or access logging

        The main advantage over Protobuffers is that in Thrift, exceptions are part of the typed RPC IDL, and thus form part of the API. This is nicer than setting and checking gRPC context status codes and string details.
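
        For illustration, a minimal Thrift IDL sketch of my own (the names are hypothetical) showing how an exception is declared as part of the service contract:

          exception NotFound {
            1: string message
          }

          service UserService {
            // the exception is part of the method signature, and so part of the API
            string getUserName(1: i64 id) throws (1: NotFound err)
          }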

        1. 2

          no streaming RPC; messages have to be read fully into memory on the server before processing can begin

          no authentication built into the protocol, no good way to do access control or access logging

          The section about Thrift gets one major thing right: “Apache is the tragic junkyard of open source projects”.

          Both of the issues you point out above are solved problems for Thrift at FB, either integral to the protocol (e.g. streaming) or handled by the environment (e.g. mutual TLS for point-to-point authentication and CATs for end-to-end, contextual authentication).

    6. 4

      It is indeed kinda funny that with CBOR it’s much easier to find implementations (one of them even started by me) than users of those implementations :)

      CBOR was originally designed for IoT stuff (to be delivered as CoAP payloads), but probably the most famous use these days is the FIDO2 CTAP + WebAuthn world. Still, it’s constantly gaining popularity as a general purpose format, popping up in various projects everywhere. e.g., crev, Liberapay.

      Also notable: CDDL is a CBOR-oriented schema language, used in the WebAuthn spec for example.
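
      A tiny illustration of my own (not from the comment): the map {"a": 1} encodes in CBOR as the four bytes a1 61 61 01 (a one-pair map, a one-character text string "a", and the unsigned integer 1). A CDDL rule describing maps of that shape might look like:

        example-map = { * tstr => uint }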

    7. 4

      There’s also the super-simple Bencode. It’s not really any better than JSON, and in some ways worse, but it’s out there and I’ve always had a soft spot for it.

      Its main advantages are:

      • dead simple format
      • easy to implement
      • can be kinda-sorta human readable but not really
      • each encoded data structure, of any level of nesting, has exactly one correct representation; comparison of data structures can therefore be fast
      • can transmit binary data without having to encode it
      • self-describing

      Its main disadvantages:

      • the only numeric type is the integer
      • no character encoding specified; strings are just sequences of bytes
      • not really human readable
      • not particularly efficient

      In other words, it’s probably not better than any of the ones mentioned in TFA, but I’ve always liked it, I think because it’s such a fun little format to write implementations for (a minimal encoder sketch is below).
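
      To back up the “easy to implement” point, here’s a rough Python encoder of my own that handles only ints, byte strings, lists, and dicts:

        def bencode(value):
            # ints: i<number>e, byte strings: <length>:<bytes>,
            # lists: l...e, dicts: d...e with keys sorted bytewise
            if isinstance(value, int):
                return b"i%de" % value
            if isinstance(value, bytes):
                return b"%d:%s" % (len(value), value)
            if isinstance(value, list):
                return b"l" + b"".join(bencode(v) for v in value) + b"e"
            if isinstance(value, dict):
                return b"d" + b"".join(bencode(k) + bencode(v)
                                       for k, v in sorted(value.items())) + b"e"
            raise TypeError("cannot bencode %r" % type(value))

        # bencode({b"spam": [b"a", b"b"]}) == b"d4:spaml1:a1:bee"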

    8. 4

      Very subjective, but that’s not necessarily a negative. Definitely a good overview of data serialization formats.

    9. 3

      I’m still always miffed to see XML on these lists, since it’s not a serialization format and the “cons” are always unrelated things (like this article complains about schemas and validation, which are not part of the data-ish format anyway…)

      1. 3

        If XML is not a serialization format, it sure has been abused to hell and back as one. Notable offenders, in my mind, include XML-RPC, Collada, KML, SVG, and probably far, far more that I don’t know about. If you’re making formatted documents or human-writeable markup, then XML isn’t the worst thing ever, but it could still be much better. And, as the lineage at the bottom shows, most of the things on the list were made in reaction to XML in one way or another, so leaving it out is impractical.

        IMO, considering schema and validation apart from the data format is like comparing a programming language separately from its implementations: semantically tidy, but generally not what one is making practical decisions based on. Hence why I break “schema” and “schemaless” into separate categories and don’t try to compare them with each other. So, in the category of “formats with schemas”, I’ve found XML’s schemas very hard to use in practice.

    10. 3

      I think asn.1 is simpler than it’s believed to be but nobody knows it because a) the long strings of funny looking id numbers are intimidating, b) it’s binary, not text, c) lack of familiarity/libraries and d) it’s used in cryptography which is understandably scary. I think asn1 is simpler than xml when you have a pretty printer to hand to eyeball the structure with.

      The distinctions in asn.1 between BER, DER, PER and CER are IMHO a mistake but in the context of data exchange you can just ignore them. An asn1 parser will consume any of them equally happily.

      1. 2

        Also, the article wrongly says that ASN.1 is a binary format. ASN.1 is only a definition language (it could be replaced with something more readable) that has multiple encodings. And binary formats (like BER) aren’t the only ones available; there are also JER (JSON encoding rules) and XER (XML encoding rules), with the possibility of adding more (in theory nothing prevents encoding an ASN.1-described document as a Protobuf or Cap’n Proto document).

        As for CER and DER, these are subsets of BER, and a BER parser will happily ingest them. PER is slightly different, as it omits some bytes, so it needs a separate parser. However, there is no such thing as an ASN.1 parser that consumes them, since ASN.1 is just a description format, not an encoding.

        1. 1

          I don’t think I’ve ever seen a textual asn1 document in the wild.

          ASN.1 is only definition language

          I willfully ignore this because you will never ever be tempted to use the DL without the serialisation formats.

          And about CER and DER, these are subsets of BER, and…

          Yeah this is the problem. If you were doing a modern redesign with the same goals, you’d skip it all and have at most 2 sets of encoding rules: binary and text. Much simpler.

          The use cases for BER, DER, PER and CER are not different enough that it’s worth the complexity cost when you could instead just pick one of them that covers enough use cases & discard the others.

          no such thing as ASN.1 parser

          There’s no such thing as an asn1 user who isn’t also using one of the encodings, so this is a really annoying level of pedantry. Obviously by this I mean a library that can ingest data produced with BER and friends. e.g. https://pypi.org/project/asn1/

          1. 1

            you will never ever be tempted to use the DL without the serialisation formats

            Yes, but the whole point of ASN.1 is to be separate from the encoding. That is why it is called Abstract Syntax Notation One.

            The use cases for BER, DER, PER and CER are not different enough that it’s worth the complexity cost when you could instead just pick one of them that covers enough use cases & discard the others.

            It is not that their use cases are “different” enough. It is like DEFLATE and Zopfli: complexity is required only on one end of the pipeline, the sender; the receiver does not really care and can treat all data as BER, because its parser will happily digest DER and CER. The difference is only that BER supports more ways to encode the same data, while DER will always be canonical. So for simplicity you can always use a DER encoder to send data and a BER parser to receive it.
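
            A small illustration of my own (standard tag-length-value layout), showing the same INTEGER 5 under both sets of rules:

              02 01 05       DER (minimal, canonical length)
              02 81 01 05    legal BER, long-form length, forbidden in DER

            A BER parser reads both; a DER encoder may only emit the first.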

            There’s no such thing as an asn1 user who isn’t also using one of the encodings

            AFAIK SNMP does not use any of the mentioned encodings while still using ASN.1 to describe the data. So there are a lot of such users, maybe not fully aware of the fact, but still.

            1. 1

              SNMP does not use any of the mentioned encodings while still using ASN.1 to describe the data

              Huh. Sure, okay, that’s a thing that exists. I don’t think that this is a good argument that anyone would ever want to use asn1 just for specifying a data structure layout. Personally, I’d say its use in the SNMP spec makes the specification of the wire format quite a bit less clear.

              Contrast that with many other RFCs, which have clearer things like BNF grammars, structure declarations in whatever programming language no member of the working group hated enough to veto, or even ASCII-art diagrams with labelled octets.

    11. 3

      Why no Avro?

      1. 3

        Avro is mentioned at the bottom, listed as “Hadoop/Apache/Yahoo. Does anyone actually heckin’ use this?”.

        1. 2

          I think most places that use Kafka use Avro. I’ve used it separately as well (protobuf has some detractors). That probably makes it one of the more popular ones.

          1. 1

            Thanks, good to know! Updated the page.

    12. 2

      XDR:

      Cons:

      • Doesn’t necessarily do much unless you’re a C program from the early 1990’s

      Heh…that didn’t stop me from choosing it for a project (this) in 2014. I guess I didn’t know any better? (Shrug.) In hindsight, if I were doing it over today I’d probably…still go with XDR.

      1. 4

        …well, given that the program is all written in C, which hasn’t changed too much since the early 1990s, I think it’s probably a pretty reasonable choice!

    13. 2

      Interesting write-up.

      My personal view:

      • Missing a mention, IMO, is Parquet, which you could think of as “columnar JSON, but fancier”. It matters when you are storing billions of rows with common columns. But, IMO, Parquet should never be the source of truth. It’s just not stable enough, nor well-supported across every platform.
      • The source of truth should be JSON – likely, compressed JSON, often “JSONLines” batches, using either gzip for max compat, or zstd with a custom learned dictionary for max compression.
      • When you are encoding JSON over the wire and are worried about fat network payloads or deserialization CPU time, I think you can simply one-for-one swap JSON in memory for msgpack on the wire (quick sketch below). Yes, I guess CBOR is OK, too, since it seems like they are roughly the same thing.
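
      A quick sketch of that swap in Python, using the third-party msgpack package (my own example, not from the comment):

        import json
        import msgpack  # pip install msgpack

        doc = {"id": 1, "tags": ["a", "b"]}

        as_json = json.dumps(doc).encode("utf-8")  # text payload
        as_msgpack = msgpack.packb(doc)            # drop-in binary equivalent

        assert msgpack.unpackb(as_msgpack) == doc  # same in-memory structure back
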
    14. 2

      Cap’n Proto, Users:

      • Cloudflare
      1. 2

        Added, thank you!

      2. 1

        Somewhat. I’m only aware of a couple of things using CapnP. There are probably more services using gRPC (protobuf).

    15. 1

      So, how do cap’n proto and flatbuffers compare? The page doesn’t really address this.

      1. 2

        See the “Conclusions” section. Getting deep into the details isn’t the purpose of the page. Partially ‘cause I don’t have that level of expertise.

    16. 1

      I like S-expressions. If you’ve ever written OCaml, you’ve probably run into them. More compact than JSON, with fewer rules, just as expressive, and fairly readable in small doses. https://dev.realworldocaml.org/data-serialization.html They lend themselves well to both data serialization (turn that Book into a book.sexp file) and configuration (tell Dune about your cool library).
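
      For a flavor of what that looks like, a small illustrative book.sexp of my own (field names invented, not taken from the linked chapter):

        ((title "A Book")
         (pages 312)
         (tags (ocaml serialization)))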

    17. 1

      Pity it doesn’t mention Rebol or Red, which, by virtue of homoiconicity, are their own data format, with literal forms for things like e-mails, IP addresses, URLs, dates, hashtags, and @references. Red also features the Redbin format, to which it can be serialized.

    18. 1

      For completeness, Hjson is another alternative. It targets config files, not data serialization, so it makes sense that it wasn’t mentioned.

      Hjson is similar to JSON5, but with more permissive syntax. Like YAML, Hjson doesn’t require quotes around strings, and allows newlines to substitute for commas. It has finished implementations for eight languages.
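
      For instance, a small Hjson document of my own might look like this (illustrative only):

        {
          # comments are allowed
          title: a string without quotes
          tags: [
            one
            two
          ]
        }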

      Personally, Hjson’s non-explicit syntax scares me. I would rather use JSON5.