Baffling to me that anyone would think “it has an RFC” makes something uncool. Corporate formats wish they were cool enough to have an RFC.
Something bugs me quite a bit about this comparison: very little space is dedicated to comparing the actual formats from first principles. It’s almost 100% a look at the derived artifacts: size and format of the spec, historical circumstances leading to creation, popularity, benchmarks. The closest thing to an analysis of fundamentals is itself second-order: a reference to someone else’s summary opinion on Hacker News.
Now, the derived stuff is hugely important, especially for serialization formats, which sit on the interoperability boundary, but it still feels very wrong to not look at them in the context of fundamentals. From the writing style, it does seem that the author knows what they are doing, and I guess I should update in the direction of CBOR a bit, but, still, I am surprised just how little I was able to extract from the article in terms of what a good serialization format should look like.
I really, really wanted to include more examples, but I had trouble justifying spending so much time on a “sidequest”. I’m hoping to include a deeper dive in my flat scraps documentation.
I’ve got some history in this domain, having created the Fleece format for Couchbase Lite. My priorities with Fleece were (1) fast parsing, since it’s used to store database records that are read much more often than being written, and (2) compactness, especially when storing many documents with similar schemas.
I don’t think I had seen MsgPack or CBOR at the time I designed Fleece in 2015. I’d looked at various other formats like BSON and realized that most of the expense of deserialization wasn’t from parsing the data, rather from allocating the DOM objects. So I made Fleece to not require any memory allocation: the DOM objects are simply pointers into the encoded data. (Cap’n Proto and FlexBuffers share this property.) Additionally, there are no O(n) algorithms accessing data: strings have a length prefix, array items are fixed-size, and object keys are stored as sorted arrays that can be binary-searched.
(The entire Fleece library has gotten pretty big, but most of that is for auxiliary features, as well as some heavy optimization. A basic codec is pretty easy to implement.)
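To make the lookup idea above concrete, here is a minimal sketch of binary-searching sorted keys directly inside an encoded buffer. It is not the actual Fleece wire format: the layout (a `u32` count, fixed-size `(key_offset, value_offset)` entries, `u16` length-prefixed strings) and the names `encode`, `lookup`, and `_string_at` are all assumptions made up for illustration.

```python
import struct

# Hypothetical layout (NOT the real Fleece encoding):
#   u32 count
#   count * (u32 key_offset, u32 value_offset)   fixed-size, sorted by key bytes
#   strings: u16 length prefix followed by the raw bytes

def encode(d: dict[str, str]) -> bytes:
    items = sorted((k.encode(), v.encode()) for k, v in d.items())
    header_size = 4 + 8 * len(items)
    blob = bytearray()
    entries = []
    for k, v in items:
        k_off = header_size + len(blob)
        blob += struct.pack("<H", len(k)) + k
        v_off = header_size + len(blob)
        blob += struct.pack("<H", len(v)) + v
        entries.append((k_off, v_off))
    out = bytearray(struct.pack("<I", len(items)))
    for k_off, v_off in entries:
        out += struct.pack("<II", k_off, v_off)
    return bytes(out + blob)

def _string_at(buf: memoryview, off: int) -> memoryview:
    # The length prefix means we never scan for a terminator.
    (n,) = struct.unpack_from("<H", buf, off)
    return buf[off + 2 : off + 2 + n]  # a view into buf, not a copy

def lookup(buf: memoryview, key: bytes):
    # Binary search over the fixed-size entry table: O(log n) probes,
    # and no tree of decoded objects is built up front.
    (count,) = struct.unpack_from("<I", buf, 0)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        k_off, v_off = struct.unpack_from("<II", buf, 4 + 8 * mid)
        k = _string_at(buf, k_off)
        if k == key:
            return _string_at(buf, v_off)
        # memoryview has no ordering, so this comparison copies the key;
        # a C or Rust version would compare the bytes in place.
        if bytes(k) < key:
            lo = mid + 1
        else:
            hi = mid
    return None
```

Usage would look like `lookup(memoryview(encode({"kind": "format"})), b"kind")`, which returns a view into the original buffer. Python still creates small temporary objects, so this only illustrates the access pattern, not the zero-allocation property itself.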
That’s cool! Fleece was a big inspiration for twine. In fact I’d describe twine as halfway between CBOR and Fleece :-)
it’s hard to beat a zero-allocation format + gzip for size
delta encoding looks useful
Side thread: personally I don’t feel like GitHub stars are a good metric for the popularity of a project. What do you think?
I don’t have a better way to estimate project popularity; I’m just saying that GitHub stars don’t seem useful to me. In about 16 years of using GitHub I have starred fewer than 30 projects, but I’ve probably used ten times as many GitHub projects (probably many more). Looks like I just don’t star projects :-)
And there might actually be a bias in the star counts, in that some projects attract users that are more likely to hand out stars.
What makes you give a star to a Github project? Do you give stars for any project that sounds interesting, or any project that you use, or any project that you feel exceptionally thankful for?
Agreed, they are pretty useless as a metric for anything. I think they mostly measure “how much reach on social media/reddit/HN/…. has this ever gotten” in many cases, and that’s not informative of anything. (I personally star a lot, but really treat it as a bookmark along the lines of “looks vaguely interesting from a brief skim”; it’s not an endorsement in any way.)
I’m pretty sure I’ve never starred a project on GitHub, or at least I haven’t in the past decade, and I don’t know why anyone would! It’s an odd survival of GitHub’s launch era, when “everything has to be social” was the VC meme of the moment and “social open source” was GitHub’s play.
I find it useful as a bookmark, a way to search a curated portion of GitHub later on.
I use it as a “read later” flag when I see a link to a project there but don’t have time to fully consider it in the moment.
Note that I also tried to use Google Trends, but both keywords fell under the threshold for tracking over time!
You can also compare download count from package managers like NPM, but I didn’t have an easy way to do that for so many libraries
I don’t get why popularity is so important here. Isn’t it effectively an implementation detail of your language? Even if I’m misunderstanding that and it is not, isn’t the more important question “are there good implementations”, not “are the implementations more popular than the ones for the other thing”?
One huge use-case for my language is sending and receiving and storing programs, so yes, it’s an implementation detail, but it’s also a very important one that will be impossible to change later.
But you’re totally right – that is the main question. I’m still exploring the space of serialization candidates, and these two particularly stood out to me.
I mostly care about popularity because convincing people to trust/adopt technology is way harder than actually implementing it. Extending a trusted format seems less risky than other options
Author should probably actually read/implement/use both to see which is better, not only research what other people say…
I haven’t implemented them from scratch, but I’ve used them extensively while designing scrapscript and building smel shell :) In this writeup, I tried to balance my opinions with existing discussions
Fair; sorry if I was a bit snippy. The writeup just seemed to involve a lot of looking at what other people say/think and I expected more of your own thoughts.
Another possibility is Preserves (https://preserves.dev/), which has some nice properties that MessagePack and CBOR don’t: it has one canonical, signable form, and has both a text and binary representation.
I decided to use ASN.1/DER in a recent project because I was feeling old school lol
ASN.1 has a gigantic specification. It looks like it was written in a totally different era. (I would say the “pre-GitHub” era.) It looks like it was written in an era when programming had nothing in common with fun. It looks like it was written by managers, not by programmers, and especially not by programmers who love their work. This spec scared me away completely, and I will never consider ASN.1 for my projects.
And proprietary ASN.1 compilers are another argument against ASN.1
Yeah I mean it certainly was written in a different era (the 80s, I believe). I quite like it though. And DER (one of the main encodings for ASN.1 messages) is actually very simple FWIW. But sure I mean I doubt anyone’s gonna come along and pressure you to use it in your projects (though you may already be using it without realising if you’re using GSM, PKCS, Kerberos, etc). I for one find it quite satisfying to use the same serialisation/deserialisation paradigm being utilised in other parts of the stack but that’s just me haha
It was written by telecoms companies in the early 80’s, before TCP/IP created the internet and made most other networking technologies obsolete. Not just the pre-Github era, but the pre-PC and pre-internet era. Lots of stuff from the 50’s and 60’s looks like that; to some extent it wasn’t until the late 60’s and 70’s that “programmers who love their work” became a powerful technical force outside of universities. You aren’t allowed to have fun if you work for a government or giant company on machinery that costs more than your entire life does.
It’s been a long time since I’ve looked at ASN.1 outside of X.509 but you inspired me to go look. I’m pretty surprised to see that there really isn’t much that Protobufs do that ASN.1/DER does not. Varints being an easy example of something that Protobufs do but… wow are they similar!
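For anyone who hasn’t looked at varints: below is a minimal sketch of the base-128 scheme Protobuf uses for unsigned integers (7 payload bits per byte, least-significant group first, high bit set on every byte except the last). The function names `encode_varint` and `decode_varint` are mine, not from any Protobuf library. For contrast, a DER INTEGER is a tag byte, a length, and then the value in big-endian two’s complement.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles unsigned values")
    out = bytearray()
    while True:
        byte = n & 0x7F          # take the low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)

def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7
```

Small values take a single byte and the cost grows by one byte per 7 bits, which is the trade-off against a fixed length prefix.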
Yeah they really do cover a lot of the same ground! But then with ASN.1 it’s so much more standardised, I feel like it’s massively underrated. I also find ASN.1’s ability to be encoded in a bunch of different ways depending on what’s appropriate for the situation-at-hand super cool, and it’s something that Protobuf doesn’t really do :)
Did you use an ASN.1 compiler? Was it a commercial one you already had access to, or is there a FOSS one you like?
Yeah I’m using Erlang/OTP’s built-in ASN.1 compiler, which is very nice. On the client side (a Vala/GTK app) I’m currently just deserialising directly from the DER right now, but I’ll probably switch to using a proper ASN.1 compiler at some point (or maybe I’ll make my own limited one for Vala, who doesn’t love a good yak shave). I’ve used asn1c and asn1scc before too, both have worked well for me.
Have you looked at https://github.com/vshymanskyy/muon ? While it is not popular, I think its simplicity makes it very cool.
Oh this looks VERY cool. It might make sense to fork this and combine it with Max’s serializer. Thanks!
This is a weird article. The author compares two relatively recent serialization protocols without mentioning any of the more common ones, then uses criteria like how “boring” the name sounds, how many votes it got in a popularity contest, and whether one of the people involved has a funny last name.
Nice blog, please timestamp the posts :-)
@surprisetalk , I hope this comment will be very useful to you.
I want a format that is typed (similarly to protobuf) but embeds its type in the header.
Also (assuming it exists) I want to write an implementation, so I have a requirement: the format should be well-designed and should have a clear spec.
I spent a lot of time searching for such a format and found exactly one popular format: Avro (more precisely: the Avro object container file). Unfortunately, its spec is awful: https://www.mail-archive.com/user@avro.apache.org/msg04592.html (this is a list of 20-30 spec problems written by me). (These problems matter, because I wanted to write my own Avro library.) (It is possible that the situation has improved since I wrote that mail.)
So, it seems the format I want simply doesn’t exist.
So, I plan to create one in the future. (This may be in the very distant future, say, 10 years. Or simply never.)
Of course, feel free to steal this idea.
Also: data-together-with-its-own-type is essentially an instance of a sigma type, i.e. an existential type. (This is what such a type is called in PLT [programming language theory].) So, an Avro object container file is an instance of one particular sigma type (which is always the same for all Avro files). In other words, it is possible to create an Avro library in some dependently typed language (Rocq, Lean, Idris, Agda), and that library will be very natural, as opposed to libraries in “normal” languages, such as C++ or Rust or JavaScript. You will be able to represent every Avro file as an instance of a single sigma type. And you will be able to manipulate Avro values very naturally. (In a way impossible in all other languages, both typed and untyped.)
Unfortunately, the Avro community seems to be totally unaware of this. It seems the Avro community and the PLT community are absolutely disconnected. (And, of course, there is no Avro library for Rocq/Lean/Idris/Agda.)
Avro seems like enterprise-ish stuff, and Avro people seem not to care about all this PLT. It is possible I’m the first to notice this link between Avro and dependently typed languages.
Anyway, I think that a new format for typed-data-with-its-own-type should be created. Hopefully I will create such a format within 10 years. (Well, I have a prototype written in C++.)
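The sigma-type observation above can be sketched in a few lines of Lean 4. This is a toy under stated assumptions: `Schema` is a made-up three-case schema language, not Avro’s real one, and `Schema.denote`, `SelfDescribing`, `sample`, and `schemaOf` are hypothetical names for illustration only.

```lean
-- A toy schema language (hypothetical, far smaller than Avro's).
inductive Schema where
  | int
  | str
  | pair (a b : Schema)

-- Interpret a schema as the type of values it describes.
abbrev Schema.denote : Schema → Type
  | .int => Int
  | .str => String
  | .pair a b => a.denote × b.denote

-- "Data together with its own type": a schema paired with a value
-- whose type is computed from that schema, i.e. a sigma type.
abbrev SelfDescribing : Type := (s : Schema) × s.denote

-- One self-describing value: the schema in the first component
-- dictates the type of the second.
def sample : SelfDescribing :=
  ⟨.pair .int .str, ((42 : Int), "hi")⟩

-- Reading the embedded schema back is just the first projection.
def schemaOf (v : SelfDescribing) : Schema := v.1
```

Every value of `SelfDescribing` carries its schema, so a reader can recover the payload’s type from the value itself, which is the property wanted from a header-embedded schema.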
Ask me any questions
Here is a text about why type-safe formats are a good thing: “type-safe Web App Nirvana” ( https://web.archive.org/web/20220520015925/https://www.improbable.io/blog/grpc-web-moving-past-restjson-towards-type-safe-web-apis )
I’m assuming rust# is a typo and not a proper dotnet backend for rust that’s stable enough people are building on it. (kinda got my hopes up there for a minute.)
Where can I find the spec(s)?