Would be interesting to see how it stacks up to a fairly diverse benchmark. The analysis is fairly impressive.
That’s great, though these datasets seem to be largely text/string heavy (also in their keys). While there are some numeric ones, a lot of them are arrays or complex structures. That is rarely the case if one, for example, serializes a stream of metrics. I mostly mention it because, not too long ago, I compared serialization formats after gzip for metrics data.
But overall there aren’t many surprises if one takes even a brief look at how serialization formats work. So if one really does need to care, and doesn’t just assume things, it’s pretty easy to find a fitting format. That was actually a side goal of the analysis: finding out whether there’s something unexpected.
Anyway, that repo seems to make it relatively easy to add new data and new formats, so it should be easy to compare it against your own data.
On a related note: some time ago, when a lot of projects were switching to msgpack, it struck me as odd when a co-worker decided to replace JSON with msgpack. In that scenario it not only meant an additional dependency and slower code (because calling into C was slow), but it also resulted in considerably more data, simply because the payload was arrays of short decimal numbers, for example [9.0,2.3,1.3]. On top of that it was, of course, complete premature “optimization”, in a part of the code that likely would never become the bottleneck.
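To make that concrete, here is a rough back-of-the-envelope comparison. The binary size is computed from msgpack’s spec (a float64 is a one-byte tag plus an 8-byte IEEE-754 double, plus a one-byte array header), so no msgpack library is needed; the numbers are illustrative:

```python
import json
import struct

values = [9.0, 2.3, 1.3]  # short decimal numbers, as in the metrics stream

json_bytes = json.dumps(values, separators=(",", ":")).encode()
# msgpack encodes each float as a 1-byte type tag plus an 8-byte IEEE-754
# double, preceded by a 1-byte fixarray header; mimic that size with struct:
binary_size = 1 + len(values) * (1 + struct.calcsize("d"))

print(len(json_bytes))  # → 13, i.e. b"[9.0,2.3,1.3]"
print(binary_size)      # → 28
```

So for short decimals the “compact” binary encoding more than doubles the payload.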
In other words: get a basic grasp of what is actually happening. msgpack is even nice enough to provide a tool on its website to compare the two.
Since msgpack was relatively new back then, I think it made a perfect example of jumping on the cool new thing because X uses it and it’s supposed to be better. Here the effect is just very clearly visible. The more complex something is, the harder it is to say whether it’s really an improvement, especially for solutions meant to deal with complexity, be it web frameworks, workload/container orchestrators, etc. I think time spent figuring out if and what you need, and getting a good and especially complete picture of whether it improves the thing you want to improve, is usually time well spent. Just don’t forget that “we change nothing” can be a valid outcome.
Wow, what a nice write-up! Just taking a glance, it reflects my own experience with some of the formats.
The graph that compares the various formats after gzip is kind of a killer - it seems that compression makes the differences between the various formats more-or-less irrelevant, at least from the perspective of size. For applications where size is the most important parameter and a binary format is acceptable, I think I might just tend to prefer gzipped JSON and feel happy that I probably won’t need any additional libraries to parse it. If I got concerned about speed of serialization and deserialization I’d probably just resort to dumping structs out of memory like a barbarian.
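For reference, the gzipped-JSON approach is about two lines of standard library; the document shape below is made up for illustration:

```python
import gzip
import json

# A document with repetitive keys -- the kind gzip handles well.
doc = {"metrics": [{"name": f"sensor_{i}", "value": i * 0.5} for i in range(100)]}

raw = json.dumps(doc).encode()
packed = gzip.compress(raw)

# The repeated key names compress away almost entirely:
print(len(packed) < len(raw))  # → True
```

No extra dependencies, and the receiving end almost certainly already has both gzip and a JSON parser.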
The main issue with gzipped documents is that you must unpack them before use. That hurts badly if you need to run a query against a big document (or a set of documents). I recommend reading the BSON and UBJSON specs, which explain this in detail.
The syntax you’ve chosen requires a linear scan of the document in order to query it, too, so it’d be a question of constant-factor rather than algorithmic improvement, I think?
constant factors matter!
That makes sense; I had been blinded to that use case by my current context. I’ve been looking at these kinds of formats recently for serializing a network configuration for an IoT device, to be transmitted with ggwave. If I were working on big documents where I needed to pick something out mid-stream, I could definitely see wanting something like this.
I’m currently doing a similar thing, but for OTA upgrades of IoT devices: packing multiple files into Muon and compressing it using Heatshrink.
It’s then unpacked and applied on the fly on the device end.
I think I’ll also publish it in the coming weeks.
You can stream gzip decompression just fine. Not all queries against all document structures can be made without holding the whole document in memory but for a lot of applications it’s fine.
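A sketch of streaming gzip decompression using zlib’s incremental decompressor (the 64-byte chunk size is arbitrary, and a real stream would come off a socket or file rather than a BytesIO):

```python
import gzip
import io
import json
import zlib

payload = gzip.compress(json.dumps({"k": list(range(1000))}).encode())

# wbits=16+MAX_WBITS tells zlib to expect gzip framing.
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
out = bytearray()
stream = io.BytesIO(payload)
while chunk := stream.read(64):
    # Feed the stream in small pieces instead of decompressing all at once.
    out.extend(d.decompress(chunk))
out.extend(d.flush())

print(json.loads(bytes(out))["k"][:3])  # → [0, 1, 2]
```

Whether the *query* can also be streamed then depends on the document structure, as noted above.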
Yeah, JSON+gzip was so effective at $job that we stopped optimizing there. Another downside not mentioned by the other replies, though: gzip can require a “lot” (64KB) of memory for its dictionary, so, for example, you couldn’t use it on an Arduino.
BTW, you can use Heatshrink compression on an Arduino (as I currently do).
The main advantage of a binary format isn’t size, but speed and the ability to store bytestrings.
I have encountered similar situations and thought the same, but one place this was relevant was ingesting data in WASM. Avoiding serde in Rust, or JSON handling in another language, makes for significantly smaller compiles. Something like CBOR has been a good fit since it is easy/small to parse, but this looks interesting as well.
When a disease has many treatments, it has no cure.
I think the origin of that saying was referring to depression, but it really applies wonderfully to serialization!
The “implementation” slide is really nice - a pageful of code!
Link to the GitHub repo: https://github.com/vshymanskyy/muon
I have become something of a stuck record about this, but syntax is boring: semantics is where it’s at. When are two values (denoted by syntax) the same? When are they different? For example: Is B0 00 the same as A0 the same as B1 00 00? Is A1 the same as B9 00 00 80 3F? When are two dictionaries the same? (Are duplicate keys permitted? Is ordering of keys important?) Is +Inf encoded AE the same as +Inf encoded using tags B8, B9 or BA?
Aside from equivalences, I have other questions: Can I represent pure binary data that isn’t a string? What is a “tag” (bytes 8A-8F, FF)? What is an “attr”? Why does a typed array end in 00? What happens if a constrained system with a short LRU cache is presented with a document using a large LRU index?
Hence my work on Preserves.
Thanks for those excellent questions!
Equivalences are there on purpose. You can choose between a fast or a small representation, for example. Also, some representations are not available in TypedArrays.
A typed array ends in 00 because that denotes an empty chunk (note that chunking is allowed here).
Muon allows adding tags (see the GitHub repo) with additional info about object sizes inside the document, to enable efficient queries (the parser can entirely skip uninteresting parts).
LRU size is an application-specific detail, but it can also be explicitly encoded in the document if needed.
From what I can tell, this encoding is not bijective. I know it’s not a terribly important thing to ask, but I do wish it had that property. Otherwise, this looks very nice!
Do you mean that it should be free of equivalent representations?
Yeah, it means that. It also means that every value has exactly one representation, and every representation decodes in only one way. Right now I’m using bencode, which is a very nice serialization format: it’s great for a) binary data and b) being bijective. One nice side effect, which I think is how it’s used in BitTorrent, is that you can encode an object, take its digest, and compare digests to know whether you have the same thing.
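A minimal bencode encoder sketch (not a full implementation; it only handles ints, bytes, lists, and dicts) showing the digest-comparison trick. Because dict keys are sorted during encoding, equal objects always produce identical bytes:

```python
import hashlib

def bencode(v):
    # Minimal bencode: i<int>e, <len>:<bytes>, l...e, d...e with sorted keys.
    if isinstance(v, int):
        return b"i%de" % v
    if isinstance(v, bytes):
        return b"%d:%s" % (len(v), v)
    if isinstance(v, list):
        return b"l" + b"".join(bencode(x) for x in v) + b"e"
    if isinstance(v, dict):
        # Sorting the keys is what makes the encoding canonical.
        return b"d" + b"".join(bencode(k) + bencode(x)
                               for k, x in sorted(v.items())) + b"e"
    raise TypeError(f"unsupported type: {type(v)}")

a = {b"b": 2, b"a": [1, 2]}
b = {b"a": [1, 2], b"b": 2}  # same mapping, different insertion order

da = hashlib.sha1(bencode(a)).hexdigest()
db = hashlib.sha1(bencode(b)).hexdigest()
print(da == db)  # → True
```

That digest equality is exactly the property a non-bijective format cannot give you without a separate canonicalization step.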
But it’s also not true for JSON, i.e. {"a":1,"b":2} is the same as {"b":2,"a":1}.
Actually JSON doesn’t specify one way or the other whether those two documents are the same.
Oh yeah, definitely not true for JSON. I mean, look at how many json libraries offer to sort keys for you, and you can see that people want to use JSON that way (I guess mainly for things like caching where you want the json to encode the same way twice)
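For example, with Python’s standard json module:

```python
import json

a = {"b": 2, "a": 1}
b = {"a": 1, "b": 2}

# By default, serialization follows insertion order, so equal dicts
# can produce different strings:
print(json.dumps(a) == json.dumps(b))  # → False

# sort_keys=True canonicalizes the key order, so the two encode
# identically -- and hash identically, which is what caching wants:
print(json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True))  # → True
```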
I think an additional thing to consider is the evolution of the semantics, and the ability to reason about contextual equivalence and to pick representatives of equivalence classes. Yes, I am also working on this kind of stuff. My design is finally stable; the implementation hit a snag when my devices were stolen, but it is finally crawling ahead again. Mostly there is just chaos in written form, but I am very willing to explain and discuss, and the chaos will get sorted out (it has already been sorted out in my head, at least to a sufficient extent to have confidence in my roadmap).
This seems sensible enough. I do note there seems to be a discrepancy between the presentation and the reference implementation: the presentation gives special values for the float special cases NaN, +Inf, and -Inf, but a scan of the source code seems to indicate that these values are neither parsed nor generated by muon.py.
The implementation is incomplete; it was mainly designed to compare Muon with JSON, and JSON lacks ±Inf and NaN.
Using unused UTF-8 bytes is a nice idea I haven’t seen before, and the implementation is admirably succinct.
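The trick can be verified by brute force: encode every codepoint and see which byte values never occur in well-formed UTF-8 (this is just a demonstration, not part of the Muon implementation):

```python
# Collect every byte that appears in the UTF-8 encoding of any codepoint;
# whatever is left over is free for a binary format to use as type tags.
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:  # surrogates are not encodable in UTF-8
        continue
    used.update(chr(cp).encode("utf-8"))

free = sorted(set(range(256)) - used)
print([hex(b) for b in free])  # 0xC0, 0xC1 and 0xF5-0xFF: 13 free bytes
```

Those 13 byte values can be embedded in a byte stream as structural markers while plain UTF-8 text passes through unescaped.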
Question: what is meant by “zero-copy” in this context?
In Muon, it should be possible to encode data in such a way that you can access it in-place, without copying it to a different memory location.
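As an illustration of the general idea (the byte layout below is hypothetical, not Muon’s actual encoding): Python’s memoryview exposes a slice of a buffer without copying the bytes, which is the access pattern “zero-copy” refers to:

```python
# Hypothetical length-prefixed record: one tag byte, then the payload "data".
buf = bytearray(b"\x84data")

# Take a view of the payload; no bytes are copied.
view = memoryview(buf)[1:5]

# Same-size mutation of the underlying buffer is visible through the view,
# proving the view aliases the original memory rather than holding a copy.
buf[1:5] = b"DATA"
print(view.tobytes())  # → b'DATA'
```

A zero-copy decoder hands back such views into the received buffer, so large strings and byte arrays never need a second allocation.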
I wonder how it compares to other machine-readable formats like Protobuf? See also this page: https://wiki.alopex.li/BetterThanJson
I think (based on the spec at json.org) that JSON strings can have the zero codepoint but I don’t think uON allows that…
It does. For this specific case the encoder needs to use the fixed-length string (prefix 0x82).
This is awesome! Last week I was futzing around with a subset of CBOR for a very targeted use case, but I think I might check this out further and potentially adopt it!
Let me know if I can help. Feel free to create a ticket on the Muon repo.
Unlike ASN.1, I understand it. But how do they stack up?
I did not compare it with ASN.1, but Muon is obviously simpler 😁
For really large JSON docs with many repetitive keys and short values, I usually take an optimization step of storing the keys and values separately. Isn’t this Muon just another form of bencode?
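The key/value-splitting idea can be sketched like this (the field names are made up):

```python
import json

# 100 rows that all repeat the same key names.
rows = [{"host": "a", "cpu": 0.5}, {"host": "b", "cpu": 0.7}] * 50

# Store the keys once and the rows as bare value tuples.
keys = list(rows[0])
columnar = {"keys": keys, "rows": [[r[k] for k in keys] for r in rows]}

# The rewritten document no longer repeats "host" and "cpu" per row:
print(len(json.dumps(columnar)) < len(json.dumps(rows)))  # → True
```

gzip recovers much of the same win automatically, but the split form stays small even uncompressed and is cheaper to parse.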
Yes, the purpose is similar to bencode, but Muon is more flexible and efficient