1. 58
  1. 13

    MessagePack is also used by saltpack, a modern crypto messaging format.

    1. 2

      Thanks for the link, that’s cool!

    2. 10

      Msgpack has the same flaw that a lot of other binary-JSON formats do: no random access. Since array and object items are variable length, the only way to find item N is to parse and skip items 0…N-1. Recursively. This means that for most practical purposes, you need to convert the whole thing into a native collection tree (Array, HashMap, NSDictionary, etc.) This is many times more expensive than the actual data parsing since it involves allocating memory and constructing objects.
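      To make the skip cost concrete, here’s a toy sketch (my own code, covering only a few of msgpack’s single-byte ‘fix’ tags) of locating element N in a msgpack array — reaching it means walking, recursively, over everything before it:

```python
# Finding element N in a msgpack array is O(N): every earlier element
# must be parsed and skipped, recursively. Toy subset of the tag space.

def skip(buf: bytes, pos: int) -> int:
    """Return the position just past the value starting at pos."""
    tag = buf[pos]
    if tag <= 0x7f:                  # positive fixint: one byte
        return pos + 1
    if 0xa0 <= tag <= 0xbf:          # fixstr: length in the low 5 bits
        return pos + 1 + (tag & 0x1f)
    if 0x90 <= tag <= 0x9f:          # fixarray: skip each element in turn
        pos += 1
        for _ in range(tag & 0x0f):
            pos = skip(buf, pos)
        return pos
    raise ValueError(f"tag {tag:#x} not handled in this sketch")

def array_element(buf: bytes, n: int) -> int:
    """Offset of element n of a fixarray starting at offset 0 -- O(n)."""
    pos = 1                          # step past the fixarray tag
    for _ in range(n):
        pos = skip(buf, pos)
    return pos

# ["hi", [1, 2], 5], encoded by hand:
data = bytes([0x93, 0xa2]) + b"hi" + bytes([0x92, 0x01, 0x02, 0x05])
print(array_element(data, 2))  # offset 7 -- had to walk past "hi" and [1, 2]
```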

      The two formats I know of that avoid this are Google’s FlexBuffers and my own (well, Couchbase’s) Fleece. They both use internal ‘pointers’ (offsets) to create fixed-width arrays and key/value mappings. This allows them to avoid creating objects; instead you can just work directly with interior pointers in the data, which has almost zero overhead.
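      The offset trick can be sketched in a few lines. This is not the actual FlexBuffers or Fleece layout, just the general idea of a fixed-width offset table making element lookup O(1):

```python
# Idea sketch (NOT the real FlexBuffers/Fleece wire layout): if the array
# carries a table of fixed-width offsets, element N is one index read away.
import struct

def build(values: list[bytes]) -> bytes:
    """Payloads first, then a table of 4-byte offsets, then the count."""
    payload = b"".join(values)
    offsets, pos = [], 0
    for v in values:
        offsets.append(pos)
        pos += len(v)
    table = b"".join(struct.pack("<I", o) for o in offsets)
    return payload + table + struct.pack("<I", len(values))

def element(buf: bytes, n: int) -> int:
    """O(1): read offset n straight out of the table at the end."""
    count = struct.unpack_from("<I", buf, len(buf) - 4)[0]
    table = len(buf) - 4 - 4 * count
    return struct.unpack_from("<I", buf, table + 4 * n)[0]

buf = build([b"aa", b"bbb", b"c"])
print(element(buf, 2))  # 5 -- no skipping over the earlier elements
```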

      1. 3

        It obviously won’t be as fast as a format you can randomly access, but I wrote a CBOR parser that is stack/recursion based with no mem usage: https://github.com/quartzjer/cb0r/blob/master/src/cb0r.c

        It’s primarily for constrained embedded use-cases where CBOR is an excellent fit.

      2. 9

        Something to keep in mind when talking about performance is that JSON parsing is very, very fast in browsers, and would be hard to beat for that use case. It would be interesting to compare it with a WASM MsgPack decoder.

        1. 5

          MessagePack has a couple benefits over JSON even if the parsing was not as fast. Constructed properly, it’s smaller than even compressed JSON, and it’s substantially better for streaming because it has length-prefixed sections (length in terms of number of items, not in terms of bytes) – in other words, it is possible to process and generate streaming msgpack without much caching and without storing much state.

          Unfortunately, most implementations don’t actually support streaming. But bitbanging msgpack is almost as straightforward as constructing JSON with string manipulation, and implementing your own msgpack parser is a hell of a lot easier than parsing JSON, so needing to roll your own msgpack implementation (while it may look intimidating) is no barrier even to a novice developer.
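          As a rough illustration of how approachable bitbanging it is, here’s a toy encoder (mine, not any official implementation) covering only the single-byte ‘fix’ tags plus nil/bool — real data also needs the wider uint/str/bin/array/map tags:

```python
def pack(v) -> bytes:
    """Toy msgpack encoder: nil, bool, small ints, short str,
    lists and dicts with fewer than 16 items."""
    if v is None:
        return b"\xc0"                         # nil
    if v is True:
        return b"\xc3"                         # true
    if v is False:
        return b"\xc2"                         # false
    if isinstance(v, int) and 0 <= v <= 0x7f:
        return bytes([v])                      # positive fixint
    if isinstance(v, str) and len(v.encode()) < 32:
        b = v.encode()
        return bytes([0xa0 | len(b)]) + b      # fixstr
    if isinstance(v, list) and len(v) < 16:    # fixarray
        return bytes([0x90 | len(v)]) + b"".join(pack(x) for x in v)
    if isinstance(v, dict) and len(v) < 16:    # fixmap: key/value pairs
        return bytes([0x80 | len(v)]) + b"".join(
            pack(k) + pack(x) for k, x in v.items())
    raise ValueError("outside the fixed-tag subset this sketch covers")

print(pack({"compact": True, "schema": 0}).hex())
```

A real encoder is mostly more of the same: pick the smallest tag that fits the value, write the tag, write the bytes.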

          1. 1

            Is this because it can be loaded in as a native object without writing a parser for it in JavaScript? That comes with its own problems

            1. 2

              No need to load it as a native object. JavaScript has had a native parser API for ages (JSON.parse(), since IE8).

              1. 1

                If you trust the source, the javascript parser is (much) faster.

                Yes, I am aware of how ridiculous that sounds.

                No, I don’t know why.

                Yes, I have benchmarked it.

                No, I don’t still have the results.

                1. 2

                  On what JavaScript VM, for what data size, on which platforms did you find this to be true?

                  Chrome recommends embedding large constants as JSON to speed up parsing executable JS: https://v8.dev/blog/cost-of-javascript-2019#json

                  But, they’re discussing JS parsing at app boot time, not deserialization.

                  1. 1

                    This was ~10 years back, so my memory is imperfect, but we were primarily targeting Firefox and also testing IE. A megabyte or so of JSON, with lots of multi-kilobyte string values.

          2. 8

            The way msgpack is encoded to binary is beautiful, imho. It’s a simple format and is more suitable than json for transferring or serializing non trivial data (in part because you can just put blobs in there without need to escape them or base64-encode them).

            1. 5

              Possibly dumb question: is this different from the serialization that gRPC uses?

              1. 8

                Yes: protobufs require a schema to use, while MessagePack and JSON are schemaless (or schema-optional).

                After using protobufs for a long time, and seeing how people use them now, an optional schema seems attractive. They have apparently evolved awkwardly into dynamic / reflective use: gRPC has a reflection service, which seems like a workaround for the “schema distribution problem”.

                1. 4

                  The schema distribution problem has been pretty much solved, for us, by using a shared protobuf git repo. Go can load that as a dependency without much ceremony, and other languages can use it as a submodule.

                  1. 4

                    Yeah that’s basically how they were designed to be used originally. They’re good for internal use, when you control both sides of the wire.

                    When you don’t control both sides of the wire it doesn’t really make sense IMO. That’s especially true when you’re talking between two different organizations, but sometimes it happens within the same company / org.

                2. 2

                  Yes, gRPC uses Protobufs. Broadly similar idea, different system.

                3. 4

                  I always wonder why MsgPack took off over similar tools like Cap’n Proto?

                  1. 21

                    they’re different things: msgpack is self-describing and is basically binary JSON, whereas Cap’n Proto comes with an IDL, an RPC system, etc.

                    1. 4

                      That is an excellent question! Could be some kind of worse-is-better thing, due to how compatible MessagePack is with JSON. But then, why not BSON? I would love to see some analysis of usage trends for these new binary serialization protocols, including Protobufs, CBOR, BSON, and Cap’n Proto.

                      1. 9

                        I want to like CBOR; it has a Standards Track RFC (7049), after all. But the standard does define some pretty odd behavior, last I looked: there’s a way to do streaming which, effectively, could require you to allocate forever.

                        I’ve not used Cap’n Proto, but have definitely used Protobuf. I greatly prefer the workflow of using MsgPack, but I do also appreciate the schema enforcement that Protobuf generates for you; it does get in the way early on in dev, though. :/

                        1. 4

                          Funny, it’s the other way for me; I prefer to nail down the schema as early as possible. I thought Cap’n Proto was pretty great compared to JSON, but I might not be as enthusiastic if I were trying to port over some sprawling legacy thing that never had a very well-defined schema in the first place.

                          1. 4

                            I do adhoc, investigative stuff far more often than I do work that goes into production. That being said, the last few projects I’ve had involvement in have started out with defining a schema in protobuf and moving forward that way. Prematurely, in both cases, probably. :)

                          2. 4

                            Yeah, but the streaming is optional and can be used as a ready-made framing format for your whole TCP session instead of inventing an ad-hoc one.

                            Seriously, stop inventing new protocols. Just use newline-separated JSONs or CBOR. Please? Pretty please?

                            1. 5

                              You can stream MsgPack objects back-to-back without any problem at all - it’s just concatenating objects one after the other on the wire, in a file, in a kafka queue, etc. You can do the same thing in CBOR, but CBOR makes it ambiguous if you should wrap the objects inside an indefinite-length array or something, and then says that some clients might not like that.
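                              A sketch of that back-to-back property, using a toy decoder for a few fixed tags — every msgpack value carries its own length, so a loop can peel whole objects off a byte stream with no delimiter and no wrapper array:

```python
def decode(buf: bytes, pos: int = 0):
    """Toy decoder for fixint/fixstr/fixmap; returns (value, next_pos)."""
    tag = buf[pos]
    if tag <= 0x7f:                              # positive fixint
        return tag, pos + 1
    if 0xa0 <= tag <= 0xbf:                      # fixstr
        n = tag & 0x1f
        return buf[pos + 1:pos + 1 + n].decode(), pos + 1 + n
    if 0x80 <= tag <= 0x8f:                      # fixmap
        out, pos = {}, pos + 1
        for _ in range(tag & 0x0f):
            k, pos = decode(buf, pos)
            v, pos = decode(buf, pos)
            out[k] = v
        return out, pos
    raise ValueError(f"tag {tag:#x} not in this sketch")

# {"a": 1} and {"b": 2} concatenated back to back, hand-encoded:
stream = b"\x81\xa1a\x01\x81\xa1b\x02"
pos, objs = 0, []
while pos < len(stream):
    obj, pos = decode(stream, pos)               # self-delimiting
    objs.append(obj)
print(objs)  # [{'a': 1}, {'b': 2}]
```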

                              And this encapsulates the problem with CBOR - it defines a bunch of optional features of dubious value (tags, ‘streaming mode’, optional failure modes) that complicate interoperability and bloat the specification. The MsgPack spec is tiny and unambiguous.

                              It’s really a shame that CBOR was forked from MsgPack and submitted to the IETF against the will of the original authors. Now we have two definitions of essentially the same thing, but one of them is concise, and the other one is an IETF standard.

                              1. 1

                                It might have been better, yes. But you can use strict mode, the standard can be revised and IANA runs a tag registry. In this case, as much as I hate the saying, good is better than better.

                                Without a clear signal “use this” and proper hype some people might consider alternatives. And that will inevitably lead to custom formats. We need an extensible TLV format with approximately JSON semantics to move forward. Not more “key value\n” without escaping or dumping packed data structures to the wire.

                              2. 1

                                > ready-made framing format for your whole TCP session instead of inventing an ad-hoc one.

                                Is this actually true? I assumed the streaming allowed for an arbitrary nesting, say:

                                [ {...},

                                where you really can’t finish reading until that last ], forcing continued growth in allocations. I suppose your implementation could say “an outer array will emit the inner elements via a callback”… then you don’t have to allocate the world. But what if the inner object also uses streaming? Is that a thing that can happen?

                                > Seriously, stop inventing new protocols. Just use newline-separated JSONs or CBOR. Please? Pretty please?

                                Probably don’t want newline-delimited CBOR, or MsgPack. Might I suggest you stick a MsgPack integer before your object, decode that, and then read that many more bytes, avoiding delimiters altogether? Sure was nice back when I was using “framed-msgpack-rpc”…
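                                A sketch of that length-prefix framing, using a plain 4-byte big-endian length for simplicity (a msgpack integer prefix would be variable width — see the sibling question about which notation to use):

```python
import struct

def frame(payload: bytes) -> bytes:
    """Prefix a message with a fixed 4-byte big-endian byte count."""
    return struct.pack(">I", len(payload)) + payload

def read_frames(stream: bytes):
    """Yield each framed payload: read the count, then exactly that
    many bytes -- no delimiter scanning, no escaping."""
    pos = 0
    while pos < len(stream):
        (n,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        yield stream[pos:pos + n]
        pos += n

# Two msgpack payloads ([1, 2] and true), framed back to back:
wire = frame(b"\x92\x01\x02") + frame(b"\xc3")
print([f.hex() for f in read_frames(wire)])  # ['920102', 'c3']
```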

                                1. 2

                                  I believe you can have nested streaming, though I can’t think of a practical use-case outside of continuously streaming “frames” of data. The endless allocation isn’t really endless, it goes until whatever is producing the data marks the end of it. If you’re pipelining your data handling, streaming like this is extremely useful because it allows you to stream large datasets while keeping things stateless/within the same request context. Most implementations don’t support streaming though.

                                  1. 1

                                    > Just use newline-separated JSONs or CBOR. Please? Pretty please?

                                    > Probably don’t want newline delimited CBOR, or MsgPack.

                                    I am not a native English speaker and I thought the comma made it ((newline-separated JSONs) || (CBOR)).

                                    1. 2

                                      I think you did everything right English syntax wise. I messed up reading your intention. Even with your clarification, though, I still think framing MsgPack with the number of bytes in the message is a great idea. :)

                                      1. 1

                                        And do you frame it using its compact integer notation, or fully wrap it in a blob, or do you settle for a uint32be?

                              3. 8

                                I don’t have usage trends, but I wrote up a compare-and-contrast to all these various things a little while ago: https://wiki.alopex.li/BetterThanJson

                                Long story short, MsgPack is intended to be schema-less, or at least schema-optional, like JSON (and also CBOR). Cap’n Proto, like Protobufs, assumes a schema to begin with, which makes it much more complicated, with more tooling attached, and potentially much faster.

                                Also the more I look at CBOR and MsgPack in terms of encoding details the more similar they look to me; they both seem to very obviously share the same lineage. Take a look at the encoding of the example on the MsgPack website and the CBOR version.

                              4. 3

                                Slightly less juvenile name?

                                More seriously though, has MsgPack “taken off”? From what I can see most stuff offered as an API format is JSON. I guess for internal messaging something less fat on the wire is valuable.

                                1. 3

                                  It definitely has some level of adoption. I’ve been using Rocket (Rust web framework) and noticed it’s mentioned in the minor version release notes. Which leads me to assume that MsgPack has at least enough interest for issues with it to get fixed in a non-mainstream framework. Not a super strong basis for this conclusion, but I think we’ll continue to see interest in MsgPack growing.

                                  1. 6

                                    For how old it is, I think it has absolutely not taken off. Also, Rocket is not even Rust-mainstream.

                                    1. 7

                                      Is any web framework Rust mainstream? 😛

                              5. 4

                                Json is human readable and can be edited with an editor which is a big, important feature.

                                1. 2

                                  Depends on your priorities. There may be cases where folks would prefer compression over legibility.

                                  1. 2

                                    CBOR can be round-tripped, i.e. convert CBOR->JSON, edit, then convert JSON->CBOR.

                                    1. 2

                                      There’s nothing in RFC 7049 that outright restricts which keys can be used, so a CBOR map that uses integers as keys in a map can’t be round-tripped through JSON, although the reverse, JSON round-tripped through CBOR, is possible.
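                                      A quick stdlib demonstration of the string-keys-only rule: Python’s json module silently coerces an integer key to a string, so the round trip changes the data.

```python
import json

# JSON object keys must be strings; an integer-keyed map (legal in
# CBOR and msgpack) does not survive a JSON round trip unchanged.
text = json.dumps({1: "one"})    # the int key is silently coerced
print(text)                      # {"1": "one"}
print(json.loads(text))          # {'1': 'one'} -- the key is now a str
```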

                                    2. 1

                                      Right. Msgpack is about as close to human-readable as a binary format can get, while actually having the benefits of a binary format. (BSON is a slightly more direct version of ‘binary json’, and that slight difference is enough to make BSON basically completely pointless.) I regularly sight-read msgpack using nothing but ‘less’, and it’s no harder than reading urlencoded parameter strings.

                                    3. 3

                                      This really reminds me of bencode, the encoding used by torrent files.

                                      1. 1


                                        Came here to say that as well! bencoding is much more limited, though (no UTF-8 support, for example). But it is easier to read than msgpack, as the type identifiers are ASCII chars rather than byte codes.
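                                        For comparison, a toy bencode encoder (following the published format: ASCII markers i/l/d/e plus length-prefixed byte strings) fits in a dozen lines, and its output is legible as-is:

```python
def bencode(v) -> bytes:
    """Toy bencode (torrent-file) encoder for int, bytes, list, dict."""
    if isinstance(v, int):
        return b"i%de" % v                       # i42e
    if isinstance(v, bytes):
        return b"%d:%s" % (len(v), v)            # 3:foo
    if isinstance(v, list):
        return b"l" + b"".join(bencode(x) for x in v) + b"e"
    if isinstance(v, dict):                      # keys are byte strings,
        items = sorted(v.items())                # sorted per the spec
        return b"d" + b"".join(bencode(k) + bencode(x)
                               for k, x in items) + b"e"
    raise TypeError("bencode handles int, bytes, list, dict")

print(bencode({b"spam": [b"a", 1]}))  # b'd4:spaml1:ai1eee'
```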

                                        1. 1

                                          It was invented back when UTF-8 wasn’t that popular. Instead of forcing an encoding, it just stores the raw bytes, whether they’re UTF-8 or not, which makes the decoder easier to implement.

                                        1. 2

                                          I think there are two ‘diseases’ here. One is just a form of bloat, where a given wire format uses considerably more data in transmission, or computation when (de/)serializing, than necessary for a given message set or workload. That can be mitigated in lots of ways; the most common perhaps being compression. MessagePack or CBOR helps solve this.

                                          The other problem is not having a way to tell valid from invalid messages, or on the sending side, to only generate valid ones. When it’s an issue, it’s generally a much deeper issue. Capnp, Protobuf, or ASN.1 can help solve some of it, as well as the bloat problem, but they can’t re-architect a poorly specified application for you. If you’re committed to JSON and don’t have the bloat problem, JSON Schema can also help solve this. But it will probably be painful, no matter how you go about it.

                                          Anyway, there are plenty of comparisons out there, and the Wikipedia page is a good one. I’m more interested in the question of why certain technologies succeed or fail in the market, and I’m especially interested in network protocols, because they exemplify so-called ‘network effects’.

                                          1. 4

                                            Actually there are a couple more, ahh, symptoms of this disease…

                                            As you mention…

                                            • On the wire bloat.
                                            • Is it a valid message? (Parsing)
                                            • Is it a valid message? Conforming to agreed contract? (Schema)
                                            • Message semantics. What do the fields mean?

                                            Some more symptoms…

                                            • Can the parser handle maliciously crafted inputs? (eg. dbus protocol addresses this)
                                            • Forward and backward compatibility? aka Open Closed Extensibility.
                                            • Discoverability. (My favourite) How much work does a Human need to do to see the values in the message? How much more work does he need to do to understand what those values mean?
                                            • Default values. (Only available with a Schema) (Why does my mind always read “schema” as “screamer”?)

                                            I like the approach the CORE WG is taking with CBOR…


                                            … the idea seems to be to have the json like schemaless discoverable CBOR, but with a Yang SID https://tools.ietf.org/html/draft-ietf-core-sid-11 (Schema IDentifier)

                                            So given a blob of CBOR, you can find the SID and lookup online the Schema.

                                            ie. You can see the values even if you don’t have the schema.

                                            AND you can do the validation / semantics / defaults by automagically looking up the schema.

                                            Best of both worlds.

                                        2. 2

                                          This is cool. I only wish there were a nice summary of how much faster and smaller it is in the median case.

                                          1. 1

                                            This link (via @JohnCarter in this thread) has some time and space comparisons. I haven’t dug into the methodology, though.


                                            To my mind it doesn’t look too convincing to use a (semi)binary format.

                                          2. 1

                                            Has anyone used the Python interface for anything serious? One of the things I value about JSON as a serialization format, for all its warts, is that the API is about as simple as it gets. It’s a dict containing pretty much arbitrary simple data types.

                                              It was hard for me to figure out from the API examples whether or not I needed to know what type the packed thing was before unpacking it.

                                              1. 1

                                                MessagePack is used by Neovim to enable extensions in any language.

                                                There has been a proposal [1] to use Cap’n Proto instead, but it seems MessagePack is good enough. Maybe @jmk can chime in ;-)

                                                [1] https://github.com/neovim/neovim/issues/5526

                                                1. 1

                                                  Not sure what the use case for this is, as 18 vs. 27 bytes is not a huge improvement, especially once it’s been gzipped, and you lose the convenience of the JSON format.
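                                                  Those sizes can be checked with the stdlib — assuming the 18 vs. 27 refers to the msgpack.org front-page example object, hand-encoding the msgpack side since the point is only the byte counts:

```python
import gzip, json

doc = {"compact": True, "schema": 0}
as_json = json.dumps(doc, separators=(",", ":")).encode()   # 27 bytes
# The same object hand-encoded as msgpack (fixmap, fixstr, true, fixint):
as_msgpack = (bytes([0x82, 0xa7]) + b"compact" + bytes([0xc3, 0xa6])
              + b"schema" + bytes([0x00]))                  # 18 bytes

print(len(as_json), len(as_msgpack))            # 27 18
# gzip adds fixed overhead, so on tiny payloads both sides grow:
print(len(gzip.compress(as_json)), len(gzip.compress(as_msgpack)))
```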

                                                  1. 2

                                                    I mean, it’s 50% bigger.

                                                    If I proposed a change that added some developer convenience but increased the amount of data our visitors had to download by 50%, I would not get five minutes into the presentation before the looks on my colleagues’ faces told me to stop.

                                                    1. 2

                                                      For this small example, yes, but it really depends on the data. If it’s a JSON object with many large strings in it, the size difference might be well under 50%, and smaller still once gzipped. It also depends on how many times that data is transmitted.

                                                      If I choose an optimised format for whatever app I’m doing, it seems like only a tiny improvement with the big drawback of losing the convenience of JSON.

                                                      But that’s the question - what’s the use case. Maybe for some use case it makes sense but I can’t think of anything obvious.