The headline assertion doesn’t pass the sniff test for me. The article expands its claims a bit:
If you want to filter messages very, very quickly, you can do it much faster if they’re in JSON as opposed to a strongly-typed “binary” format like Protobufs or Avro. […] Because for those formats, you have to parse the whole binary blob into a native data structure, which means looking at all the fields and allocating memory for them. You really want to minimize memory allocation for anything that happens millions of times per second.
JSON, however, allows stream parsing. Consider for example, the Jackson library’s nextToken() or Go’s json.Tokenizer. […] Which means you can skip through each event, only keeping around the fields that you’re interested in filtering on,
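For readers who haven’t used a token-level JSON API, here is a minimal sketch of the approach the quoted passage describes, in Go. (The standard library exposes this as Decoder.Token; there is no json.Tokenizer type.) The event shape and the "level" field are made up for illustration.

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "strings"
    )

    // filterStream walks a JSON array of event objects token by token and
    // reports which events have "level" == "error", without unmarshalling
    // each event into a full struct or map.
    func filterStream(r io.Reader) error {
        dec := json.NewDecoder(r)
        if _, err := dec.Token(); err != nil { // opening '['
            return err
        }
        for dec.More() {
            if _, err := dec.Token(); err != nil { // opening '{'
                return err
            }
            keep := false
            for dec.More() {
                keyTok, err := dec.Token()
                if err != nil {
                    return err
                }
                if key, _ := keyTok.(string); key == "level" {
                    var v string
                    if err := dec.Decode(&v); err != nil {
                        return err
                    }
                    keep = v == "error"
                    continue
                }
                // Uninteresting key: consume the value. RawMessage still
                // buffers the bytes; a real high-throughput filter would
                // walk tokens here instead.
                var skip json.RawMessage
                if err := dec.Decode(&skip); err != nil {
                    return err
                }
            }
            if _, err := dec.Token(); err != nil { // closing '}'
                return err
            }
            fmt.Println("keep:", keep)
        }
        return nil
    }

    func main() {
        events := `[{"level":"error","msg":"a"},{"level":"info","msg":"b"}]`
        if err := filterStream(strings.NewReader(events)); err != nil {
            fmt.Println("error:", err)
        }
    }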
Those of us who’ve been around for a while may remember debating the merits of SAX vs DOM for parsing XML, and I think this is something similar. Tim has confused the API style of some particular parsers (DOM-only) with a property of the underlying data format.
While I can’t speak to Avro, I know that Protobuf supports event-based partial processing. The C++ API’s google::protobuf::io::CodedInputStream is the primary entry point for event-based Protobuf parsing, and it allows skipping over parts of the message that are uninteresting. This functionality is especially useful for processing heterogeneous message types containing a common field (e.g. routing metadata).
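As a rough illustration of that kind of tag-by-tag skipping over the Protobuf wire format, here is a sketch in Go using the protowire package rather than the C++ CodedInputStream API named above; the field numbers and values are invented for the example.

    package main

    import (
        "fmt"

        "google.golang.org/protobuf/encoding/protowire"
    )

    // findField walks the raw wire format and returns the bytes of the first
    // length-delimited field with the given number, skipping everything else
    // without decoding the message into a generated struct.
    func findField(msg []byte, want protowire.Number) ([]byte, bool) {
        for len(msg) > 0 {
            num, typ, n := protowire.ConsumeTag(msg)
            if n < 0 {
                return nil, false
            }
            msg = msg[n:]
            if num == want && typ == protowire.BytesType {
                val, n := protowire.ConsumeBytes(msg)
                if n < 0 {
                    return nil, false
                }
                return val, true
            }
            // Not the field we want: skip its value using only the wire type.
            n = protowire.ConsumeFieldValue(num, typ, msg)
            if n < 0 {
                return nil, false
            }
            msg = msg[n:]
        }
        return nil, false
    }

    func main() {
        // Hand-encoded message: field 1 = "route-a", field 2 = 42.
        var msg []byte
        msg = protowire.AppendTag(msg, 1, protowire.BytesType)
        msg = protowire.AppendString(msg, "route-a")
        msg = protowire.AppendTag(msg, 2, protowire.VarintType)
        msg = protowire.AppendVarint(msg, 42)

        if val, ok := findField(msg, 1); ok {
            fmt.Printf("field 1 = %q\n", val)
        }
    }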
Some other formats go even further. Cap’n Proto, for example, can map the binary representation directly into memory. That means you not only avoid parsing the values, you can skip to both messages and fields without even looking at them (beyond the type/length marker).
What’s actually being claimed is something like: “Formats that require you to decode the whole message into a new allocation will be slower to filter than formats you can read in-place”.
JSON can be made plenty fast for pretty much any need I’ve seen, but I’d love to see a comparison of a JSON object stream filter vs a CANBus stream filter, for instance.
I think JSON’s always gonna be much slower than binary-based formats if you just want to read part of it, because in binary formats strings are length-prefixed while in JSON they are delimited by quotes. If you want to skip a key-value pair in JSON you have to scan the whole stream for the opening and closing quotes, while with a binary format you just skip n chars and save yourself a lot of comparisons.
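A small sketch of the two skip operations being contrasted, with a made-up 4-byte length prefix standing in for the binary side:

    package main

    import (
        "encoding/binary"
        "errors"
        "fmt"
    )

    // skipJSONString scans for the closing quote of a JSON string starting at
    // data[i] (which must be the opening '"'), honouring backslash escapes.
    // Cost is proportional to the string length.
    func skipJSONString(data []byte, i int) (int, error) {
        i++ // move past the opening quote
        for i < len(data) {
            switch data[i] {
            case '\\':
                i += 2 // skip the escaped character
            case '"':
                return i + 1, nil
            default:
                i++
            }
        }
        return 0, errors.New("unterminated string")
    }

    // skipPrefixedString skips a string encoded as a 4-byte little-endian
    // length followed by the bytes (a made-up framing, just for contrast).
    // Cost is constant: read the length, jump.
    func skipPrefixedString(data []byte, i int) (int, error) {
        if i+4 > len(data) {
            return 0, errors.New("truncated length")
        }
        n := int(binary.LittleEndian.Uint32(data[i:]))
        if i+4+n > len(data) {
            return 0, errors.New("truncated string")
        }
        return i + 4 + n, nil
    }

    func main() {
        j := []byte(`"hello \"world\"" 1`)
        next, _ := skipJSONString(j, 0)
        fmt.Println("json: rest starts at", next) // scanned every byte up to here

        b := make([]byte, 4, 9)
        binary.LittleEndian.PutUint32(b, 5)
        b = append(b, "hello"...)
        next, _ = skipPrefixedString(b, 0)
        fmt.Println("binary: rest starts at", next) // jumped in one step
    }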
To be fair, if we look at this seriously and pragmatically, serialisation to a format made to accommodate arbitrary structures is always going to be a mismatch if raw speed is of critical importance. This comparison of formats is silly because what we use them for is always more relevant than the format itself.
If you want raw speed, then create a basic custom format for your data using fixed-length fields. This has always been the faster alternative to serialisation and always will be, because it is fundamentally less complex, even if you use ASCII or any other text encoding.
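As a concrete illustration of the fixed-length idea (the record layout below is invented), every field lives at a known offset, so reading one field is pure offset arithmetic with no parsing step at all:

    package main

    import (
        "encoding/binary"
        "fmt"
    )

    // A made-up fixed-layout record: 8-byte timestamp, 4-byte sensor id,
    // 4-byte reading, all big-endian. Every field sits at a known offset.
    const (
        recordSize   = 16
        offTimestamp = 0
        offSensorID  = 8
        offReading   = 12
    )

    // reading extracts one field from the i-th record without parsing
    // anything else.
    func reading(buf []byte, i int) uint32 {
        rec := buf[i*recordSize : (i+1)*recordSize]
        return binary.BigEndian.Uint32(rec[offReading:])
    }

    func main() {
        // Two records back to back.
        buf := make([]byte, 2*recordSize)
        binary.BigEndian.PutUint64(buf[0*recordSize+offTimestamp:], 1700000000)
        binary.BigEndian.PutUint32(buf[0*recordSize+offSensorID:], 7)
        binary.BigEndian.PutUint32(buf[0*recordSize+offReading:], 42)
        binary.BigEndian.PutUint32(buf[1*recordSize+offReading:], 99)

        fmt.Println(reading(buf, 0), reading(buf, 1)) // 42 99
    }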
That someone goes out of their way to make something like protobuf and tries to sell it as a compatibility format is mind-boggling. JSON is at least intelligible and can be trivially reverse engineered. All programming languages have their own binary serialisation formats. The reason they didn’t push those as a compatibility format among them is that it didn’t make sense.
I see people carrying and versioning around those proto files and committing generated code, and it makes me cringe. It’s like we are back in 2002.
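There are a lot of tools that will unpack protobuf for you, starting with the list at https://stackoverflow.com/questions/6032137/how-to-visualize-data-from-google-protocol-buffer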
But let’s be honest - apart from really trivial examples, we’re not looking at JSON directly either. Almost every time in production I’ll at least be formatting it with jq to see the structure, because we tend to remove the formatting whitespace everywhere. At that point, I don’t care about the format anymore - the tool can unpack it.
I see people carrying and versioning around those proto files and comitting generated code, it makes me cringe.
For me, it makes me think - that person won’t be woken up one night because a key is missing from some message, or because some counter turned out to be a null.
Yup. I’ll gladly take a hit in the development experience if it means a smooth and predictable runtime experience.
Agreed, and even without going into more structured formats like Avro and protobuf, the assertion is very dubious if you just consider JSON vs msgpack/CBOR. These formats give you the same “shape” of data, no schema, etc. except that msgpack/CBOR make it easier to skip over fields you don’t care about: no string escaping, length-prefixed strings, etc. Some JSON parsers are really fast mostly because people poured a lot of work into the format, but that’s it.
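A sketch of why the length prefix matters, using MessagePack’s str encodings (only the str family is handled here; a real skipper would treat every type tag the same way):

    package main

    import (
        "encoding/binary"
        "errors"
        "fmt"
    )

    // skipMsgpackString skips a MessagePack-encoded string starting at data[i]
    // and returns the offset just past it, without touching the string bytes.
    func skipMsgpackString(data []byte, i int) (int, error) {
        if i >= len(data) {
            return 0, errors.New("truncated")
        }
        tag := data[i]
        switch {
        case tag >= 0xa0 && tag <= 0xbf: // fixstr: length in the low 5 bits
            return i + 1 + int(tag&0x1f), nil
        case tag == 0xd9: // str 8: 1-byte length
            return i + 2 + int(data[i+1]), nil
        case tag == 0xda: // str 16: 2-byte big-endian length
            return i + 3 + int(binary.BigEndian.Uint16(data[i+1:])), nil
        case tag == 0xdb: // str 32: 4-byte big-endian length
            return i + 5 + int(binary.BigEndian.Uint32(data[i+1:])), nil
        default:
            return 0, errors.New("not a string")
        }
    }

    func main() {
        // fixstr of length 5: tag 0xa5 followed by the bytes of "hello".
        data := append([]byte{0xa5}, "hello"...)
        next, _ := skipMsgpackString(data, 0)
        fmt.Println("next value starts at", next) // 6
    }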
Reading between the lines (especially his note about reducing allocations), I think Tim fused a matching engine with the JSON parser. If so, he could select/filter elements without even allocating for the full key.
For another angle, “Parsing Gigabytes of JSON per Second” uses SIMD to parse JSON. This particular trick only works with delimited formats like JSON, XML and CSV due to how they scan. It’s also another situation where they fused a query/rule engine with the parser to obtain extreme performance. It wouldn’t work with Protobuf, Avro, Cap’n Proto, etc.
Tim Bray is very smart, and he explicitly says that he’s not telling you everything. Give him the benefit of the doubt.
Even SIMD tricks can’t compete with not scanning at all. The really efficient formats like FlatBuffers, CapnProto and Fleece let you do direct indexing by internal pointers. This means you could look up a typical JSON path in an arbitrarily large message with only a few memory accesses, proportional to the path length.
At every level of the data an array can be indexed like a C pointer array resulting in the offset to the value, and a map can be indexed very quickly using binary search, again resulting in an offset.
The upshot is you can find a value in a 10GB encoded object by reading only two or three pages, whereas a JSON scan would probably have to read at least half the pages on average.
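A format-agnostic sketch of that lookup pattern - a sorted key table plus fixed-width offsets - rather than the actual FlatBuffers/Cap’n Proto/Fleece layout:

    package main

    import (
        "fmt"
        "sort"
    )

    // A toy "encoded map": sorted keys, each with an offset into a value
    // region. This mirrors the idea, not any real format's exact layout.
    type encodedMap struct {
        keys    []string // sorted
        offsets []int    // offsets[i] is where keys[i]'s value starts
        values  []byte   // the (possibly huge) value region
    }

    // lookup binary-searches the key table and jumps straight to the value,
    // never scanning the value region itself. The caller supplies the value
    // length here to keep the sketch short; real formats store it alongside
    // the offset.
    func (m encodedMap) lookup(key string, length int) ([]byte, bool) {
        i := sort.SearchStrings(m.keys, key)
        if i == len(m.keys) || m.keys[i] != key {
            return nil, false
        }
        off := m.offsets[i]
        return m.values[off : off+length], true
    }

    func main() {
        m := encodedMap{
            keys:    []string{"alpha", "zulu"},
            offsets: []int{0, 5},
            values:  []byte("aaaaazzzzz"),
        }
        v, ok := m.lookup("zulu", 5)
        fmt.Println(string(v), ok) // zzzzz true
    }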
The assertion treats format and codec as the same thing, which is false, and the author makes an abstraction inversion that invalidates his performance claim: he wants an API that enables filtering while parsing, but not all parsers offer that, so he built it on top of a whole-file parser. This is building a lower-level API on top of a higher-level API; only the opposite is typically possible without sacrificing performance.
Merely a streaming decoder? That was disappointing. I hoped they at least did something more adventurous, like plain string searching for the "interesting_key": string in the data before JSON-decoding anything. Such a hack would take advantage of JSON’s syntax, and wouldn’t be possible in most binary formats, which don’t escape their values and therefore don’t have enough redundancy to make such a naive seek and re-sync possible (but OTOH in length + data binary formats it’s super quick to skip over data, faster than any SIMDified non-allocating JSON).
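For what it’s worth, the hack described above might look roughly like this; as noted, it is unsound in general (the pattern can occur inside another string value), and the key name is made up:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    // grabValue looks for `"interesting_key":` in the raw bytes and decodes
    // only the value that follows it, never parsing the rest of the document.
    func grabValue(data []byte) (any, bool) {
        pat := []byte(`"interesting_key":`)
        i := bytes.Index(data, pat)
        if i < 0 {
            return nil, false
        }
        dec := json.NewDecoder(bytes.NewReader(data[i+len(pat):]))
        var v any
        if err := dec.Decode(&v); err != nil {
            return nil, false
        }
        return v, true
    }

    func main() {
        doc := []byte(`{"other": {"x": 1}, "interesting_key": "found me", "y": 2}`)
        v, ok := grabValue(doc)
        fmt.Println(v, ok) // found me true
    }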
I think the biggest reason to prefer a textual format is the ease of fixing corrupted inputs. For textual formats, especially those conforming to at least a context-free grammar, there are well-known algorithms to get you the most plausible fixes. For binary formats that rely on TLV, I have not yet seen a reasonable algorithm that can repair corruption.