1. 8
  1. 4

    So, the problem we’re trying to solve here is that JSON parsing is a bottleneck. Their solution is to do some cheaper pre-processing so they have less parsing to do. But then I have to ask:

    Why are we using a slow text format to begin with?

    I understand that not all past decisions can be reversed, and sometimes there’s no choice because of reasons¹. Still, before we even contemplate solving a problem, I believe we should think whether we could avoid it entirely.

    In this case: how about using a custom binary format instead of JSON? You may not even have to roll your own if something like MessagePack is fast enough despite being pretty generic. And if you do roll your own, chances are your format could be simple enough that it wouldn’t take more effort than implementing raw filtering.


    [1]: Like, text is more readable and easier to debug when all your tools are text based. We started with JSON in our prototype and now our “production grade prototype” that calls itself an application is built around JSON and we’re stuck.

    1. 1

      JSON is the most common data interchange format, and that’s hard to change. Maybe if you control all of the structure you could, but not everyone controls how data is presented to them, or how their data will be presented to others. Gigabyte-sized JSON files usually don’t come from internal services but from external ones, and likewise, such JSON files are usually produced for external systems, not internal ones. Could this change? Maybe; I’d say SQLite is becoming common enough to replace many of the large batch datasets in use, but for streaming data there is still nothing fairly universal besides JSON that wouldn’t create additional friction for others.

      1. 3

        Granted, sometimes it is what it is.

        My experience has been different, however. I’ve encountered JSON a number of times in my career, and every single time it was used to exchange data between two subsystems we controlled on both ends. I know that because we specified the exact shape of the JSON data we were exchanging. So JSON was not chosen because it was standard; it was chosen because it was perceived to be easy.

        Another thing to consider here is that JSON does not reduce friction that much: after parsing you still get fairly unstructured data. Sure, there are nested objects and lists, but much of the data is still represented as raw strings, for which you’ll need a bespoke parser anyway. Heck, my most recent encounter involved encoding binary data in base64 just so I could send it through JSON! And on top of that there’s the schema your data conforms to. Whether you formalise it or not, your data will be structured in a certain way, and that too begets custom code.
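
        That base64 detour also isn’t free. A quick sketch of the overhead (Python standard library only; the "data" key is just a made-up example):

        ```python
        import base64
        import json

        blob = bytes(range(256))        # 256 arbitrary raw bytes
        b64 = base64.b64encode(blob)    # 4 output bytes per 3 input bytes
        wrapped = json.dumps({"data": b64.decode("ascii")})
        # base64 alone adds ~33% on top of the raw payload,
        # before JSON's quoting and key overhead
        ```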

        You’ll have custom code anyway, so why not go all the way and design a custom data format? It won’t take more than a couple hundred lines of code beyond what you needed on top of JSON. It might even save you code in some cases (because your format is tailored to your data, you don’t have to fight it). Really, I suspect the friction you speak of is more of a mental block. Custom data formats are not nearly as hard as they’re perceived to be.

        1. 1

          JSON is good because it’s mostly self-describing, and there are tools for whatever platform to use it. Of course it isn’t as good if you control both sides, and in such cases I usually weigh whether it is worth going to the next thing over. Creating your own custom data format isn’t as easy as you say, because often you’ll need to implement that in several different languages, and even then, making your own format more efficient than JSON isn’t that easy. You’d probably use an already existing format in most cases, but now you need to decide, which format? Now you have to weigh your options, and that is extra work that you need to do. JSON is usually good enough to offset the time investment to choose something else.

          1. 2

            JSON is good because it’s mostly self-describing

            Not quite. It’s textual. What makes text special is the sheer ubiquity of associated tools: editors, terminals… to the point where we came to believe that text is “human readable”, even though it’s not (we need viewers to read it, and editors to modify it).

            Creating your own custom data format isn’t as easy as you say, because often you’ll need to implement that in several different languages

            Sure, depending on what language we use. Though note that a simple binary format is easy to implement in C. Much easier than an equivalent textual format in fact. From there all you need is language bindings (another thing that’s ubiquitous is being able to talk to C).

            even then, making your own format more efficient than JSON isn’t that easy.

            Boy, you have no idea how inefficient text formats are. Here’s the crux: text formats are terminator based, while binary formats are (mostly) length based. When you parse a textual format, you need to scan each character until you find the terminator. We have fancy techniques like finite state automata and LR parsing to make that faster, but the fundamental problem remains. Binary formats, however, tend to specify the length of their fields right there at the beginning. This lets you parallelise parsing if you ever need to, or skip fields you’re not interested in.

            You want to be more efficient than JSON? Start with TLV encoding: Type, Length, Value. It’s very simple, and it goes a long way.
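
            To make that concrete, here’s a tiny TLV sketch (Python for brevity; the tag values and field widths are arbitrary choices for illustration, not any standard): each field carries a one-byte type and a four-byte length, so a reader can jump from field to field instead of scanning for terminators.

            ```python
            import struct

            T_INT, T_STR = 0x01, 0x02   # made-up tag values for this sketch

            def encode_int(n):
                payload = struct.pack("<q", n)   # fixed 8-byte integer
                return struct.pack("<BI", T_INT, len(payload)) + payload

            def encode_str(s):
                payload = s.encode("utf-8")
                return struct.pack("<BI", T_STR, len(payload)) + payload

            def decode_all(buf):
                """The length field lets us jump ahead; no character-by-character scan."""
                out, i = [], 0
                while i < len(buf):
                    tag, length = struct.unpack_from("<BI", buf, i)
                    i += 5                       # 1-byte tag + 4-byte length
                    value = buf[i:i + length]
                    i += length
                    out.append(struct.unpack("<q", value)[0] if tag == T_INT
                               else value.decode("utf-8"))
                return out
            ```

            Skipping a field you don’t care about is just `i += length`, which is exactly the skip-or-parallelise property described above.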

            You’d probably use an already existing format in most cases

            Yes. Please everyone consider MessagePack.

            now you need to decide, which format?

            Like you didn’t need to decide when you chose JSON? If the choice of JSON was easy, the binary equivalent is just as easy: it’s MessagePack: like JSON, only it’s binary and leaner and faster.

            JSON is usually good enough to offset the time investment to choose something else.

            If you don’t want to think, the choice is already made: it’s MessagePack. And don’t worry about language support, your favourite language already has like 3 implementations.


            Okay, I’m being a little annoying about MessagePack, but I mean it: it’s basically JSON, only better. You just need a reader that’s not a text editor to visualise it.

            1. 1

              Msgpack is fine, but I think CBOR has a lot going for it. I appreciate the tag system, for example. It should be as easy or easier to parse than msgpack.

              Sure, depending on what language we use. Though note that a simple binary format is easy to implement in C. Much easier than an equivalent textual format in fact.

              Please don’t. That’s how buffer overflows happen. Use a library and an existing format, like you say; hopefully it’ll even be fuzzed and battle-tested.

              Otherwise I agree, it’s interesting to see how people think that json is fast and “good enough” when really it sucks at storing floats, integers, binary blobs, etc. It’s only acceptable for lists, dictionaries, and unicode text, and even then you pay the cost of escaping/unescaping your text.

              1. 2

                Please don’t. That’s how buffer overflows happen. Use a library and an existing format, like you say; hopefully it’ll even be fuzzed and battle-tested.

                • Yes, C is unsafe, and we should consider using something else whenever possible. However, for pure computations like parsing or cryptography, it also has the advantage of being everywhere, effectively making it extremely portable. This portability is why I still use it even though it’s almost always technically inferior.

                • Yes, better to use an existing format when available, provided it does what I need. I’ll take a look at CBOR one of these days.

                • What makes binary formats easier to implement also makes them safer. There’s less room for error in general, but one crucial thing is that, most of the time, you can just read a small size field and know right away how much memory you need to allocate. Then you just loop and you’re done. Textual formats, however, force you to allocate before you know the size, and that is so much more dangerous.

                • That said, in my experience fuzzing is the bare minimum where C code is concerned. Even my custom code is going to have property based tests, automatically generated correct and incorrect inputs, sanitizers, Valgrind, and all that jazz. We’re talking about processing adversarial inputs in an unsafe language, after all.
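
                The read-a-size-then-allocate pattern from the third point can be sketched like this (Python for brevity, but the bounds check is the part that matters in C too; `MAX_FIELD` is an arbitrary sanity cap I picked):

                ```python
                import struct

                MAX_FIELD = 1 << 20   # arbitrary cap; reject absurd declared lengths

                def read_field(buf, offset):
                    """Read one length-prefixed field, validating before 'allocating'."""
                    if offset + 4 > len(buf):
                        raise ValueError("truncated length prefix")
                    (length,) = struct.unpack_from("<I", buf, offset)
                    if length > MAX_FIELD or offset + 4 + length > len(buf):
                        # this check is what prevents the C-style overflow
                        raise ValueError("declared length exceeds buffer")
                    start = offset + 4
                    return buf[start:start + length], start + length
                ```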