1. 24
  1.  

  2. 4

    You could also use EDN… it’s easy to write (no commas!), and it has libraries in many languages.

    1. 6

      I looked into EDN, it seems like a nice enough object notation. But it’s not trying to be an especially compact serialization format, so it doesn’t really have much overlap with JCOF, whose main purpose is being a compact serialization format at the cost of being hard for humans to read and edit.

      1. 6

        So if it’s harder for humans to read and edit, why not go for full capnproto or protobuf which is efficient and tools exist to make it easy to read and edit? What’s the target need?

        1. 6

          Cap’n Proto and Protobuf both use a schema. Whether to use a serialization format with or without a schema is a big complicated discussion, but if you want a schemaless format, I’m not aware of one that’s smaller than JCOF.

          1. 5

            CBOR is such a format. Using the example on the page, just a base CBOR encoding is 198 bytes, while 136 bytes is possible if you use the string reference extension. All without a schema.

            1. 7

              I wasn’t that impressed with CBOR without the string reference extension.

              I had to search quite a bit to find a CBOR implementation which supports string references. None of the JavaScript ones did, but I found a Python library which does. CBOR with string references is clearly more compact than CBOR without, but it’s still usually not as compact as JCOF:

                8315 circuitsim.json
                2852 circuitsim.cbor (0.343x)
                2093 circuitsim.jcof (0.252x)
               51949 comets.json
               35639 comets.cbor (0.686x)
               37480 comets.jcof (0.721x)
               37996 madrid.json
               13411 madrid.cbor (0.353x)
               11959 madrid.jcof (0.315x)
              244975 meteorites.json
              119415 meteorites.cbor (0.487x)
               87083 meteorites.jcof (0.355x)
               56828 pokedex.json
               30909 pokedex.cbor (0.544x)
               23140 pokedex.jcof (0.407x)
              219635 pokemon.json
               60249 pokemon.cbor (0.274x)
               39650 pokemon.jcof (0.181x)
                 299 tiny.json
                 144 tiny.cbor (0.482x)
                 134 tiny.jcof (0.448x)
              

              I suppose if you don’t mind a slightly bigger size, don’t mind a binary format, don’t mind a format which isn’t human writable, and don’t mind using an extension which doesn’t seem that well supported by many CBOR libraries, but want an older, more established serialization format with libraries in more languages, CBOR with the string reference extension is a good choice. For a lot of situations, that’s gonna be the right trade-off.

        2. 5

          Not necessarily advocating, but for future reference: There are two compact serialization formats for EDN.

          First is Transit: https://github.com/cognitect/transit-format Its goal is to be an efficient transport encoding, but not necessarily ideal for data-at-rest.

          Second is Fressian: https://github.com/Datomic/fressian/wiki It’s very similar to Transit, but intended for durable storage.

      2. 3

        I love the use of a string table, this seems pretty novel compared to other options. Obviously the text-based nature of your format precludes some JSON alternatives such as MessagePack and CBOR.

        1. 4

          I extended my test suite/benchmark to compare JSON, JCOF, MessagePack and CBOR. MessagePack and CBOR get only modest size gains compared to JSON:

          tiny.json:
            JSON: 299 bytes
            jcof: 134 bytes (0.448x)
            msgp: 217 bytes (0.726x)
            cbor: 221 bytes (0.739x)
          circuitsim.json:
            JSON: 8315 bytes
            jcof: 2093 bytes (0.252x)
            msgp: 5666 bytes (0.681x)
            cbor: 5678 bytes (0.683x)
          pokemon.json:
            JSON: 219635 bytes
            jcof:  39650 bytes (0.181x)
            msgp: 194685 bytes (0.886x)
            cbor: 194811 bytes (0.887x)
          pokedex.json:
            JSON: 56812 bytes
            jcof: 23132 bytes (0.407x)
            msgp: 46817 bytes (0.824x)
            cbor: 46866 bytes (0.825x)
          madrid.json:
            JSON: 37960 bytes
            jcof: 11923 bytes (0.314x)
            msgp: 31887 bytes (0.840x)
            cbor: 31882 bytes (0.840x)
          meteorites.json:
            JSON: 244920 bytes
            jcof:  87028 bytes (0.355x)
            msgp: 199004 bytes (0.813x)
            cbor: 198669 bytes (0.811x)
          comets.json:
            JSON: 51949 bytes
            jcof: 37480 bytes (0.721x)
            msgp: 39948 bytes (0.769x)
            cbor: 39530 bytes (0.761x)
          

          This makes sense, precisely because CBOR and MessagePack lack a string table and an object shapes table, instead including all the key strings for every object.

          I think there could be some modest gains from going with a binary format, but honestly, there’s not that much to gain unless you go with a format which operates on bits rather than bytes.
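
          To illustrate where the repeated key strings go, here is a minimal Python sketch of the object-shapes idea. It is not JCOF's actual encoding: the `shapes`/`rows` layout, the function names and the sample data are all made up for the example, and the output is still plain JSON.

```python
import json

def encode_with_shapes(records):
    """Toy encoding: store each distinct key list once, then one row of values per object."""
    shapes = []        # each distinct key list, stored once
    shape_index = {}   # key tuple -> position in `shapes`
    rows = []
    for record in records:
        keys = tuple(record)
        if keys not in shape_index:
            shape_index[keys] = len(shapes)
            shapes.append(list(keys))
        # A row is the shape number followed by the values in key order.
        rows.append([shape_index[keys], *record.values()])
    return {"shapes": shapes, "rows": rows}

def decode_with_shapes(doc):
    """Rebuild the original objects from the shapes table and the rows."""
    return [dict(zip(doc["shapes"][row[0]], row[1:])) for row in doc["rows"]]

# Hypothetical sample data: many objects that share the same keys.
records = [{"first-name": f"person{i}", "age": 20 + i} for i in range(100)]

plain = json.dumps(records, separators=(",", ":"))
packed = json.dumps(encode_with_shapes(records), separators=(",", ":"))
print(len(plain), len(packed))
```

          With many objects of the same shape, the packed form pays for each key string once instead of once per object, which is the effect the numbers above suggest.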

          1. 7

            Just gzipping the JSON gives very good results:

               790 Jul 15 23:49 circuitsim.json.gz
             14524 Jul 15 23:49 comets.json.gz
              6813 Jul 15 23:49 madrid.json.gz
             35831 Jul 15 23:49 meteorites.json.gz
              8503 Jul 15 23:49 pokedex.json.gz
              5933 Jul 15 23:49 pokemon.json.gz
               163 Jul 15 23:49 tiny.json.gz
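
            For anyone who wants to try the same comparison on their own data, a rough equivalent using only Python's standard library might look like this. The sample records are invented; the files from the benchmark are not reproduced here.

```python
import gzip
import json

# Hypothetical sample data with the kind of repetition gzip exploits.
records = [{"first-name": f"person{i}", "age": 20 + i} for i in range(200)]

raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
compressed = gzip.compress(raw)

# Sizes before and after compression.
print(len(raw), len(compressed))

# The data survives the round trip unchanged.
restored = json.loads(gzip.decompress(compressed))
```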
            
            1. 5

              Yes, using a proper compression algorithm will always produce smaller data than just using a more efficient data encoding. If all you care about is the compressed size, and you don’t worry about the uncompressed size, you probably don’t need JCOF, CBOR, MessagePack, or any other serialization format which tries to be “JSON but smaller”. Still, there is clearly a desire for a more efficient way to encode JSON-like data.

              That said though, there is some space saving even when gzipping:

                162 tiny.json.gz
                139 tiny.jcof.gz (0.858x)
                806 circuitsim.json.gz
                696 circuitsim.jcof.gz (0.863x)
               5946 pokemon.json.gz
               4248 pokemon.jcof.gz (0.714x)
               6634 madrid.json.gz
               5645 madrid.jcof.gz (0.851x)
               7483 pokedex.json.gz
               7404 pokedex.jcof.gz (0.989x)
              14120 comets.json.gz
              14023 comets.jcof.gz (0.993x)
              35829 meteorites.json.gz
              33152 meteorites.jcof.gz (0.925x)
              

              The size reduction in tiny, circuitsim, pokemon and madrid when going from gzipped JSON to gzipped JCOF is about on par with the size reduction you get from going from uncompressed JSON to uncompressed CBOR or MessagePack, and both of those formats sell themselves as being smaller and more concise than JSON.

            2. 3

              The big downside of a string table is that it prevents streaming generation. Or, at least, requires you to stream, in parallel, into two buffers and then combine them, which means that you can’t stream through a compression or encryption algorithm, you need to buffer the data and then compress. It does allow streaming on the receive side, but requires that you keep the entire string table in memory until you have processed an entire document. If you want to extract a single node from a tree, it’s more expensive. That eliminates a lot of use cases where JSON is a good choice.
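
              A small sketch of the buffering problem, assuming a made-up format (a JSON string table, then `;`, then comma-separated indices) rather than JCOF's real syntax:

```python
import io
import json

def encode_with_string_table(strings):
    """Toy format: JSON string table, then ';', then comma-separated table indices."""
    table = []              # distinct strings in first-seen order
    index = {}              # string -> position in `table`
    body = io.StringIO()    # second buffer: held back until the table is complete
    for i, s in enumerate(strings):
        if s not in index:
            index[s] = len(table)
            table.append(s)
        body.write(("," if i else "") + str(index[s]))
    # The header depends on every input string, so nothing can be emitted before this point.
    return json.dumps(table, separators=(",", ":")) + ";" + body.getvalue()

encoded = encode_with_string_table(["red", "green", "red", "red"])
print(encoded)
```

              The first output byte cannot be written until the last input value has been seen, so the whole body sits in memory before a compressor or cipher gets anything to work on.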

              If you’re using it as a message format for a well-defined protocol, then the comparison might want to include things like FlatBuffers and ASN.1, which are specifically optimised for this kind of use case, rather than as a generic JavaScript data serialisation format.

          2. 3

            s-expressions or canonical s-expressions anyone?
            https://en.wikipedia.org/wiki/Canonical_S-expressions

            1. 5

              How would you encode the provided example in sexps in a way that’s smaller than the JCOF?

              1. 1

                gzip. S-expressions for handling, gzip for size.

                1. 4

                  @mort said he’s optimizing for uncompressed size.

                  1. 1

                    You asked about size; I don’t mind. I find the goal as arbitrary as ‘optimised for using the fewest vowels’.

                    P.S.: re-reading my comment just made me frown and laugh, because I am currently very scrupulous about the size of an executable binary I’m building.

                    1. 3

                      It’s not arbitrary for @mort, though, it’s the very specific constraint he had that motivated JCOF in the first place.

            2. 4

              most of the same semantics [as JSON]

              JSON doesn’t really have any semantics to speak of: when are two JSON values equal? When are they different? You could do better than JSON here by defining an equivalence relation over JCOF terms.

              1. 3

                Maybe it would’ve been better to call it “the JSON data model”, which is what CBOR calls it. I’ll consider updating the readme.

                1. 7

                  Well, call it what you will, there’s no there there :-) The JSON data model is not well-defined enough to really be said to exist. I ranted a little on this topic here: https://preserves.dev/why-not-json.html

                  1. 2

                    I took the time to try to define semantics, in a way I think is consistent with how JSON parsers and serializers are often implemented: https://github.com/mortie/jcof/pull/3/files

                    I would love some feedback on that. In particular, is everything clear enough? Should I have a more thorough explanation of how exactly numbers map to floating point values? On the one hand, it would’ve been nice; but on the other hand, correct floating point parsing and serialization is so complicated that it’s nice to leave it up to a language’s standard library, even if that results in slight implementation differences. (While doing research on how other languages do this, I even found that JavaScript’s number to string function has implementation-defined results.)

                    1. 2

                      That’s really nice. You probably don’t have to pin down text representation of floats further, but you might say something like “the IEEE754 double value closest to the mathematical meaning of the digits in the number” if you like. It’s a bit thorny, still, depressingly, isn’t it! For preserves I pointed at Will Clinger’s and Aubrey Jaffer’s papers. It might also be helpful to give examples of JCOF’s answers to the questions I wrote down in my rant linked upthread. Also useful would be to simply point at the relevant bit of the spec for comparing two doubles: for preserves I chose to use the totalOrder predicate from the standard, because I wanted a total ordering, not just an equivalence, but I think the prose you have maps more closely to compareQuietEqual from section 5.11.

                      1. 1

                        I actually originally had wording to the effect of “the IEEE 754 double value closest to the meaning of the digits”, but I tried to figure out if that’s actually what JavaScript’s parseFloat does, which is when I found out that JavaScript actually leaves it up to the implementation whether the value is rounded up or down after the 20th digit. So for the string "2.00000000000000000013" (1 being the 20th significant digit), it’s implementation-defined whether you get the float representing 2.0000000000000000001 or 2.0000000000000000002, even though the former is closer. I could try to copy the JavaScript semantics, as that probably represents basically what’s achievable on as broad a range of hardware as is reasonable. I certainly don’t think I should be more strict than JavaScript. Though I was surprised that JavaScript apparently doesn’t require that you can round-trip a float perfectly with parseFloat(num.toString()).

                        I also originally tried looking into how IEEE 754 defines equality, thinking I could defer to that instead of talking about values being bit-identical, and I found the predicate compareQuietEqual in table 5.1 in section 5.11. I was never able to find a description of what compareQuietEqual actually does, however, nor did I find anything else which describes how “equality” is defined. If you have any insight here, I’d like to hear. (Additionally, my semantics would want to consider -0 and 0 to not be the same; this is actually why I use the phrase “the same” rather than “compare equal”. I wouldn’t want a serializer to encode -0 as 0.)

                        I also noticed that JavaScript doesn’t mention compareQuietEqual; it defines numbers x and y to be equal if, among other things, “x is the same Number value as y”, where “the Number value for x” is defined to be the same as IEEE754’s roundTiesToEven(x). And roundTiesToEven is just a way to go from an abstract exact mathematical quantity to a concrete floating point number. So that, to me, sounds like JavaScript is using bitwise equality, unless it uses “the same” to mean “compares equal according to compareQuietEqual”.
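
                        The difference between numeric equality and “the same bits” can be shown in a few lines. This uses Python as a stand-in, not JavaScript or the IEEE 754 text itself, and `bits` and `same_bits` are names invented for the sketch:

```python
import struct

def bits(x: float) -> bytes:
    """The 8 bytes of x as a big-endian IEEE 754 double."""
    return struct.pack(">d", x)

def same_bits(a: float, b: float) -> bool:
    """True only if a and b have identical bit patterns."""
    return bits(a) == bits(b)

nan = float("nan")

# Numeric equality vs. bit equality for the two zeros.
print(0.0 == -0.0, same_bits(0.0, -0.0))
# Numeric equality vs. bit equality for one NaN compared with itself.
print(nan == nan, same_bits(nan, nan))
```

                        Python's `==` treats -0 and 0 as equal and NaN as unequal to everything, while the byte comparison distinguishes the zeros and accepts a NaN as the same as itself, which is closer to the “the same” wording discussed here.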

                        It always seems that once you dig deep enough into the specs underpinning our digital world, you find that at the core, it’s all just ambiguous prose and our world hangs together because implementors happen to agree on interpretations.

                        Regarding the questions, my semantics answer most of them, but I would need to constrain float parsing to be able to answer the second one. The answers are:

                        • are the JSON values 1, 1.0, and 1e0 the same or different? They are all the same, since they parse to the same IEEE 754 double precision floating point numbers.
                        • are the JSON values 1.0 and 1.0000000000000001 the same or different? Currently ambiguous, since I don’t define parsing rules. If we used JavaScript’s rules, they would be different, since they differ in the 17th significant digit, and JavaScript parseFloat is exact until the 20th.
                        • are the JSON strings “päron” (UTF-8 70c3a4726f6e) and “päron” (UTF-8 7061cc88726f6e) the same or different? They are different, since they have different UTF-8 code units.
                        • are the JSON objects {"a":1, "b":2} and {"b":2, "a":1} the same or different? They are the same, since order doesn’t matter.
                        • which, if any, of {"a":1, "a":2}, {"a":1} and {"a":2} are the same? Are all three legal? The first one is illegal because keys must be unique. The second and third are different, since the value of key “a” is different.
                        • are {"päron":1} and {"päron":1} the same or different? They are the same if both use the same UTF-8 code point sequence for their keys.
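
                        Several of these answers can be checked against Python's standard library as a stand-in. This says nothing about any JCOF implementation; it only shows that the stated answers match how one common language treats the same inputs:

```python
import json
import unicodedata

composed = "p\u00e4ron"      # ä as a single code point
decomposed = "pa\u0308ron"   # a followed by a combining diaeresis

composed_hex = composed.encode("utf-8").hex()
decomposed_hex = decomposed.encode("utf-8").hex()

# Strings are compared code point by code point, with no normalization.
strings_same = composed == decomposed
# After explicit NFC normalization the two spellings coincide.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Key order does not affect equality of parsed objects.
objects_same = json.loads('{"a":1, "b":2}') == json.loads('{"b":2, "a":1}')
# 1, 1.0 and 1e0 all parse to the same double.
numbers_same = float("1") == float("1.0") == float("1e0")

print(composed_hex, decomposed_hex, strings_same, same_after_nfc, objects_same, numbers_same)
```

                        One place where Python differs from the answers above: `json.loads` accepts `{"a":1, "a":2}` and keeps the last value instead of rejecting the document.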

                        Once we have the float parsing thing nailed down, it would be a good idea to add updated answers to the readme.

                        1. 1

                          I think IEEE-754 floats is one area where the binary formats win over text. CBOR can represent IEEE-754 doubles, singles and halves exactly (including +inf, -inf, 0, -0, and NaN). When I wrote my own CBOR library, I even went so far as to use the smallest IEEE-754 format that would round-trip (so +inf would be encoded as a half-float, for instance).
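
                          A rough sketch of that “smallest format that round-trips” check, using Python's `struct` module. This is not taken from any CBOR library; the function name is invented and NaN payload bits are ignored:

```python
import math
import struct

def smallest_float_width(x: float) -> int:
    """Return 2, 4 or 8: the narrowest IEEE 754 width in bytes that reproduces x."""
    for fmt, width in ((">e", 2), (">f", 4)):
        try:
            back = struct.unpack(fmt, struct.pack(fmt, x))[0]
        except (OverflowError, struct.error):
            continue  # finite value too large for this width
        if math.isnan(x) and math.isnan(back):
            return width  # any NaN counts as a match here
        # Compare value and sign so that -0.0 is not confused with 0.0.
        if back == x and math.copysign(1.0, back) == math.copysign(1.0, x):
            return width
    return 8

for value in (float("inf"), 1.5, 65536.0, 0.1):
    print(value, smallest_float_width(value))
```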

                          For Unicode, you may want to specify a canonical form (say, NFC or NFD) to ensure interoperability.

                          1. 1

                            +1 for binary floats.

                            Re Unicode normalization forms: I’d avoid them at this level. It feels like an application concern, not a transport concern to me. Different normalization forms have different purposes; the same text sometimes needs renormalizing to be used in a different way; etc. Sticking to just sequence-of-codepoints is IMO the right thing to do.

                            1. 1

                              I won’t specify a Unicode canonicalization form, since that would require correct parsers to contain or depend on a whole Unicode library, and it would mean different JCOF implementations which operate with different versions of Unicode are incompatible. Strings will remain sequences of UTF-8 encoded code points which are considered “the same” only if their bytes are the same.

                              Regarding floats, I agree that binary formats have an advantage there, since they can just output the float’s bits directly. Parsing and stringifying floats is an annoyingly hard problem. But I want this to remain a text format. Maybe I could represent the float’s bits as a string somehow though; base64 encode the 8 bytes or something. I’ll think about it.

                              1. 1

                                Hexfloats are a thing! https://gcc.gnu.org/onlinedocs/gcc/Hex-Floats.html

                                For preserves text syntax, I didn’t offer hexfloats (yet), instead escaping to the binary representation. https://preserves.dev/preserves-text.html#fn:rationale-switch-to-binary
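
                                For a feel of what exact textual floats look like, here is a short Python sketch of both ideas from this sub-thread: a hexfloat string and a base64 encoding of the 8 raw bytes. Neither is part of JCOF; the big-endian byte order is an arbitrary choice for the example.

```python
import base64
import struct

x = 0.1

# Hexadecimal text form: exact, no decimal rounding involved.
hex_text = x.hex()
# Base64 of the 8 bytes of the double (big-endian chosen arbitrarily).
b64_text = base64.b64encode(struct.pack(">d", x)).decode("ascii")

from_hex = float.fromhex(hex_text)
from_b64 = struct.unpack(">d", base64.b64decode(b64_text))[0]

print(hex_text, b64_text)
```

                                Both forms reproduce the original bits, including the sign of zero, at the cost of being less readable than a decimal literal.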

                            2. 1

                              consider -0 and 0 to not be the same

                              Aha! Then you do want totalOrder after all, I think. When used as an equivalence it ends up being a comparison-of-the-bits IIRC. See here, for example.

                              1.0 =?= 1.0000000000000001

                              Wow, are you sure this is ambiguous for IEEE754 and Javascript? Trying it out in my browser, the two parseFloat to identical-appearing values. I can’t make it distinguish between them. What am I missing?

                              Per Wikipedia on IEEE754 (not Javascript numbers per se): doubles have “from 15 to 17 significant decimal digits precision […] If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.” I used this info when cooking up the example.
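
                              The same experiment can be repeated outside the browser. A sketch in Python, on the assumption that CPython's `float()` rounds decimal strings correctly (which holds on typical builds); it does not settle what the JavaScript spec permits:

```python
# The two literals from the example parse to the same double.
same = float("1.0") == float("1.0000000000000001")

# One more unit in the 17th significant digit crosses the halfway point.
differs = float("1.0") != float("1.0000000000000002")

# 17 significant digits are enough to round-trip a double.
x = 0.1 + 0.2
seventeen = format(x, ".17g")
round_trips = float(seventeen) == x

print(same, differs, seventeen, round_trips)
```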

                              Oh, wow, OK, I’ve just found RoundMVResult in the spec. Perhaps it’s aimed at those implementations that use, say, 80-bit floats for their numbers? But no, that can’t be right. What am I missing?

                              3 extra decimal digits is… about 10 bits of extra mantissa. Which gets us pretty close to the size of the mantissa of an 80-bit float. So maybe that’s the reason. Hmmm.
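
                              The arithmetic behind that estimate, as a quick Python check. The significand widths are the standard ones for IEEE 754 binary64 and the x87 80-bit extended format:

```python
import math

bits_per_decimal_digit = math.log2(10)      # about 3.32 bits per decimal digit
extra_bits = 3 * bits_per_decimal_digit     # three extra digits, about 9.97 bits

double_significand_bits = 53                # binary64, including the implicit leading bit
extended_significand_bits = 64              # x87 80-bit extended format

estimate = double_significand_bits + round(extra_bits)
print(round(extra_bits, 2), estimate, extended_significand_bits)
```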

                    2. 2

                      One detail is the meaning of an object that uses the same key twice - what does that mean?

                    3. 2

                      This is exceptionally similar to avro. Why not use that?

                      1. 9

                        It seems like Avro is a schema-based format. JCOF, like JSON, is schemaless. It also seems like Avro is a binary format, while JCOF, like JSON, is plaintext and human-writable.

                        1. 5

                          To me, this is human-writable in the sense that a human could write it, but I don’t imagine they would ever want to. Not much of an advantage.

                          1. 4

                            Well, the readme doesn’t show it off as human-writeable, but this is a completely valid JCOF document: ;;{"people":[{"first-name":"Bob","age":32},{"first-name":"Alice","age":28}]}. Certainly more human writeable than a binary format.

                            1. 4

                              To me, “writable” implies modification, not just creation. A more optimal document would be very difficult for a human to modify.

                              1. 2

                                I find s-expressions incredibly hard to read and write, but lots of people do fine with a little practice. This looks writeable with practice too.

                      2. 2

                        Would be interested in seeing this run through the same set of benchmarks as the formats compared in this paper.

                        1. 1

                          The lookup table reminds me of Huffman. How does it compare to zip or gzip, performance and space wise? I imagine you would obtain similar or better results with gzipped json?

                          1. 2

                            Gzipped JSON is significantly smaller than plain JCOF, so if you don’t care about the uncompressed size of your data you might as well use gzipped JSON. Though JCOF+gzip is a little smaller than JSON+gzip: https://lobste.rs/s/5edgkf/i_made_jcof_object_notation_which_encodes#c_ndatwo

                            I also tested with xz, and JSON+xz seemed about the same size as JCOF+xz.

                          2. 1

                            Side question for the author, what’s the motivation for schemaless in a video game context?

                            1. 2

                              A very easy way to create serialisers/deserialisers in JS is to write functions which create or consume a plain object representation, and then use JSON to create a string representation. So there’s no better reason than that it’s just easier. The alternatives seem to be to add a big serialisation library and complicate the build process with code generation from schema files, or write an ad-hoc schema-style format where the “schema” is represented as the state control flow in the code. Making a more efficient schemaless format with the JSON object model just seemed easier.
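
                              The pattern described here, sketched in Python rather than JS. The `Player` class and its fields are hypothetical and only stand in for whatever game object is being saved:

```python
import json
from dataclasses import dataclass

@dataclass
class Player:  # hypothetical game object, not from the thread
    name: str
    x: float
    y: float

    def to_plain(self) -> dict:
        """Create a plain, JSON-compatible representation."""
        return {"name": self.name, "pos": [self.x, self.y]}

    @classmethod
    def from_plain(cls, plain: dict) -> "Player":
        """Consume the plain representation produced by to_plain."""
        x, y = plain["pos"]
        return cls(name=plain["name"], x=x, y=y)

player = Player("alice", 1.5, -2.0)
text = json.dumps(player.to_plain())
restored = Player.from_plain(json.loads(text))
print(text)
```

                              Swapping `json.dumps`/`json.loads` for another schemaless encoder would leave `to_plain` and `from_plain` untouched, which is the convenience being described.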

                            2. 1

                              So it’s a compression scheme?

                              1. 1

                                I suppose it’s a matter of where you draw the line between “compression” and “encoding scheme”. Is UTF-8 a compression scheme, since it represents text much more compactly than using 21 bits for each code point? Or is it just a more efficient encoding scheme? Or maybe UTF-8 is the opposite of compression, since it represents most text less efficiently than what something like exp-Golomb-7 encoding would? Is JSON a compression scheme because it produces an output that’s more compact than representing the same data as XML?

                                In my opinion, JCOF is on the “encoding scheme” side of the line. But I don’t think there’s a hard, fundamental definition which can say objectively whether something is an “encoding scheme” or “compression scheme”.
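
                                To put numbers on the UTF-8 comparison, a small Python sketch. The sample strings are arbitrary, and the fixed-width scheme is just 21 bits multiplied by the number of code points:

```python
def utf8_bits(text: str) -> int:
    """Size of the text in bits when encoded as UTF-8."""
    return 8 * len(text.encode("utf-8"))

def fixed_21_bits(text: str) -> int:
    """Size in bits at a flat 21 bits per code point (len() counts code points)."""
    return 21 * len(text)

samples = (
    "hello",
    "p\u00e4ron",            # päron
    "\u65e5\u672c\u8a9e",    # 日本語
)
for sample in samples:
    print(sample, utf8_bits(sample), fixed_21_bits(sample))
```

                                UTF-8 comes out well ahead for ASCII-heavy text and slightly behind for text made of three-byte code points, so which side of the line it lands on depends on the input.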

                              2. 1

                                I’m actually fairly impressed at this effort. The text format is comprehensible with effort, it’s self-describing in all the ways you would want, while giving major benefits for real use-cases (object-shapes being the primary one, I suspect).

                                1. 1

                                  I would add on that I appreciated the lack of hubris in the documentation and the discussion here.

                                  I’m publishing it because other people may find it useful too. If you don’t find it useful, feel free to disregard it.

                                  There isn’t a push to make this the new thing and there is an acknowledgement that other formats perform better in other use cases. I find that refreshing.