I had trouble figuring out how the reference-style data structure relates to Twine.
Ooh, binary YAML!
Well I see where you’re coming from 😁. There are significant differences though: it’s not human readable (although you can kind of read the hexdumps with a bit of practice), definitely not human writable, and much better typed. No confusion between strings, booleans, and numbers, for a start.
Is it correct to think of your deduplication of values as dictionary compression?
Did you consider also creating a dictionary for the key fields, e.g. "a"? If the same message is sent multiple times then you’ll end up sending all the field names every time. If you had a schema, then you could assign “pointers” to the schema for each field and thus save bytes. Or is that not relevant for your use case?
Using pointers (implicit sharing), yes, it can be compared to compression where words are reused. The first example shows that, as a string is encoded only once.
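To make the comparison with dictionary compression concrete, here is a toy sketch in Python. It is not Twine's wire format or API: the `encode`/`decode` names and the `("ref", index)` representation are assumptions made up for illustration. The only point is that a repeated string is written once and later occurrences become small back-references.

```python
def encode(values):
    """Toy encoder: each distinct string is stored once, repeats become references."""
    out = []    # encoded stream, as a list of (kind, payload) entries
    seen = {}   # string -> index in `out` of its first occurrence
    for v in values:
        if v in seen:
            # Already emitted: write a pointer to the earlier entry.
            out.append(("ref", seen[v]))
        else:
            seen[v] = len(out)
            out.append(("str", v))
    return out


def decode(entries):
    """Inverse of `encode`: follow each reference back to the stored string."""
    result = []
    for kind, payload in entries:
        if kind == "str":
            result.append(payload)
        else:
            # References only ever point at "str" entries.
            result.append(entries[payload][1])
    return result


entries = encode(["hello", "world", "hello"])
# entries == [("str", "hello"), ("str", "world"), ("ref", 0)]
```

Unlike a compressor's dictionary, the references here are part of the data model itself, so a reader can follow them without decompressing anything first.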
There is no separate dictionary that could be shared between messages (although you could probably follow dCBOR42 and use a tag to represent a key into the shared dictionary). It’s not really relevant to my use case because the messages are fairly big and, using a schema (OCaml with a deriving library), we can encode directly into arrays instead of dictionaries. If you want compactness (and have a schema) you can also imitate protobuf and use (small) integers as dictionary keys.
Would gzip + JSON perform similarly at scale?
It depends a lot on how much you rely on the sharing. If you start from JSON, turn it into Twine, and also use gzip on it, both might be comparable. But you can encode things into Twine in, say, 50KiB, that would correspond to an extremely large JSON value, because encoding a DAG into a tree can result in exponential blowup. For my use case, producing the JSON (or CBOR) and then compressing it would be very wasteful, and a lot slower.
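The exponential blowup can be seen with a small Python sketch. This does not use Twine; `build_dag`, `count_unique`, and `tree_size` are hypothetical helper names, and nested Python lists stand in for a DAG in which both children of every node are the same shared object.

```python
import json


def build_dag(depth):
    """Build a DAG where each level holds two references to the same child."""
    node = "leaf"
    for _ in range(depth):
        node = [node, node]  # both slots point at one shared object
    return node


def count_unique(node, seen=None):
    """Count distinct objects: what a format with sharing has to store."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    if isinstance(node, list):
        return 1 + sum(count_unique(child, seen) for child in node)
    return 1


def tree_size(node):
    """Count nodes after expanding the DAG into a tree, as JSON must."""
    if isinstance(node, list):
        return 1 + sum(tree_size(child) for child in node)
    return 1


dag = build_dag(10)
print(count_unique(dag))     # 11 distinct objects
print(tree_size(dag))        # 2047 nodes once sharing is lost
print(len(json.dumps(dag)))  # 10236 characters of JSON
```

The number of distinct objects grows linearly with `depth`, while the expanded tree (and the JSON text) roughly doubles at every level. Gzip can win some of that back, but only after the full expansion has already been produced.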