A problem with LTV for some formats, I think, is that the type influences how the length is represented.
For example, in CBOR (or msgpack, they’re pretty close) the first byte determines the type and, depending on the type, contains part of the value or part of the length. The type is the 3 most significant bits; the 5 least significant might contain a small integer or special value, or parts of the length which then follows. You simply cannot parse the length without parsing the type first.
In protobuf, messages are composed of (tag+type), length, value triples. The first byte contains the tag (record field) along with the wire type. But there too, length is not always an actual thing: integers are directly encoded after the first byte via a varint. What would that even look like with LTV?
The article didn’t convince me that LTV is better than TLV, but your comment did.
If we lead with the length, it can’t be specialized based on the type. It should always be the number of bytes following, not the number of entries of the type specified.
Not exactly, but it’s close. Having the size outside the type/Metadata allows parsers that can reliably skip frames they don’t understand, which is great for forward compatibility or encoding a wide range of stuff for analysis by different tools. It does prevent some size optimizations, but when that’s not tantamount, it can be worth it to not have the type system abstraction leak into the protocol framer.
Surely you can do the same with a TLV framing? You need the type to know if you can handle it and the length to know how far to skip, which comes first does not matter.
Possible, but it limits the extensibility of the type encoding because parsers have to know how wide it is even when they don’t understand it. It’s also more complicated because of that.
If parsers don’t know how to decode types they’re useless, and if they only know how to decode some types they risk false positives. LTV does not save you.
It’s also more complicated because of that.
TLV are only “more complicated” if you actively leverage capabilities LTV don’t have, at which point they’re long out of the running.
My experience is also very much the opposite: TLV’s “more complexity” is exactly where you’d expect it (in the type dispatch loop), but LTVs often can’t resist the allure of using invalid lengths as semantic tags, so you end up with multiple dispatch loops.
is TLV and it can be partially parsed. Just say “partial parsing” and leave out the TLV / LTV stuff, because it doesn’t add anything to the conversation.
Have you heard about the Asterix format used to exchange air traffic control information in Europe? It comes in a few dozen sub-types, each of which comes in a few tens of versions, not all of which are backwards compatible with each other. (Feast your eyes for example on the CAT004 message specification. Or don’t.)
This thing would be significantly easier to deal with if it were LTV. As it is, just figuring out how many records come in a single one of these messages is needlessly difficult and involves parsing the type of every sub-field in the record (the types of which are, of course, not backwards compatible between versions).
It’s likely to be a big saving though. For an extreme example, consider encoding a Boolean. If you want to encode arbitrary data, your length field needs to be at least 32 bits (unless you use a variable-length length encoding). You then likely need at least a 16 but type descriptor. Booleans are sufficiently common that you might encode true and false as separate types, so with an encoding where the type is the only fixed-sized header, you need two bytes for a Boolean but with a size-then-type encoding you need six bytes: 200% overhead. This isn’t so bad if you use something like LEB128 for your types and length encodings. Then you could get away with two bytes with length first, one for type-first-length-optional, so only a 100% overhead.
This will influence parsing speed but that matters most when you have nested types where you want to lazily parse things and even then the cost of a small lookup to know that a chunk is an array and has a length field is small.
Booleans are sufficiently common that you might encode true and false as separate types, so with an encoding where the type is the only fixed-sized header, you need two bytes for a Boolean
Actually, 1 is enough if you reserve two types for Booleans (one for True and one for False). Which I guess you can do most of the time, that with having 256 different type values to play with. This should make your case even stronger:
TLV with 2 reserved types for Booleans: 1 byte per boolean.
LTV with 4-bytes length: 4 + 1 + 1 = 6 bytes per boolean (500% overhead).
LTV with 2-bytes length: 2 + 1 + 1 = 4 bytes per boolean (300% overhead).
Actually, 1 is enough if you reserve two types for Booleans
That is exactly what the text that you quoted proposed, but it assumed a 2-byte fixed-width type descriptor. If you have a single-byte or a variable-length encoding (as proposed later in my post) then the size is smaller.
🤦 I read 16 bytes in your comment and I auto-completed to 1-byte for type, 1 byte for the boolean value. I’ve never worked with complex file formats, so I tend to assume that most of the cases 256 different (main) types is enough, in which case 1-byte type tags are a given.
The author is confusing field ordering with semantics.
In most record-oriented binary protocols you’ll see a fixed-size header first, with the header containing the length and (in some protocols) a type. It doesn’t matter which order those are in because the whole header is read as a unit, so whether it’s encoded as [length][type] or [type][length] just doesn’t matter at all.
The most well-known example would be IP. An IPv4 header (without options) is 16 octets, with the packet length in octets [2..3] and protocol (type) in octet 9. You wouldn’t read the header one byte at a time, you read the whole struct iphdr at once then decide what to do with the packet by inspecting its subfields (in any order you want).
This is also the traditional format used by Unix kernels to communicate with userspace. Linux’s FUSE protocol has struct fuse_request_header starting with (u32 length, u32 opcode), but the length is vestigial since each FUSE record is always transferred in a single syscall. Or ancillary data from recvmsg, where struct cmsghdr contains (length, type=(cmsg_level, cmsg_type)).
Where TLV comes into play is when you’ve got something like Protobuf, which mixes fixed-width and dynamic-width values in the same record. Parsing a Protobuf will yield something like [BITS32, BITS32, BITS64, DYN(length=1234), BITS32]. It wouldn’t make any sense to encode a sfixed32 as (4, BITS32) because the type itself contains all the information needed to know the length, so as an optimization the length is elided for the BITS32 type.
Yes exactly, the issue is whether you allow partial parsing, and whether the encoding of the length (or whether it’s present at all) depends on the type.
Calling it TLV vs. LTV is very confusing, because that’s NOT the issue.
As an analogy, I would compare CSV and TSV:
CSV doesn’t allow partial parsing, you can’t skip to row 100 without parsing every comma and quote. Fields can contain embedded newlines.
TSV disallows newlines in fields, so you can just skip 99 newlines, without looking at anything else, and then you have row 100.
Unfortunately TSV (as I understand it from your comment) still requires you to scan every single character for that newline. Well, at least the code can be simpler…
That’s true, but I don’t think it’s slow. Scanning for newlines is essentially free – the cost is loading from memory or disk
Although you could probably make the same argument for CSV … you can probably parse it as fast as you can stream it from memory
BUT that’s only true in languages like C++ or Rust
In Python, JavaScript, Erlang, etc. I think parsing TSV will be much faster than CSV in practice. I learned the hard way how all the little allocations get you :)
The history of JSON parsers in Python, JavaScript, etc. is similar (e.g. I remember how simplejson got into the Python stdlib).
It can be made arbitrarily fast in C or C++, but ironically JSON is more useful/common in the dynamic languages, and it’s definitely slow there
I’m not sure I like this analogy, I’ve written a generalized DSV parser, and I can assure you, there is some tool somewhere that emits what you don’t expect (like newlines in fields), and even with the small customer base we had at the time, exceptions and nonsense abounded.
P.S. if anyone cares, my favorite flavor is CSV with quoted fields and quotes within fields doubled. No other escaping is necessary
Nothing in your comment contradicts what I said: TSV can be partially parsed in the sense that you can skip to an arbitrary row without necessarily knowing where the cell boundaries are.
CSV can’t be partially parsed in this sense.
“A generalized DSV parser” doesn’t really relate to what I’m talking about.
I’ve seen nominally TSV files with newlines in quoted fields. That contradicts your statement that TSV doesn’t allow newlines other than between records. I also don’t think there’s a formal spec for either, and if there is, some LOBware somewhere will do it wrong anyway.
An alternative solution is to have a constant-length header and the length only refer to length of the data itserf without headers/footers.
For example, PNG chunks have Length (specifying the length of the chunk data excluding Length, Type, and CRC fields), then Type, then Data, if any, and then CRC.
LTV means you always encode a length, even when the type has a fixed size, which could be quite wasteful, depending on what your data is.
I imagine that encoding the size in the type tag (perhaps the bottom 3 or 4 bits encode the size of the type in sorta powers of two) or having a lookup table from type to size might be more efficient if you’re transmitting a lot of fixed size values that you can’t for whatever reason encode as arrays or large structs (where you could have one length and type tag shared covering many values).
I 100% agree with this, having the length on the outside means you are building a useful abstraction. I’ve built little toy protocols and used tokio_util::codec::length_delimited to do exactly that.
I think that one reason why people might structure it differently is that when it comes to keeping stuff in code, we usually structure it the other way. For example, if this is what your messages look like:
Now, your enum “in-code” works the other way: you have the type on the outside (represented by the enum variant) and then for every variant, you may have an embedded length (for example, the String as a variably-sized type has an internal length field and an allocation for the data).
So one form is useful on-the-wire, while the other variant is useful in-memory.
TLV is commonly useful on the wire because tons of types don’t need a length, so that’s wasted space. If you encode a u8 as TLV, you have length(tag) overhead because the length is subordinate to the type and you can just skip it for fixed-size types. As LTV, your overhead is length(len) + length(tag) of overhead.
A problem with LTV for some formats, I think, is that the type influences how the length is represented.
For example, in CBOR (or msgpack, they’re pretty close) the first byte determines the type and, depending on the type, contains part of the value or part of the length. The type is the 3 most significant bits; the 5 least significant might contain a small integer or special value, or parts of the length which then follows. You simply cannot parse the length without parsing the type first.
In protobuf, messages are composed of (tag+type), length, value triples. The first byte contains the tag (record field) along with the wire type. But there too, length is not always an actual thing: integers are directly encoded after the first byte via a varint. What would that even look like with LTV?
The article didn’t convince me that LTV is better than TLV, but your comment did.
If we lead with the length, it can’t be specialized based on the type. It should always be the number of bytes following, not the number of entries of the type specified.
This article is screaming for some kind of example where it would matter … it seems like a weak and mostly theoretical argument as is
https://wiki.wireshark.org/Development/LibpcapFileFormat
Not exactly, but it’s close. Having the size outside the type/Metadata allows parsers that can reliably skip frames they don’t understand, which is great for forward compatibility or encoding a wide range of stuff for analysis by different tools. It does prevent some size optimizations, but when that’s not tantamount, it can be worth it to not have the type system abstraction leak into the protocol framer.
thank you, great example
Surely you can do the same with a TLV framing? You need the type to know if you can handle it and the length to know how far to skip, which comes first does not matter.
Possible, but it limits the extensibility of the type encoding because parsers have to know how wide it is even when they don’t understand it. It’s also more complicated because of that.
If parsers don’t know how to decode types they’re useless, and if they only know how to decode some types they risk false positives. LTV does not save you.
TLV are only “more complicated” if you actively leverage capabilities LTV don’t have, at which point they’re long out of the running.
My experience is also very much the opposite: TLV’s “more complexity” is exactly where you’d expect it (in the type dispatch loop), but LTVs often can’t resist the allure of using invalid lengths as semantic tags, so you end up with multiple dispatch loops.
OK thanks for the example. I’d say that this is about whether a protocol can be partially parsed, which is different than TLV vs LTV.
i.e. agreeing with @jmillikin and @masklinn
https://lobste.rs/s/lfbey9/binary_formats_protocols_ltv_is_better#c_rdz5pc
Yeah. They said it better than me. I’m claiming that partial parsing is easier with LTV.
But it’s obviously not.
is TLV and it can be partially parsed. Just say “partial parsing” and leave out the TLV / LTV stuff, because it doesn’t add anything to the conversation.
Have you heard about the Asterix format used to exchange air traffic control information in Europe? It comes in a few dozen sub-types, each of which comes in a few tens of versions, not all of which are backwards compatible with each other. (Feast your eyes for example on the CAT004 message specification. Or don’t.)
This thing would be significantly easier to deal with if it were LTV. As it is, just figuring out how many records come in a single one of these messages is needlessly difficult and involves parsing the type of every sub-field in the record (the types of which are, of course, not backwards compatible between versions).
[Comment removed by author]
Depending on the type you may not need a length.
That can be a leaky abstraction though, the protocol layer needs to know about the consumer’s type system.
It’s likely to be a big saving though. For an extreme example, consider encoding a Boolean. If you want to encode arbitrary data, your length field needs to be at least 32 bits (unless you use a variable-length length encoding). You then likely need at least a 16 but type descriptor. Booleans are sufficiently common that you might encode true and false as separate types, so with an encoding where the type is the only fixed-sized header, you need two bytes for a Boolean but with a size-then-type encoding you need six bytes: 200% overhead. This isn’t so bad if you use something like LEB128 for your types and length encodings. Then you could get away with two bytes with length first, one for type-first-length-optional, so only a 100% overhead.
This will influence parsing speed but that matters most when you have nested types where you want to lazily parse things and even then the cost of a small lookup to know that a chunk is an array and has a length field is small.
Actually, 1 is enough if you reserve two types for Booleans (one for True and one for False). Which I guess you can do most of the time, that with having 256 different type values to play with. This should make your case even stronger:
That is exactly what the text that you quoted proposed, but it assumed a 2-byte fixed-width type descriptor. If you have a single-byte or a variable-length encoding (as proposed later in my post) then the size is smaller.
🤦 I read 16 bytes in your comment and I auto-completed to 1-byte for type, 1 byte for the boolean value. I’ve never worked with complex file formats, so I tend to assume that most of the cases 256 different (main) types is enough, in which case 1-byte type tags are a given.
Oops.
The author is confusing field ordering with semantics.
In most record-oriented binary protocols you’ll see a fixed-size header first, with the header containing the length and (in some protocols) a type. It doesn’t matter which order those are in because the whole header is read as a unit, so whether it’s encoded as
[length][type]
or[type][length]
just doesn’t matter at all.The most well-known example would be IP. An IPv4 header (without options) is 16 octets, with the packet length in octets [2..3] and protocol (type) in octet 9. You wouldn’t read the header one byte at a time, you read the whole
struct iphdr
at once then decide what to do with the packet by inspecting its subfields (in any order you want).This is also the traditional format used by Unix kernels to communicate with userspace. Linux’s FUSE protocol has
struct fuse_request_header
starting with(u32 length, u32 opcode)
, but the length is vestigial since each FUSE record is always transferred in a single syscall. Or ancillary data fromrecvmsg
, wherestruct cmsghdr
contains(length, type=(cmsg_level, cmsg_type))
.Where TLV comes into play is when you’ve got something like Protobuf, which mixes fixed-width and dynamic-width values in the same record. Parsing a Protobuf will yield something like
[BITS32, BITS32, BITS64, DYN(length=1234), BITS32]
. It wouldn’t make any sense to encode asfixed32
as(4, BITS32)
because the type itself contains all the information needed to know the length, so as an optimization the length is elided for theBITS32
type.Yes exactly, the issue is whether you allow partial parsing, and whether the encoding of the length (or whether it’s present at all) depends on the type.
Calling it TLV vs. LTV is very confusing, because that’s NOT the issue.
As an analogy, I would compare CSV and TSV:
Ditto for jumping to a cell within a field.
Unfortunately TSV (as I understand it from your comment) still requires you to scan every single character for that newline. Well, at least the code can be simpler…
That’s true, but I don’t think it’s slow. Scanning for newlines is essentially free – the cost is loading from memory or disk
Although you could probably make the same argument for CSV … you can probably parse it as fast as you can stream it from memory
BUT that’s only true in languages like C++ or Rust
In Python, JavaScript, Erlang, etc. I think parsing TSV will be much faster than CSV in practice. I learned the hard way how all the little allocations get you :)
The history of JSON parsers in Python, JavaScript, etc. is similar (e.g. I remember how simplejson got into the Python stdlib).
It can be made arbitrarily fast in C or C++, but ironically JSON is more useful/common in the dynamic languages, and it’s definitely slow there
I’m not sure I like this analogy, I’ve written a generalized DSV parser, and I can assure you, there is some tool somewhere that emits what you don’t expect (like newlines in fields), and even with the small customer base we had at the time, exceptions and nonsense abounded.
P.S. if anyone cares, my favorite flavor is CSV with quoted fields and quotes within fields doubled. No other escaping is necessary
Nothing in your comment contradicts what I said: TSV can be partially parsed in the sense that you can skip to an arbitrary row without necessarily knowing where the cell boundaries are.
CSV can’t be partially parsed in this sense.
“A generalized DSV parser” doesn’t really relate to what I’m talking about.
I’ve seen nominally TSV files with newlines in quoted fields. That contradicts your statement that TSV doesn’t allow newlines other than between records. I also don’t think there’s a formal spec for either, and if there is, some LOBware somewhere will do it wrong anyway.
An alternative solution is to have a constant-length header and the length only refer to length of the data itserf without headers/footers.
For example, PNG chunks have Length (specifying the length of the chunk data excluding Length, Type, and CRC fields), then Type, then Data, if any, and then CRC.
LTV means you always encode a length, even when the type has a fixed size, which could be quite wasteful, depending on what your data is.
I imagine that encoding the size in the type tag (perhaps the bottom 3 or 4 bits encode the size of the type in sorta powers of two) or having a lookup table from type to size might be more efficient if you’re transmitting a lot of fixed size values that you can’t for whatever reason encode as arrays or large structs (where you could have one length and type tag shared covering many values).
Lich is the best LTV format. :-)
A little weird that it has both length and end delimiter.
Reminds me of bencoding which has either length or end delimiter.
I think of it as an error correction byte.
I 100% agree with this, having the length on the outside means you are building a useful abstraction. I’ve built little toy protocols and used
tokio_util::codec::length_delimited
to do exactly that.I think that one reason why people might structure it differently is that when it comes to keeping stuff in code, we usually structure it the other way. For example, if this is what your messages look like:
Now, your enum “in-code” works the other way: you have the type on the outside (represented by the enum variant) and then for every variant, you may have an embedded length (for example, the
String
as a variably-sized type has an internal length field and an allocation for the data).So one form is useful on-the-wire, while the other variant is useful in-memory.
TLV is commonly useful on the wire because tons of types don’t need a length, so that’s wasted space. If you encode a u8 as TLV, you have
length(tag)
overhead because the length is subordinate to the type and you can just skip it for fixed-size types. As LTV, your overhead islength(len) + length(tag)
of overhead.At that point you might as well use a non-binary format like JSON - usually binary formats are used in order to be efficient.
Your objection makes no sense, my point is that TLV is more efficient than LTV.
That wasn’t an objection but agreement :) Perhaps I should’ve commented on the parent post instead.