We explored a number of these schemes for wasm. LEB128 turned out to be the best one because almost all the integers encode in a single byte, and so the branch predictor is very happy.
With a different distribution of values, another scheme may be faster.
Yes, very interesting; the issue has some data sets showing that 90-94% of values encode in a single byte.
Although the summary seems to be not “LEB128 is the best one” but rather “LEB128 is more familiar and it’s not obvious that the cleverer encodings are better on real-world data”, which is fair.
And the comment linked in the issue is interesting and shows the other side, saying PrefixVarint is “massively faster”, which you would expect for other distributions: https://news.ycombinator.com/item?id=11263378
This serves the same purpose as LEB128, which is used in WebAssembly and protobufs (“varints”), except it has the property that you can tell how long the encoded integer is based on the first byte. (I think that the current protobuf maintainer lamented this issue because of the extra branches, e.g. for upb a long time ago)
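For concreteness, a minimal unsigned LEB128 codec in Python: 7 payload bits per byte, with the high bit as a continuation flag, so the decoder takes a branch per byte (the branches that first-byte-length schemes like PrefixVarint avoid):

```python
def uleb128_encode(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit = "more bytes follow"."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)   # continuation bit set
        else:
            out.append(b)
            return bytes(out)

def uleb128_decode(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Returns (value, new_pos); raises on truncated input."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):         # high bit clear: last byte
            return result, pos
        shift += 7
```

Note that you can’t tell the total length from the first byte here, which is exactly the property the first-byte-length schemes add.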
With UTF-8, the leading byte likewise tells you the length, except the scheme doesn’t support arbitrary-length integers: you’re limited to 7 or 8 bytes total (and UTF-8 itself is capped at 4 bytes, since Unicode code points only need 21 bits).
I’m interested in comments on the XIP encoding. It seems less elegant/consistent, but probably faster, and it doesn’t require much code to implement anyway. As far as I can see, it has 3 special cases and then the general case.
The SQLite4 encoding is similar in concept but a bit simpler — only two magic ranges in the first byte. In my current project I’m using the next simpler encoding, with only one magic range: if the first byte is 248-255 then its value minus 247 is the number of bytes following. You read those bytes as big-endian and add 248.
Like the SQLite format, this has the benefit that encoded numbers sort lexicographically. That’s important for me because I sometimes use them as b-tree keys.
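Here’s a sketch of the encoder as I understand my description above (one magic range, 248-255), including a check of the sort property:

```python
def encode(n: int) -> bytes:
    """One magic range: 0-247 encode as themselves in one byte;
    otherwise the first byte 247+k says k big-endian bytes of
    (n - 248) follow."""
    if n <= 247:
        return bytes([n])
    m = n - 248
    k = max(1, (m.bit_length() + 7) // 8)   # 1..8 bytes for a uint64
    return bytes([247 + k]) + m.to_bytes(k, "big")

# Encoded order matches numeric order, so these work as b-tree keys:
values = [0, 247, 248, 503, 504, 70000, 2**40]
assert sorted(values, key=encode) == values
```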
My format was based on a faulty recollection of SQLite’s … upon rereading it, I’m inclined to switch to SQLite’s, because it’s better at medium-short numbers (3-5 decimal digits).
One annoying detail of parsing is that you have to be careful when reading the extension bytes: it’s tempting to just do a 64-bit read and then shift+mask, but if the value sits at the very end of a buffer this can read past the end and make Valgrind/ASan complain, or in the worst case segfault.
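Concretely, the careful version for my format just checks the remaining length before touching the extension bytes (a Python sketch; in C you’d do the same test before any wide read):

```python
def decode(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode the one-magic-range format; returns (value, new_pos).
    Bounds-checks before reading, so truncated input fails cleanly
    instead of reading past the end of the buffer."""
    if pos >= len(buf):
        raise ValueError("empty input")
    b0 = buf[pos]
    if b0 <= 247:
        return b0, pos + 1
    k = b0 - 247                          # number of extension bytes, 1..8
    if pos + 1 + k > len(buf):
        raise ValueError("truncated varint")
    return int.from_bytes(buf[pos + 1:pos + 1 + k], "big") + 248, pos + 1 + k
```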
Ah yes the lexicographical property is nice!
It does seem to make sense to have 1, 2 or 3 special cases, given that the WASM data above showed that 90-94% of integers would be in the first special case.
In the sqlite encoding, 1 byte can handle the integers 0-240, while in the LEB128 and XIP case, it can handle 0-127. That seems like it’s probably a win sometimes?
For IRs it would be a win if you have between 128 and 240 types, or 128-240 methods in a class (ironically common for generated code like protobufs), etc.
TBH I remember seeing the sqlite4 encoding years ago but not understanding it. Now I think I see the logic – if I had a lot of time I wonder how it would turn out on those wasm benchmarks :) It has the extra lexicographical property but also seems like a denser encoding.
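From memory of the sqlite4 varint spec (so treat the exact constants as unverified), the decode logic is roughly:

```python
def sqlite4_decode(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """sqlite4-style varint decode, from memory of the spec:
    0-240 is a literal; 241-248 adds a second byte; 249 adds two;
    250-255 mean (A0 - 247) big-endian payload bytes follow."""
    a0 = buf[pos]
    if a0 <= 240:
        return a0, pos + 1
    if a0 <= 248:
        return 240 + 256 * (a0 - 241) + buf[pos + 1], pos + 2
    if a0 == 249:
        return 2288 + 256 * buf[pos + 1] + buf[pos + 2], pos + 3
    k = a0 - 247                          # 3..8 payload bytes
    return int.from_bytes(buf[pos + 1:pos + 1 + k], "big"), pos + 1 + k
```

The offsets (240, 2288, …) are what make it denser than the power-of-two schemes while keeping the lexicographic ordering.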
Very interesting. I wonder whether they, or some of the commenters, are aware of and/or did consider the decimalInfinite format. Some time ago, I implemented a SQLite extension based on that. It was a fun little project, and it seemed to perform well, at least when compared with other libraries, such as mpdecimal (used by Python). My repo’s source code contains a description of the format, but I’d recommend reading the paper.
DecimalInfinite has two nice features: it is compact (optimal, in some sense), and order-preserving.
Hm interesting, but this is a floating point decimal format right? It’s definitely related but I wouldn’t consider it an alternative.
Interesting paper! But the author’s use of the word “decimal” throughout is confusing: at first I thought it meant BCD, then I decided they really meant “real number” or “floating point”, then when I got to the details I saw it specifically means real numbers in decimal scientific notation; this preserves their exact form without the rounding error of binary conversion.
So the priorities seem to be exact fidelity to a human-readable decimal, lexicographic sorting, and infinite range. The variable length is just a side effect.
By contrast, varints (the general term for what the XIP article talks about) prioritize compactness and fast conversion, and sometimes lexicographic sorting. They don’t provide infinite range; all the types I know about decode to an int64. They’re integer-only, and most I’ve seen don’t even support negative numbers (which are trickier; ask me if you want details).
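For the curious, the most common trick is protobuf’s “zigzag” mapping: it folds signed values into unsigned ones so small magnitudes stay small, after which any of the unsigned schemes above apply. A 64-bit sketch:

```python
def zigzag_encode(n: int) -> int:
    """Map signed -> unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    (64-bit version; protobuf's sint64 does the same)."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)
```

In Python the right-shift sign-extends, so this works as written; in C you’d use an arithmetic shift on an int64_t. Note this trades away lexicographic ordering: negatives and positives interleave in the encoded space.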