Great article! I’ve been reading the same book, and also the excellent survey “Modern B-Tree Techniques”, as I’m writing my own key-value storage engine.
I agree with you that most data-structures books do a poor job of explaining B-trees, because they gloss over the difficulties of working with block storage. It’s also worth noting that, these days, B-trees can have advantages even in RAM because they have better locality of reference. So you could tune a B-tree around your CPU’s L1 cache size (instead of a disk’s block size). There are also cache-oblivious B-trees, which use a more complex layout to perform well no matter what cache sizes the memory hierarchy has.
Cell layout within a page really is just like writing a memory allocator. I worked on a malloc implementation back in the 90s, and had major flashbacks while implementing this! But the nice thing is that everything is smaller, so instead of pointers linking free blocks you can just use 16-bit offsets.
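To make that concrete, here’s a minimal sketch of the idea (invented names and cell layout, not my actual code): free cells within a page are chained by 16-bit offsets from the start of the page rather than by pointers.

    #include <cstdint>
    #include <cstring>

    constexpr uint16_t kPageSize = 4096;   // cells live inside one page
    constexpr uint16_t kNil = 0xFFFF;      // end-of-free-list sentinel

    struct Page {
        uint8_t  bytes[kPageSize];
        uint16_t freeHead = kNil;          // offset of first free cell

        // A free cell starts with {next offset, cell size}: 4 bytes.
        void addFreeCell(uint16_t off, uint16_t size) {
            std::memcpy(&bytes[off], &freeHead, 2);
            std::memcpy(&bytes[off + 2], &size, 2);
            freeHead = off;
        }

        // First-fit allocation: walk the chain of 16-bit offsets.
        // Returns kNil if nothing fits (caller would compact the page).
        uint16_t alloc(uint16_t want) {
            uint16_t prev = kNil, cur = freeHead;
            while (cur != kNil) {
                uint16_t next, size;
                std::memcpy(&next, &bytes[cur], 2);
                std::memcpy(&size, &bytes[cur + 2], 2);
                if (size >= want) {        // unlink and hand out this cell
                    if (prev == kNil) freeHead = next;
                    else std::memcpy(&bytes[prev], &next, 2);
                    return cur;
                }
                prev = cur;
                cur = next;
            }
            return kNil;
        }
    };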
The thing I’m still figuring out is “prefix compression” — usually lots of keys in a node will have a common prefix, so you can save space by factoring those out. So far I’ve got it working when all the keys in the node have a single common prefix, but that’s leaving money on the table, so to speak.
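For the single-common-prefix case, roughly what I mean is this (a simplified sketch with invented names):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // A node stores one prefix shared by all its keys, plus per-key suffixes.
    struct PrefixedNode {
        std::string prefix;                 // common to every key in the node
        std::vector<std::string> suffixes;  // full key = prefix + suffix

        std::string keyAt(std::size_t i) const { return prefix + suffixes[i]; }
    };

    // Length of the prefix two keys share.
    std::size_t sharedPrefixLen(const std::string& a, const std::string& b) {
        std::size_t n = std::min(a.size(), b.size()), i = 0;
        while (i < n && a[i] == b[i]) ++i;
        return i;
    }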
I wish I could link to my code — about 2000 LOC of clean C++ — but I’m doing this as an investigation project for $DAYJOB so it’s closed-source at least until/unless we productize it.
I came here to bring up prefix compression as even more important than suffix truncation, because over 99% of all nodes in a db are leaves (assuming > 100 index fan-out, which is pretty common when employing prefix encoding + suffix truncation: with fan-out f, each level above the leaves has roughly 1/f as many nodes, so interior nodes total about leaves/(f-1), i.e. under 1% when f > 100). Prefix compression saves a lot more space. Suffix truncation is still really important for index node density, and thus cache performance and pretty much everything else, though.
I tried a few approaches to prefix encoding in sled. At first I tried to be more clever than the simple approach described in Modern B-Tree Techniques (maybe what you’re thinking about trying when you say you’re “leaving money on the table”), by storing with every key the specific prefix it shares with the node’s low key. This made search a bit more bug-prone and slower than the simpler technique. For large data sets, it also wasted a huge amount of space, because as a data set grows, the prefix you can expect all keys on a node to share gets longer. Storing a per-key shared prefix used a TON more space on real datasets than just sticking with a single shared prefix length. Even bounding the per-key prefix length to a single byte was incredibly wasteful compared to using a single shared prefix length.
Now, any time a node is split or merged, sled figures out the common prefix length between the node’s low and high separator keys, and if this changes in a merge or split, the stored keys are trimmed or grown so that a single shared prefix length can be used for all keys. This makes key search way faster, which benefits everything the database does. For the “infinity” node at the far maximum end of the keyspace, you can decide whether to recalculate shared prefixes more often based on the current maximum key, or just skip prefix encoding altogether, since an undefined maximum separator key doesn’t play well with the simple technique.
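In terms of the PrefixedNode sketch upthread, the re-encoding step looks roughly like this (hypothetical code, not sled’s actual internals):

    // The keys in a node all lie between its low and high separator keys,
    // so the prefix shared by those two keys is shared by every key in
    // the node: cheap to maintain, because only two keys get compared.
    void reencode(PrefixedNode& node,
                  const std::string& lo, const std::string& hi) {
        std::size_t newLen = sharedPrefixLen(lo, hi);
        std::size_t oldLen = node.prefix.size();
        if (newLen == oldLen) return;
        if (newLen > oldLen) {
            // Prefix grew: the leading bytes of every suffix are now
            // redundant, so trim them off.
            for (auto& s : node.suffixes) s.erase(0, newLen - oldLen);
        } else {
            // Prefix shrank: give the trimmed bytes back to each suffix.
            std::string giveBack = node.prefix.substr(newLen);
            for (auto& s : node.suffixes) s.insert(0, giveBack);
        }
        node.prefix = lo.substr(0, newLen);  // lo starts with the shared prefix
    }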
This is precisely the type of golden information/knowledge I like to read when reading comments here. Thank you for taking the time to write this.
The Redwood project in FoundationDB uses prefix compression for B-tree blocks, but only a single shared prefix for the entire block. IIRC, tests of more granular prefix compression came back as too CPU-intensive (since you have to check more keys than just the first/last).
That’s something that I haven’t delved into: the cost trade-offs between writing (encoding) and reading (parsing) nodes. Like, is it worth spending more time compressing keys, which will both speed up searches and allow more keys to fit in the node? Sounds like in your use case it wasn’t.
This probably depends a lot on storage type. I hear on some servers I/O has become nearly as fast as memcpy, but I’m working on tiny things with microSD cards, where I’m currently seeing write speeds of 1MB/sec. 🐢
Another big revelation I forgot to mention: all the descriptions of B-trees describe the nodes as having a fixed capacity/fanout, as though this is some parameter of the tree as a whole that you have to choose up front. But when you have variable-length keys or values, i.e. almost always, it’s hugely inconvenient to use the same child or value count for every node. And as far as I can tell, it’s completely unnecessary. I just pack as many keys/values into a node as will fit in its page, and split it when they don’t, and merge adjacent nodes when they’re less than half full by byte-count.
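Concretely, the split logic looks roughly like this (a simplified sketch with invented names, not the real code):

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    constexpr std::size_t kPageSize = 4096;

    struct Leaf {
        // Sorted key/value pairs; entry count is whatever fits in the page.
        std::vector<std::pair<std::string, std::string>> entries;

        // On-disk size, assuming a 4-byte per-entry header for the lengths.
        std::size_t byteSize() const {
            std::size_t n = 0;
            for (const auto& e : entries)
                n += 4 + e.first.size() + e.second.size();
            return n;
        }

        bool overflows()  const { return byteSize() > kPageSize; }
        bool underflows() const { return byteSize() < kPageSize / 2; }
    };

    // Split where the running byte count crosses the midpoint, so both
    // halves come out roughly half full regardless of entry sizes.
    std::size_t splitPoint(const Leaf& leaf) {
        std::size_t half = leaf.byteSize() / 2, acc = 0, i = 0;
        for (; i < leaf.entries.size() && acc < half; ++i)
            acc += 4 + leaf.entries[i].first.size()
                     + leaf.entries[i].second.size();
        return i;  // entries [0, i) stay, [i, n) move to the new sibling
    }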
What does fanout mean in the context of B-trees?
The number of children of an interior node, or the number of key/value pairs in a leaf. Depends on the page size and key lengths, but commonly over 100.
High fanout is good because it reduces the number of nodes in the tree, which reduces I/O costs. It’s part of what makes B-trees so efficient. For example, if the root and all interior nodes can be kept cached in RAM, then it takes at most one disk access to read any value (assuming the value doesn’t need an overflow page).
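To put rough numbers on it (assuming 4 KB pages and a fanout of 100, both invented but typical figures): a three-level tree has 1 root, 100 interior nodes, and 10,000 leaves. Caching the root and interior nodes costs about 101 × 4 KB ≈ 400 KB of RAM and buys single-I/O access to roughly 40 MB of leaf data; one more level means ~40 MB of cached pages covering ~4 GB of leaves.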