1. 42

  2. 18

    Looking forward to benchmarking against the upcoming sled 1.0, which is getting a pretty major redesign before its first stable release as well :) Nice to have more folks writing databases in Rust.

    Probably about 95% of the effort of building a storage engine is testing it sufficiently. You should really consider it untested until you’re using things like failpoints to trigger IO failures, as almost all outages due to storage engine bugs are due to error handling code that exists but was trivially broken due to never being triggered under test. Combining failpoints with quickcheck has caught thousands of bugs in sled over the years, and I recommend adopting similar approaches that give you nice deterministic, shrunken replay when it finds the inevitable hundreds of bugs over time.

    1. 8

      Another thought about this: if you’re interested, we could try to come up with a common page representation, shared in a separate crate. In my particular use case (Pijul), I can’t possibly use Sled, since (1) I can’t easily parallelise my operations and (2) I really need fast forks. But thinking about a common binary format could have the advantage that different libraries can at least read each other’s format.

      1. 5

        It might be hard to have a common format that does well for both approaches. Multi-process tends to benefit from the lmdb-style full CoW on each write to avoid extensive recovery costs for the participants. This means that the cost of merging an individual update into a node should be pretty fast.

        The new sled architecture tries to more aggressively compress nodes to reduce disk usage and in some cases avoid doing any search while traversing a node at all. If it detects that all children follow a fixed “stride”, then accessing a child is a branch-free lookup and the keys don’t need to be stored at all (other than the node’s lowest bound). But to facilitate this and other optimizations, it is a bit more expensive to merge a single item into the node, as more things need to be considered for potential optimizations. To make it practical, we keep an overlay of added and removed children on top of the backing node, and only merge it periodically to get some nice icache locality. The backing node can be arc’d so we only need relatively small new allocations for each update to the overlay, which currently just uses an im::Ordmap. The logical effect is the same: it looks like a CoW from far away, but we only logged a few dozen bytes for a small update instead of rewriting the entire page, and we didn’t need to copy very much in-memory but rather just CAS a pointer to update a functional tree structure. Overlays are also merged once things fall out of cache or after a certain amount of time to minimize the memory usage spend to reduce update effort.

        You’re more than welcome to use any of the sled node code (in node.rs) but I think that if you follow an LMDB approach that favors full CoW for minimizing multi-process concerns, it might be an awkward fit.

        1. 2

          That sounds super exciting, how fast is that new version on random benchmarks? on your best benchmark?

      2. 2

        Nice to have more folks writing databases in Rust.

        I thought the same when Sled was released, actually, Sanakirja is three years older.

        Probably about 95% of the effort of building a storage engine is testing it sufficiently

        I agree 100%, that is also my experience with Sanakirja since 2015.

        1. 1

          I’m not familiar with the codebase but I haven’t been able to find the tests relating to sanakijira, could you point me to them? I like to nerd out to how people test their db’s.

          1. 5

            I have very limited tests for this new release: https://nest.pijul.com/pmeunier/sanakirja-1.0:main/UAQX27N4PI4LG.BMAAA

            I’ve also used this script quite a bit: https://nest.pijul.com/pmeunier/sanakirja-1.0:main/ACB4A27ZMFLRK.BAAAA

            I just noticed that I had forgotten to push one important debugging tool, plotting the entire structure in graphviz: https://nest.pijul.com/pmeunier/sanakirja-1.0:main/T7QB6QEPWBXAU.BMAAA

            Older versions had 100s of different tests, and I still have to do this for this one. Unfortunately, some of this was lost in the different versions of Pijul. I probably still have the original code in a backup somewhere, and it certainly is on crates.io.

            I don’t have the same testing issues as Sled though, since your concurrency is much trickier than mine (kudos for that, it seems VERY challenging).

            1. 8

              I’m really jealous of that graphviz code. I keep meaning to implement something similar to go beyond the simplistic debug implementation that a sled tree currently has.

              I’ve been kind of surprised how little of sled’s bug distribution has been in the concurrent stuff. While concurrency bugs tend to be more annoying to debug (it’s not uncommon to need to sift through some multi-gigabyte log trace file with lots of filtering and notes for creating a forensic timeline) they tend not to be all that hard to get to pop out. Adding a simple randomized sleep right before any operation that could send information to another thread, combined with spinning up thousands of threads on a 96-core Arm server, is able to explore pretty staggering interleaving spaces in a short time, despite being totally naive and ultimately probabilistic.

              Another really effective test for getting things to pop out quickly is to run some workload that is constantly asserting some invariants and also performing operations that interleave with the keyspace where invariants are being asserted. Then add conditionally compiled randomized crash code that just aborts the process between write operations, and have it restart on every crash in a loop. This gets so many crash consistency bugs to jump out with extremely little code.

              But the quickcheck tests that generate sequences of operations to apply against the db and a model at the same time and in a totally single-threaded way most of the time (just a hashmap is a great model) has been the thing that has caused the most bugs to pop out, and it also has the advantage of giving you a minimized sequence of operations that triggered the bug, often making it obvious immediately what happened. It’s a small machine that teaches you what your biases are.

              Those 3 techniques can maybe be coded in 100 lines combined, but might keep you busy for months fixing all of the bugs that start jumping out :P They all avoid the “pesticide paradox” as well, which happens when you write some fixed unit or integration tests that don’t have any randomization, but then the bugs that exist in your code simply become immune to the tests because they aren’t evolving. You don’t really get that much safety from the test, you just have bugs everywhere you’re not exploring. Keep the bugs on the run by always running randomized tests :)

              1. 2

                Did you end up building any generic infrastructure for the stateful quickcheck testing? Hypothesis in Python has a nice setup for stateful property testing, but seems like in Rust libraries so far it’s all roll-your-own.

      3. 4

        Exciting! I hadn’t heard of this engine before. Yes, I’m another person who finds key/value stores sexy. Maybe because they’re more general-purpose than higher level database engines.

        Beating LMDB is impressive. But LMDB seems stale — AFAICT it hasn’t been significantly updated in years. I’ve been using libMDBX, a highly evolved fork with more features and better performance (or so the developer claims; I haven’t benchmarked it.) I’d be curious to know how it holds up against Sanakirja.

        I’m also curious about the overall design and how it uses memory-mapping. LMDB has some really clever tricks. Is there a design document somewhere?

        1. 3

          Is there a design document somewhere?

          Not really, but the code is split in a way that is supposed to make it approachable: if you look at the “put” and “del” modules here, they’re rather short. Put is even very short (182 lines).

          Then if you want to get into the details of how nodes of the B tree are implemented, I suggest looking at the Sized version first (but the details are a bit ugly).

        2. 3

          I learned a lot from reading this article, and it definitely did something to reignite my excitement regarding Pijul. Can’t wait for it to be “ready enough” for me to play around with confidence.