1. 7

  2. 3

    In short, Redis: 1987 edition. The design maps almost perfectly; this strategy is still highly relevant. Great read.

    There are some major optimizations used in modern implementations of this strategy, like Redis:

    • finer grained locking
    • background snapshotting
    • log folding / rewriting

    None of these is remotely necessary at moderate scale, and of course all of them subvert the goal of a simple implementation. On a modern machine with solid-state storage they are even less necessary. SQLite, for example, essentially uses the described locking scheme and performs well.

    Background snapshotting merely requires atomically creating a copy-on-write snapshot of the memory state and rotating the log file before beginning to write the snapshot out. At first glance that sounds nontrivial, but it’s quite straightforward: fork() your process.
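    A minimal sketch of what that looks like, assuming a process that holds its state in a dict and snapshots to JSON; all names here are illustrative, not from the paper:

```python
import json
import os

def snapshot(db, snap_path, log_path):
    """Rotate the op log, then fork(); the child inherits a copy-on-write
    view of memory, which is a consistent snapshot it can write at leisure."""
    os.rename(log_path, log_path + ".old")  # retire the current log
    open(log_path, "w").close()             # new ops go to a fresh log
    pid = os.fork()
    if pid == 0:
        # Child: memory is frozen as of fork(); the parent keeps mutating
        # its own copy without affecting us.
        with open(snap_path + ".tmp", "w") as f:
            json.dump(db, f)
        os.rename(snap_path + ".tmp", snap_path)  # atomic replace
        os._exit(0)
    # Parent: a real server would keep serving writes and reap the child
    # later; we wait here only to keep the sketch simple.
    os.waitpid(pid, 0)
```

    Once the child exits, the snapshot plus the fresh log fully describe the database, and the retired log can be deleted.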

    Folding logs incrementally rather than snapshotting is definitely the most complicated of these three optimizations. Instead of rotating logs on snapshot, incrementally rewrite the log to a minimal form. One possible implementation: rotate logs once they reach a certain byte size. In the background, read two log files, stable-sort the entries by key, coalesce the entries for each key into one log record, and replace the old log files.

    The coalescing rules: an insert followed by updates becomes a single insert; any run ending in a delete becomes a delete, or a no-op if the run started with an insert; a run of updates becomes a single update. Do this enough times and you’ll end up with a snapshot equivalent: a log of only inserts. The learned reader will recognize this strategy as SSTable merging, slightly modified to operate on op logs.
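    The folding pass might be sketched like this; the (key, op, value) record format and the function names are my own invention, not the paper's. It assumes records within each log are in time order and the second log is entirely newer than the first, so a stable sort by key preserves per-key time order:

```python
# Each log record is (key, op, value); op is "ins", "upd", or "del".

def coalesce(run):
    """Collapse all records for one key into at most one record."""
    key = run[0][0]
    started_with_insert = run[0][1] == "ins"
    last_op, last_val = run[-1][1], run[-1][2]
    if last_op == "del":
        # ins ... del cancels out entirely; upd ... del keeps the delete.
        return [] if started_with_insert else [(key, "del", None)]
    # Survivors keep only the final value; a run opened by an insert
    # stays an insert, a run of updates becomes a single update.
    return [(key, "ins" if started_with_insert else "upd", last_val)]

def fold(older_log, newer_log):
    """Merge two op logs into a minimal equivalent log."""
    merged = sorted(older_log + newer_log, key=lambda r: r[0])  # stable
    out = []
    i = 0
    while i < len(merged):
        j = i
        while j < len(merged) and merged[j][0] == merged[i][0]:
            j += 1                      # find the run for this key
        out.extend(coalesce(merged[i:j]))
        i = j
    return out
```

    Repeated application over successive log generations converges on the all-inserts log described above.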

    I also see one major error:

    If the working set can be retained in real memory, then the enquiry is extremely fast (this is the case with our name server and its 1 megabyte database). Otherwise the time will include the requisite number of page faults. The correct way to view this behavior is that the virtual memory system is being used to cache appropriate pages of the database in real memory. This should behave at least as well as any custom designed page-oriented caching scheme (actually better, since it has hardware support).

    Absolutely not. A cache-aware data structure like a B-tree will perform far better. True, hardware support for page faults in the memory management unit will be faster than a userspace page cache, but those savings are negligible compared to reducing hard faults. Besides, you could just implement the in-memory representation as a B-tree and see the same MMU benefit.

    Like the other optimizations I described, B-trees are overkill for a moderate scale use case. I’m just perplexed by this obviously false claim. Even on 1987 hardware, B-trees would be better.

    1. 1

      I needed to implement a simple (though not small) database, and I was unsure how to persist data to disk as it was my first such project. I was thrilled to find this paper: it is short, to the point, and explains a simple and easy-to-implement technique; the only thing I’m sad about is that I did not think of it myself :)