1. 18
  1. 7

    I prefer LMDB’s approach (described in this 2012 paper.) I am not an expert on this stuff, although I’ve been building my own B-tree storage manager for a few weeks, so I’m learning rapidly. Durability is hard!

    TL;DR: SQLite either updates the file in place but saves copies of the overwritten pages for rollback, or writes to a temporary log file and periodically flushes it by doing the above. LMDB treats pages as copy-on-write, although once a page has been copied once it can then be overwritten while in the same transaction. It does some very clever free-page management to provide MVCC.

    In practice LMDB is a lot faster than SQLite. There was a proof-of-concept project that replaced SQLite’s b-tree engine with LMDB and sped it up a lot, but sadly it was an old version and wasn’t kept up to date.

    1. 8

      In practice LMDB is a lot faster than SQLite.

      I feel like this needs some qualifications. There are many dimensions to performance (read vs. write, physical media like SSD, NVMe, etc.), and also hard tradeoffs that different projects reconcile in different ways (durability vs. performance).

      I’m not an expert either but these comments give more color on the subtleties:

      https://news.ycombinator.com/item?id=18413743

      Some people say LMDB loses ACID semantics to be fast in certain situations; some people say RockDB (LevelDB fork) is massively faster, etc.

      It’s very easy to make a proof of concept that’s faster (namely because it doesn’t have to work all the time, and because you probably don’t know if it doesn’t work all the time!).

      1. 1

        I think you’re referring to this comment about ACID:

        • LMDB by default provides full ACID semantics, which means that after every key-value write committed, it needs to sync to disk. Apparently if this happens tens of times per second, your system performance will suffer.
        • LMDB provides a super-fast asynchronous mode (MDB_NOSYNC), and this is the one most often benchmarked. Writes are super-fast with this. But a little known fact is that you lose all of ACID[…]

        Yes, just like SQLite, commits are expensive because they provide durability. But the performance of commits (the full disk flush) isn’t nearly as bad with SSDs. If you find it a problem, you can leave a transaction open and just commit-and-reopen periodically; at worst you’ll lose changes in the latest transaction.

        You might also check out libMDBX, a fork of LMDB with a lot of optimizations and improvements. I was working with it last year, because development of LMDB itself seems to have stopped or stalled since 2015.

      2. 6

        The LumoSQL project has benchmarks for most SQLite variants. Apparently the SQLite LSM tree has improved a lot and is now competitive with the LMDB backend.

        1. 2

          The SQLite LSM tree is in limbo, last I heard. It was part of a “SQLite 4 prototype that was abandoned in 2014. It’s not used in SQLite 3, unless there was a recent announcement I’ve missed.

          1. 4

            It was ported as an SQLite 3 extension in 2017, discussed here. There’s a short 2020 mailing list thread on it that suggests it’s still maintained, but not under active development due to little user interest so far.

            1. 2

              Yeah right, I actually meant to say the default KV store.

              By January 2020 the LumoSQL project concluded: Howard’s 2013 performance work is reproducible SQLite’s key-value store improved in performance since 2013, getting close to parity with LMDB by some measures

          2. 2

            Thank you for the link to the LMDB paper!