1. 27
  1. 16

    Could you go into more detail about why this round of embedded key-value stores is different than, say, Berkeley DB, which has been in the Python stdlib for decades? I’ve always been very confused about this hype, figuring it was just the Google-backed version of an existing idea.

    1. 9

      A lot of the ones mentioned in the article were developed for use in servers. They are much more scalable than Berkeley DB and, with LSM trees, can handle much higher write loads.

      For embedded use, LMDB and its spin-off libMDBX are a lot faster than BDB, but have nearly-compatible APIs. (The guy who built LMDB has a paper on how his project, OpenLDAP, had so many performance problems with BDB that he wrote LMDB as a replacement.)
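
      For a sense of the API, here’s a minimal sketch using the third-party py-lmdb bindings (the path and map size are made up; this isn’t from the article):

      ```python
      import lmdb

      # Open (or create) an environment; map_size caps the database size in bytes.
      env = lmdb.open("/tmp/example-lmdb", map_size=1 << 30)

      # Writes happen inside a write transaction, committed when the block exits.
      with env.begin(write=True) as txn:
          txn.put(b"uid:1000", b'{"name": "alice"}')

      # Reads use a read-only transaction; many of these can run concurrently.
      with env.begin() as txn:
          print(txn.get(b"uid:1000"))
      ```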

      1. 5

        > The guy who built LMDB has a paper on how his project, OpenLDAP, had so many performance problems with BDB that he wrote LMDB as a replacement.

        That’s very interesting. I usually think of an LDAP backing store as the canonical example of something that’s almost entirely unconcerned with write performance. (Because directories are read so frequently and updated so rarely.)

        edit: Assuming this is the paper in question, it seems to me that read optimization was the focus of MDB development, not write loads. But it sounds like some of the design decisions they made helped quite a bit with write performance as well.

        1. 1

          Yes, my comment about higher write loads was specific to LSM trees, not LMDB. Sorry for the confusion.

        2. 3

          As I remember it, the main reason for the move away from BDB was not performance; it was the license change.

        3. 6

          My college database course had us write a SQL engine on top of BerkeleyDB, and that was 8 years ago. I was surprised to learn just now that it was initially released in 1994. Page 6 of this paper from this year shows BerkeleyDB winning in most of the workloads they tested. (The paper is “Performance Comparison of Operations in the File System and in Embedded Key-Value Databases” by Hines et al.)

          1. 4

            Interesting paper, but they’re pretty vague about the setup: they ran the tests in a VM provided by the university, which adds a lot of unknowns (what kind of mass storage? Was anyone else using the VM?), and they didn’t specify which filesystem they used. I suspect they’d also get different results on Apple platforms and Windows.

            1. 1

              Agreed. I wish they posted their code so we could try it on other systems.

          2. 2

            It would be great to see more examples of situations where each is better. The article mentions:

            > It can allow you to swap B-Trees (traditional choice) for LSM Trees (new hotness) as the underlying storage method (useful for optimizing write-heavy workloads).

            I don’t think LSM trees are strictly better than B-trees if your only requirement is a write-heavy workload. You also need the index characteristics an LSM tree provides (sequential IO), and you need to be okay with the compaction characteristics. Cassandra, for example, uses this structure. I distinctly remember compactions being something that was tricky to optimize well (although it’s been a looong time since I’ve used it).

            The original paper goes into more detail, but it’s been a long time since I’ve read it. Google might have made them more popular, but they weren’t invented at Google. Like anything in CS, it’s just a different data structure. There are tradeoffs, it depends on the use case, and ymmv.

            1. 4

              Sure, I’m making generalizations; precision requires benchmarking, and I’m not even talking about any specific code in this post. But if you just Google “lsm tree vs btree benchmarks” you’ll find a bunch, and they mostly agree with the generalization I made.

              For example:

              Here’s a Mongo benchmark.

              Their takeaway: “If you have a workload that requires a high write throughput LSM is the best choice.”

              Here’s a TiKV benchmark.

              Their takeaway: “The main purpose for TiKV to use LSM-tree instead of B-tree as its underlying storage engine is because using cache technology to promote read performance is much easier than promote write performance.”

            2. 2

              Yeah this is a good question. I’d like to know myself.

              1. 8

                When you talk about key-value stores, there are basically 2 data structures that are implemented for storing the data: B-trees and LSM-trees. B-trees have been around for a long time (wikipedia tells me since 1970), and so the original embedded DBs all used a B-tree as their underlying data structure. There are differences in all other aspects (manual page cache vs mmap, variations of B-trees, etc), but they’re all implementations of a B-tree. B-trees are more efficient at finding/reading data than at inserting or updating data.

                LSM-trees, on the other hand, were invented in 1996 (again, so says wikipedia). They were designed to handle writes/updates more efficiently than B-trees. The observation was that the “heavy” work of sorting can be done in-memory, then a merge-sort style operation can be performed, which incurs a sequential read from and write to disk, which is typically very fast. This spawned a number of implementations (LevelDB, RocksDB, etc) which also varied in a number of aspects, but most specifically in the strategy around when to merge data that has been persisted to disk. There are 2 main camps here: merge when a certain number of files-per-level have been written, and merge when a certain size-per-level has been written. These strategies vary in performance characteristics (write amplification, space amplification, etc) and can be chosen based upon the workload.
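
                To make the write path concrete, here’s a toy sketch (not any particular engine’s code): recent writes sit in an in-memory memtable, which is sorted and flushed sequentially to an immutable run on disk once it gets big enough; reads check the memtable first, then the runs from newest to oldest. Real engines add a write-ahead log, bloom filters, and the merge/compaction strategies described above, all of which this sketch omits.

                ```python
                import bisect, json, os, tempfile

                class ToyLSM:
                    """Toy LSM tree: an in-memory memtable plus immutable sorted runs on disk."""

                    def __init__(self, directory, memtable_limit=4):
                        self.dir = directory
                        self.limit = memtable_limit
                        self.memtable = {}   # buffer of recent writes, sorted only at flush time
                        self.runs = []       # newest-first list of flushed run files

                    def put(self, key, value):
                        self.memtable[key] = value
                        if len(self.memtable) >= self.limit:
                            self._flush()

                    def get(self, key):
                        if key in self.memtable:                # newest data wins
                            return self.memtable[key]
                        for path in self.runs:                  # then search newest run to oldest
                            with open(path) as f:
                                run = json.load(f)              # sorted list of [key, value] pairs
                            keys = [k for k, _ in run]
                            i = bisect.bisect_left(keys, key)
                            if i < len(run) and run[i][0] == key:
                                return run[i][1]
                        return None

                    def _flush(self):
                        # The "heavy" sorting happens in memory; the disk write is one sequential pass.
                        run = sorted(self.memtable.items())
                        path = os.path.join(self.dir, f"run-{len(self.runs)}.json")
                        with open(path, "w") as f:
                            json.dump(run, f)
                        self.runs.insert(0, path)               # newest run first
                        self.memtable = {}

                db = ToyLSM(tempfile.mkdtemp())
                for i in range(10):
                    db.put(f"key{i:02}", i)
                print(db.get("key03"))   # found in the oldest run
                ```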

                1. 1

                  I’m aware of the data structures. :) But I’m not as aware about every major implementation and their ebb and flow in popularity/use over time.

            3. 5

              I read the article and didn’t get the answer. What is the big deal?

              Why is the starting point for comparison a concurrent hashmap rather than just a hashmap?

              1. 6

                Because databases are concurrent, and it’s extremely common for the data model / source of truth to be accessed by multiple threads.

                1. 2

                  Not so much when embedded.

                  1. 2

                    Why would embedded be less likely to use threads? Honest question, I have next to 0 embedded experience but the statement surprises me nonetheless.

                    1. 1

                      Because it is embedded?

                      He goes for a concurrency-safe data structure implementation without stating a single reason for doing so. If you need a hashmap, then use a hashmap. If it needs to be concurrency-safe for whatever reason, then either guard it with a semaphore manually and/or use a concurrency-safe implementation. There is absolutely nothing inherently concurrent about needing a map. If such needs arise, they have whatever specific motivation, irrelevant to this topic.

                      But to answer your question more directly: embedded means owned by the application. The application can use it whichever way it wants. The direct and most common case is reading and writing normally whenever needed. As opposed to databases with a network layer, which made that design choice specifically to support access by simultaneous clients.
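
                      For what it’s worth, the “guard it yourself” option is only a few lines in most languages; a rough Python sketch (the class name is mine):

                      ```python
                      import threading

                      class GuardedMap:
                          """A plain dict behind a lock: you only pay for locking if you ask for it."""

                          def __init__(self):
                              self._lock = threading.Lock()
                              self._data = {}

                          def put(self, key, value):
                              with self._lock:
                                  self._data[key] = value

                          def get(self, key, default=None):
                              with self._lock:
                                  return self._data.get(key, default)
                      ```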

                      1. 7

                        Embedded vs networked is orthogonal to the question of concurrency. If you are under heavy OLTP load, you are going to want concurrent writes (regardless of whether you are using threads or async) to be able to maximally fulfill requests.

                        1. 1

                          Thank you, this makes much more sense to me.

                        2. 3

                          Embedded databases, mostly SQLite, are extremely common in desktop and mobile apps, which are almost always multithreaded to avoid GUI jank and to insulate against unpredictable network latency. Apple’s high level persistence API (Core Data) uses SQLite. My day job is architecting a mobile-optimized K/V store (Couchbase Lite) that’s used by a lot of enterprise apps. Concurrency is absolutely a requirement.

                          SQLite does get used a lot in actually-embedded systems like automobiles and industrial controllers. Concurrency is also common there, because these systems often need to react in real time whereas their flash storage is relatively slow.
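
                          As a rough illustration of that pattern with SQLite (the file name is made up): in WAL mode each thread opens its own connection, and readers can proceed while another thread writes.

                          ```python
                          import sqlite3, threading

                          DB = "app-cache.db"   # hypothetical app database

                          def writer():
                              conn = sqlite3.connect(DB)                 # one connection per thread
                              conn.execute("PRAGMA journal_mode=WAL")    # readers don't block this writer
                              conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
                              with conn:                                 # commits on success
                                  conn.execute("INSERT OR REPLACE INTO kv VALUES ('greeting', 'hello')")
                              conn.close()

                          def reader():
                              conn = sqlite3.connect(DB)
                              print(conn.execute("SELECT v FROM kv WHERE k = 'greeting'").fetchone())
                              conn.close()

                          w = threading.Thread(target=writer)
                          w.start(); w.join()                            # let the schema and row land first
                          r = threading.Thread(target=reader)
                          r.start(); r.join()
                          ```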

                2. 5

                  surprised to not see https://photondb.io/2022/08/15/bw-tree.html listed as “the new hotness”.

                  1. 2

                    I’m not too familiar with BW-trees, but from what I’ve read so far it sounds like they are mostly aimed at in-memory operations. This makes them less attractive in the typical embedded DB use case on a client or embedded device, which has less RAM to throw around.

                  2. 4

                    Wait a second, does FoundationDB have some additional usage mode that allows for embedding it in your app like eg RocksDB? Last time I used it, FoundationDB was client-server based.

                    1. 3

                      Ah crap you’re right. I misread docs. Editing…

                      Pushed and credited you. Thank you. :)

                    2. 4

                      Almost all of these storage engines write to files on a filesystem, which in turn takes these serialized trees-of-pages and writes them back into trees-of-blocks on disk. It would be interesting to see these engines run on block devices directly and avoid the intermediate conversion. For example, could ZFS’s DMU be used directly without the POSIX layer? There were talks on the mailing list a long time ago, but I don’t think anything came out of it. The kernel interfaces make it so hard to even experiment, though. SPDK looks interesting.

                      1. 3

                        Ceph OSDs running the BlueStore backend use RocksDB directly on a raw block device for metadata storage.

                        Ext4 has extents, long ranges of contiguous blocks of a file, to save on block pointers, avoiding the problem you’re talking about. This only helps for single-file databases, so Postgres is out. However the directory contents of Postgres’ backing storage are almost certainly 100% resident in the filesystem cache, so it’s a non-issue in practice.

                        1. 1

                          Embedded databases like BDB and SQLite are most popular on client systems where either you don’t have permission to do block-level I/O, or you don’t want the end user to have to custom partition a disk just to use your app.

                          SQLite does get used in a lot of embedded systems where there’s no real OS to get in your way, and it has a virtual filesystem interface that allows the client to do the block-level I/O if it wants. But it turns out that doing block-level I/O on flash storage is a real pain if the controller hardware doesn’t have an adapter to take care of stuff like block subdivision and erasing and wear-leveling. And the simple SPI flash interfaces on microcontrollers generally don’t, in my experience.

                        2. 1

                          This sounds like the inverse of AWS DynamoDB. You build your own indices in the form (kind-of) of ${TABLE_IDENTIFIER}_${PRIMARY_KEY}_${ROW_IDENTIFIER}: ${ENCODED_VALUE} (where ENCODED_VALUE is the entire document), and it’s built (AFAIK) on top of Postgres (or maybe a different SQL DB).
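
                          A rough sketch of that kind of key layout on top of a flat, ordered K/V store (the separator and names are mine, and a real system would escape the separator and think harder about sort order):

                          ```python
                          SEP = "\x00"   # hypothetical separator; assumes it never appears inside a component

                          def index_key(table, primary_key, row_id):
                              # Flatten the hierarchy into one sortable string key.
                              return SEP.join([table, primary_key, row_id])

                          # A plain dict stands in for the ordered K/V store here.
                          store = {}
                          store[index_key("users", "u42", "profile")] = '{"name": "alice"}'   # value = the whole document

                          # Lookups by table + primary key become prefix scans over the flattened keys.
                          prefix = "users" + SEP + "u42" + SEP
                          print({k: v for k, v in store.items() if k.startswith(prefix)})
                          ```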

                          Disclaimer: I’m not a database person at all, so this might not even be wrong.

                          1. 1

                            Thanks for writing this! I had one question:

                            > Redis-compatible databases built on key-value stores

                            I couldn’t see a reference to Redis compatibility in the ZippyDB Facebook post, and searching online did not bring anything up either. Are there links you can share that make this Redis-compatibility part more explicit?

                            1. 1

                              You know, now I can’t find it. Maybe I misread something!