I’m not an expert, so maybe that’s that, but I found the original paper kinda weird.
Copying my comment (and quote of the original paper) from a previous submission of the original paper:
Despite these apparent success stories, many other DBMSs have tried—and failed—to replace a traditional buffer pool with mmap-based file I/O. In the following, we recount some cautionary tales to illustrate how using mmap in your DBMS can go horribly wrong.
Saying “many other DBMSs have tried—and failed” is worded a little oddly, because just above that they show a list of databases that use or used mmap, and the number still using mmap (MonetDB, LevelDB, LMDB, SQLite, QuestDB, RavenDB, and WiredTiger) is greater than the number they list as having once used mmap and moved off it (Mongo, SingleStore, and InfluxDB). Maybe they just omitted some others that moved off it?
True, they list a few more databases that considered mmap and decided not to implement it (TileDB, Scylla, VictoriaMetrics, etc.). And true, they list RocksDB as a fork of LevelDB made to avoid mmap.
My point is that the paper seems to downplay the number of systems it introduces as still using mmap. And, other than in the introduction where they mention perceived benefits, it doesn’t go much into the potential benefits that, say, SQLite or LMDB see in keeping mmap as an option. Or maybe I missed it.
I’m maybe at the “apprentice-expert” level, and I also had issues with the original paper’s conclusions, although it did make a lot of good points. Real, production DBs do use mmap successfully. (You can add libMDBX to that list, although it’s a fork of LMDB.)
Anyway, this response article is solid gold. Great stuff. I’m not familiar with Voron — it’s interesting that it apparently uses a writeable mapping and lets the kernel write the pages. Other mmap-based storage engines I know of, including my incomplete one, use read-only mappings and explicit write calls. The writeable approach seems dangerous to me:
You don’t know when those pages are actually going to be written. (OTOH, you don’t really know that with a regular write either; and in any case you’re going to make some kind of flush/sync call later when committing the transaction, after which everything is known to be on-disk.)
A writeable memory map is vulnerable to stray writes. So unless your engine is written in a memory-safe language (and so is everything else it links with, and the app itself if this is an embedded db), a memory bug might cause garbage to get written into any resident page. That’s horrible, and nearly impossible to debug if you can’t reproduce it in an instrumented build…
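To make the contrast concrete, here is a minimal C sketch of the two approaches; it is not taken from Voron, LMDB, or any other real engine, and the file name, sizes, and offsets are invented:

```c
/* Minimal sketch contrasting the two approaches discussed above.
 * Hypothetical file name and sizes; error handling kept to a minimum. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DB_SIZE (1 << 20)   /* 1 MiB demo file */

int main(void)
{
    int fd = open("demo.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, DB_SIZE) < 0)
        return 1;

    /* Approach A: read-only mapping, explicit writes.
     * A stray pointer cannot scribble on the file through this mapping;
     * data reaches the file only via pwrite/fsync calls we issue. */
    const char *ro = mmap(NULL, DB_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    if (ro == MAP_FAILED)
        return 1;
    char page[4096];
    memcpy(page, ro, sizeof page);        /* read through the mapping   */
    memset(page, 0x42, 64);               /* modify a private copy      */
    pwrite(fd, page, sizeof page, 0);     /* explicit write-back        */
    fsync(fd);                            /* commit point               */

    /* Approach B: writable mapping, kernel writes the pages back.
     * Any store into this region becomes a pending write to the file,
     * and the page goes to disk whenever writeback decides (or at msync). */
    char *rw = mmap(NULL, DB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (rw == MAP_FAILED)
        return 1;
    memset(rw, 0x42, 64);                 /* dirties the page           */
    msync(rw, 4096, MS_SYNC);             /* force it out now           */

    munmap((void *)ro, DB_SIZE);
    munmap(rw, DB_SIZE);
    close(fd);
    return 0;
}
```

With approach A, the only bytes that can reach the file are the ones explicitly handed to pwrite; with approach B, any store through the mapping, including a stray one, becomes a pending write that the kernel will eventually push to disk.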
I once did writable mmaps with backing store over NFS (meaning slow), and the buffer cache became 100% dirty pages since storage could not even come close to keeping up with some rapid creation of data. At least back in 2009, this led to a rapid Linux kernel crash/panic, essentially from an OOM condition. Even back then, I imagine a sysadmin could have tuned some /proc|/sys stuff to make it somewhat more robust, but I did not have admin rights. They may have hardened Linux defaults/behaviors against such scenarios by now, but my guess would be that “default tuning” remains risky on any number of OSes for mmap writes that the I/O cannot keep up with.
Note, this is a more subtle problem than OOM-killer tuning, etc. Once you dirty a virtual memory page, it becomes the kernel’s responsibility to flush it. So there is no process to kill; you just have to hope the page can be flushed eventually. An NFS server that was fully hung could dash that hope indefinitely. (Not to suggest that running a DBMS on top of any kind of network filesystem is a smart idea; just elaborating on the general topic of “hazards of writable mmaps”.)
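For what it’s worth, here is a minimal C sketch (hypothetical file name and chunk size, POSIX-flavored, not from any real engine) of one way to avoid handing the kernel an unbounded pile of dirty pages: flush the writable mapping in bounded chunks with msync(MS_SYNC), so that slow storage throttles the producer instead of the page cache filling up with pages only the kernel can dispose of:

```c
/* Sketch of bounding dirty data when producing a lot of it through a
 * writable mapping.  Hypothetical file name and sizes; real code would
 * check every return value. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (256L << 20)   /* 256 MiB of data to produce          */
#define CHUNK    (4L << 20)     /* never leave more than ~4 MiB dirty  */

int main(void)
{
    int fd = open("bulk.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, MAP_SIZE) < 0)
        return 1;

    char *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    for (long off = 0; off < MAP_SIZE; off += CHUNK) {
        memset(map + off, 0x5a, CHUNK);   /* rapid creation of data */
        /* Synchronous flush caps the dirty set at roughly CHUNK, so slow
         * storage slows this loop down rather than letting writeback
         * fall arbitrarily far behind. */
        msync(map + off, CHUNK, MS_SYNC);
    }

    munmap(map, MAP_SIZE);
    close(fd);
    return 0;
}
```

None of that helps if the NFS server is fully hung, of course; the msync call just blocks instead of the writeback queue growing.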
Yes, everything I’ve heard about mmap and network filesystems comes down to basically “never do this.”
(Which has implications for using memory-mapped databases as application data formats, since documents are not unlikely to live on network volumes. That’s one reason the toy storage manager I was working on supported either mmap or a manual buffer store.)
Re: your “basically”, one can be pretty sure application data is “small” sometimes, though, like in this little demo of a thesaurus command-line utility that I did (inspired by this conversation, actually): https://github.com/c-blake/nio/blob/main/demo/thes.nim (needs cligen-HEAD to compile at this time..). :-) :-)
At least with that Moby Thesaurus, the space written with mmap is a mere 1.35 MiB, and saving the answer unsurprisingly provides a speed-up of many orders of magnitude. The bigger 10 MiB file can more easily be streamed to storage.
(Yes, yes, it could be another 4-8X faster still, or maybe much more in a worst-case sense, but it would get longer, uglier, and harder to understand, defeating its purpose as demo code, and it would need a second pass and more random writing; not that 10-20 MB scale is a real problem.)
My takeaway: most of the read-side downsides the original paper points out are ultimately acceptable. It’s only the write-side problems with mmap-ing your database storage that are unacceptable.