Heh, reminds me of https://codearcana.com/posts/2015/12/20/using-off-cpu-flame-graphs-on-linux.html (where we discovered that using mmap for file I/O caused our DBMS to contend on the kernel lock protecting our process’s virtual address space).
Update: the paper actually cites that blog post!
To my knowledge, the paper is quite off-base in its discussion of LMDB; it might be correct for System R, but the papers on LMDB describe the page management very clearly and it’s not based on multiple memory-mappings at all. Instead it’s copy-on-write to a free page, and there are two special pages in the file that store the roots of the trees, and are updated alternately. Updated pages are written back using file I/O.
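For anyone who hasn't seen the alternating-root-page trick, here is a rough sketch of the idea in Python. It is not LMDB's actual on-disk format: the 4096-byte page size, the (txn id, root page, CRC32) layout, and the names read_meta, current_meta and commit are all assumptions made up for illustration. It also assumes a Unix-like system, since it uses os.pread and os.pwrite.

```python
import os
import struct
import zlib

PAGE = 4096                    # assumed page size
META = struct.Struct("<QQI")   # hypothetical layout: txn id, root page number, CRC32


def pack_meta(txn, root):
    body = struct.pack("<QQ", txn, root)
    return body + struct.pack("<I", zlib.crc32(body))


def read_meta(fd, slot):
    """Return (txn, root) from meta slot 0 or 1, or None if missing or torn."""
    raw = os.pread(fd, META.size, slot * PAGE)
    if len(raw) < META.size:
        return None
    txn, root, crc = META.unpack(raw)
    if zlib.crc32(raw[:16]) != crc:
        return None
    return txn, root


def current_meta(fd):
    """Pick the valid meta page with the highest txn id; (0, 0) for a new file."""
    metas = [m for m in (read_meta(fd, 0), read_meta(fd, 1)) if m is not None]
    return max(metas) if metas else (0, 0)


def commit(fd, new_root):
    """Publish new_root by overwriting the meta slot NOT used by the last commit."""
    txn, _ = current_meta(fd)
    os.fsync(fd)                     # copied-on-write data pages reach disk first
    slot = (txn + 1) % 2             # alternate between the two meta pages
    os.pwrite(fd, pack_meta(txn + 1, new_root), slot * PAGE)
    os.fsync(fd)                     # only now is the new root durable
    return txn + 1
```

The point of the two slots is that a crash in the middle of a commit can only damage the meta page being written, so the previous root (and the tree it points to, which was never overwritten) is still readable.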
A lot of the criticism of mmap has to do with attempting to use a writeable mapping. I totally agree with that; faulting changed pages back to the file is too uncontrollable, plus there’s the danger of stray writes corrupting the file.
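A small sketch of the read-only-mapping approach, using Python's standard mmap module. The helper names (open_readonly_view, stray_write_is_rejected, write_bytes) are made up for this example; the idea is only that reads go through a read-only map while writes go through ordinary file I/O with an explicit fsync.

```python
import mmap
import os


def open_readonly_view(path):
    """Map the whole (non-empty) file read-only; the map outlives the file object."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def stray_write_is_rejected(view):
    """A write through a read-only map fails instead of silently changing the file."""
    try:
        view[0] = 0
    except TypeError:
        return True
    return False


def write_bytes(path, offset, data):
    """Writes go through regular file I/O, flushed at a point the caller chooses."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
```

In C the equivalent stray store to a PROT_READ mapping would fault rather than raise an exception, but the effect is the same: the file can't be corrupted through the mapping.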
I’m willing to accept mmap isn’t a good idea for a high-scale server DBMS running on big-iron CPUs. But that’s not the only use case for databases. I think mmap is great for smaller use cases, like client-side. It’s faster (LMDB smokes SQLite) and requires less tuning since there’s no read-buffer cache you have to set the size of. Allocating most of your RAM for caches is fine when the entire computer is a dedicated DB server, but it’s a terrible idea on a home PC and even worse on mobile.
The experimental key-value store I was building last year is inspired by LMDB but can use either mmap or a buffer cache for reads. (It always uses buffers for writes.) On my MacBook Pro, mmap is significantly faster, especially if I’m conservative in sizing the read cache.
It would be interesting if the authors repeated their benchmarks on less behemoth-sized systems, like maybe a Core i7 with 16GB RAM. I suspect that the cache-coherency slowdowns wouldn’t be as bad in a CPU with 8 cores instead of 64.
I have also had fairly good experiences with LMDB (compared to the much more feature-rich SQLite).
I wonder whether using mmap for application-level caching will continue to improve over time as OS swap subsystems get better at handling SSD-based swap space.
I had seen that DragonFly BSD made a specific, concentrated effort in this area:
And perhaps systems like LMDB can take advantage of OS-specific tuning sooner rather than later.
Many I/O systems that were tuned (scheduling, etc.) for spinning disks in the past are gradually changing to take advantage of faster non-spinning disks. With that shift, design choices that improve cache-line consistency and reduce context switching get higher priority.
One of the problems I had on a previous project was that MMAP was actually TOO good at syncing data to disk. If you SIGKILLed the process, then the MMAPed pages would get synced, but other files like metadata that were maintained with IO system calls of course would not be. So the MMAPed data would be “ahead” of the metadata files. This caused endless problems. If you do use MMAP, then be careful to use it consistently for everything!
You should not rely on the order of things happening like that. “Using it consistently” would not actually solve the issue.
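One way to make the ordering explicit rather than accidental, sketched in Python. This is an assumption about what a fix could look like, not what the project above did; durable_update and its parameters are hypothetical names. The data pages are forced out with mmap.flush() (which calls msync on Unix) before the metadata that describes them is written, fsynced, and atomically renamed into place.

```python
import mmap
import os


def durable_update(data_path, meta_path, offset, payload, meta_bytes):
    # 1. Change the data through the mapping and force it to disk.
    with open(data_path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            mm[offset:offset + len(payload)] = payload
            mm.flush()  # returns only after the dirty pages are written back

    # 2. Only then write the metadata that refers to the new data.
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "wb") as m:
        m.write(meta_bytes)
        m.flush()              # drain Python's user-space buffer
        os.fsync(m.fileno())   # and push the kernel's copy to disk

    # 3. Atomically replace the old metadata with the new version.
    os.replace(tmp_path, meta_path)
```

With this order, a crash can leave the data ahead of the metadata, but the metadata on disk never refers to data that is not there yet. A real implementation would also fsync the containing directory after the rename, which is left out here to keep the sketch short.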
No, it was not ideal. This was around 2010 and I was working at a company that was paranoid about using open source, and had incredible deadline pressure that made it impossible to ever rework things. Mmap made things a little easier to start, and made things a lot trickier in the long run. If we had used SQLite or another database (which I actually tried to do), it would have been 100x easier.
I am glad they wrote this. I have made this mistake in the past as well. So many recent designs have made a mistake that an experienced DB kernel engineer would have shot down immediately.