Checklist for using mmap on files:
If you meet all the previous points, go ahead and use mmap. Otherwise, trust the kernel and use pread/pwrite. mmap is amazing, but it’s a double-edged nuclear foot-gun.
mmap() is POSIX so any Unix system should support it (for instance, Solaris does). I agree, but would also add:
Note that MAP_ANONYMOUS is not in POSIX. Actually, the spec is quite lengthy, and if you read it carefully, there are a lot of caveats and wiggle room left for implementations to provide only a minimal version.
That’s surprising (but it shouldn’t be; I mean, the mem*() functions from Standard C were only marked async-signal-safe in POSIX a few years ago). On the plus side, it appears that Linux, the BSDs, and Solaris all support MAP_ANONYMOUS.
I meant that the ways things go wrong with mmap are not portable. If you’re building a library or a long-running daemon, this is critical. Other points I forgot in my list:
I’m starting to realize that the mmap man page should document all of these pitfalls :)
At work, we have a near-perfect use case for mmap(): we map a large file shared read-only (several instances of the program share the same data) on the local filesystem (not over NFS) that contains data to be searched through, and it’s updated periodically. That’s the only case I’ve seen mmap() used.
By deleting the underlying file, then moving in the new version. Our program will then pick up the change (I don’t think the system supports file notification events, so it periodically polls the timestamp of the file; this is fine for our use case), mmap() the updated file, and when that succeeds, munmap() the old one.
Unix file semantics to the rescue! Since there’s still a reference to the actual data, the file is still around, but the new copy won’t overwrite the old copy.
Although it’s possible the underlying runtime system uses it for memory allocation.
When working with IO devices whose bandwidth is equal to or greater than the memory bandwidth (on my desktop, I’m capped at 10-12 GB/sec on a single core, or 48-50 GB/sec across all cores), you’re eliding one copy of the data, effectively reaching the upper bound (instead of half of it).
Trying to use version control systems on network drives is an activity well known to be a rich source of sorrow; everybody on the planet who wished to maintain sanity ceased doing that when CVS arrived.
They are trying to use git. It should sniff that the repo is on a network drive and stop right there with a “Don’t Do That”.
Given that they are dueling with git garbage collection, I can’t help but feel that they are trying to do something very weird and outside the supported API of git.
i.e., if you’re fighting the tool… maybe don’t do that.
All this said, I have used mmap many times, hugely successfully.
I got to the part about longjmping out of a signal handler before breaking down crying. No matter what you’re doing, that has to be a warning that you’re doing it wrong.
The post points out later on that this is unsafe, and switches to siglongjmp.
Can’t tell if you’re being facetious or not :)
I’m not being facetious
Just removed “try using mmap to read files” from my project TODO list. Thanks, @notriddle.
FWIW mmap can be incredibly useful for working with files. It effectively transforms a function-based API into a memory-based one, and C has a ton of native support for manipulating memory. You no longer have to manage a separate buffer (or buffers), and you can just assign to a memory address instead of seeking. I’d call it the “scripting language” of file IO: fast, dirty, and simple.
I agree, but due to the complexity of error handling in the cases described by the original article, you need to consider whether mmap is a net simplification or a net complication for the program you are writing.
In my case, I only need read-only access to the files I am processing. I get the same simplicity of coding regardless of whether I mmap the files, or allocate a buffer and read the entire file into the buffer. The extra complexity of mmap error handling isn’t worth it. If I was modifying these files, the situation might be different. Right now my files are small. If I start working with huge files in the future, I might need to reconsider the use of mmap.
Yeah, I’ve found the best use cases are with read-only files (where the errors don’t matter) and with files you create that have a fixed size (where you can just start over/fail noisily on error).
We are in the early stages of using LMDB as a file cache, actually.
It is a well-optimized key-value store with library bindings for many languages.
I had used it in another project, where it resided on a remote SAN drive; the write performance was noticeably slower than on a local drive, but there were no corruption issues.
Myself, I would be reasonably comfortable using mmap for reading/verifying/computing some aggregate values: basically a read-only use case.
But if writing through mmap, I would look for a library that does it well (and in a portable way), and the use case would need to have really well-understood performance advantages.
The combination of remote filesystems and mmap is just a minefield. Thumbnail generation in RawTherapee over NFS is very “fun”. For people on Linux it appears as just very long pauses; for me on FreeBSD it appears as full CPU usage on all cores :D