1. 25
  1. 20

    Mimalloc and snmalloc were written at about the same time by different MSR teams, to solve very different problems, but independently came up with quite similar solutions. The main use case for mimalloc was Lean, which is a language that creates a very large number of small objects. The main use case for snmalloc was Verona, a language that (once we finally write a working compiler) will do a lot of allocation and deallocation on different cores (any producer-consumer workload where the producer allocates buffers and the consumer deallocates is a pretty pathological case for most thread-caching allocators). Both ended up with a message-passing design.

    The main difference is the recipient of messages. In snmalloc, there’s an allocator that owns chunks and is the only thing that is the target for messages. In mimalloc, the messages are sent directly to the chunk. This means that mimalloc can have less contention on its message queues (unless you have two threads freeing to the same chunk or one thread allocating from a chunk while another is freeing, your message sends are uncontended). In contrast, if two threads are freeing memory allocated by the same thread, you will get some contention on the message queues in snmalloc. This is amortised in snmalloc by the fact that we batch things before sending, so you don’t typically see this contention in practice, and by the fact that our message queues don’t provide any kind of causal ordering guarantees and so can be implemented with a single atomic exchange (so have strong forward-progress guarantees).
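
    For a sense of why that queue is so cheap, here’s roughly the shape of the technique (a Vyukov-style intrusive MPSC queue); it’s only a sketch of the idea, not snmalloc’s actual code:

    ```cpp
    #include <atomic>

    // Rough sketch (not snmalloc's code): a wait-free multi-producer,
    // single-consumer queue where enqueue is one atomic exchange.
    // The queue always holds at least one node; the node at the back is only
    // handed back out after a later message arrives, which is what lets
    // producers avoid compare-and-swap retry loops entirely.
    struct Node {
      std::atomic<Node*> next{nullptr};
      // payload lives here; for an allocator it is simply the freed object
    };

    class MpscQueue {
      std::atomic<Node*> back;  // most recently enqueued node
      Node* front;              // oldest node still in the queue
    public:
      // Seed with a sacrificial node (a dummy allocation in practice).
      explicit MpscQueue(Node* seed) : back(seed), front(seed) {}

      // Any thread may call this. Wait-free: one exchange, no retries.
      void enqueue(Node* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        Node* prev = back.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
      }

      // Owning thread only. Returns a fully detached node, or nullptr.
      // The returned node is never the current back, so no producer will
      // ever touch it again.
      Node* dequeue() {
        Node* next = front->next.load(std::memory_order_acquire);
        if (next == nullptr)
          return nullptr;
        Node* taken = front;
        front = next;
        return taken;
      }
    };
    ```

    Because the producer side is one exchange with no retry loop it’s wait-free, and a batch that has already been linked together locally can be appended with the same single exchange (swap in the batch’s last node, point the old back at its first).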

    Daan was incredibly helpful when he learned about snmalloc and walked us through a whole load of interesting microoptimisations that mimalloc was doing. We’ve incorporated some of those ideas. My favourite one was a mechanism to move a conditional branch off the fast path. When you create a new thread, its TLS contains a pointer to an allocator. We can statically initialise that with a pointer to a global allocator. The global allocator doesn’t own any memory and so you can move the ‘is the TLS pointer initialised to a real allocator’ check to the slow path that you fall into when you can’t allocate any memory (which you always hit in the uninitialised case). In snmalloc this is done with a template function that takes the slow path as a lambda, which makes the lazy-init slow paths very clean (and generates very fast code).
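
    As a concrete illustration of the trick (with invented names, not snmalloc’s real interfaces): the thread-local pointer starts out pointing at a global placeholder allocator that owns nothing, so the fast path never asks ‘am I initialised?’; the placeholder always falls into the slow path, and the check lives there.

    ```cpp
    #include <cstddef>
    #include <cstdlib>

    // Illustrative sketch only; names and structure are invented.
    struct Allocator {
      void* free_list = nullptr;  // the global placeholder never has anything here

      // Fast path: no initialisation check. An uninitialised thread is using
      // the placeholder, whose free list is always empty, so it falls through
      // to the slow path, which is where the check lives.
      template <typename SlowPath>
      void* alloc(size_t size, SlowPath&& slow_path) {
        if (free_list != nullptr) {
          void* p = free_list;
          free_list = *static_cast<void**>(p);  // pop the next free object
          return p;
        }
        return slow_path(size);
      }
    };

    Allocator global_placeholder;                          // statically initialisable
    thread_local Allocator* my_allocator = &global_placeholder;

    void* my_malloc(size_t size) {
      return my_allocator->alloc(size, [](size_t sz) -> void* {
        if (my_allocator == &global_placeholder) {
          // Lazy initialisation, moved off the fast path: build a real
          // allocator and install it in TLS. (A real malloc replacement would
          // bootstrap this without calling operator new.)
          my_allocator = new Allocator();
        }
        // A real allocator would refill a free list / grab a slab here and
        // retry; this toy just falls back to the system allocator.
        return std::malloc(sz);
      });
    }
    ```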

    1. 3

      I love this approach to basically using the Null Object pattern to optimize out null checks! I’d never thought about it that way before and I’ll definitely be using this trick in my designs!

      1. 2

        Any plans for either mimalloc or snmalloc to support Linux rseq, or is that too nonportable? I feel like per-CPU caches will give allocators pretty much all the benefits of thread-local caches without their drawbacks.

        1. 2

          I don’t think it would help us. You need local caches if your design revolves around having a central allocator and reducing lock contention by providing thread- or CPU-local caches. This is typically where you end up if you wrote your allocator for single-core machines and then adapted it for multithreading. In contrast, snmalloc is an actor-model design. Each thread owns a local allocator, which owns local slabs for allocating individual size classes of object. If you’re allocating and freeing on a single thread, there are no atomic operations. If you allocate on one thread and free on another then the allocations are batched in the freeing thread until a threshold is reached and then sent to the allocating thread with a single atomic exchange. There’s no need for restartable sequences because there’s no read-modify-write, and the algorithm is already wait-free, so forward progress is guaranteed.

          The thing that I would like to see adopted more widely is macOS’s MADV_FREE_REUSABLE / MADV_FREE_REUSE. This is better for allocators than MADV_FREE (which is widely supported). In the MADV_FREE case, pages still count towards your RSS accounting even though you’re not using them because there’s no explicit signal that the page is reclaimed (if the kernel takes it away, then it should go from your RSS usage, but on the fast path where you write to the page, the kernel doesn’t see this).

          1. 1

            Yes, I can see why rseq isn’t very compelling for snmalloc, but I thought it might be more relevant to mimalloc. Isn’t it better for # slabs/arenas to scale with cores, not threads? (I know this is the original jemalloc design, but they ended up with tcaches anyway since arena access still requires locking.) With rseq, a thread almost always has atomic-free access to its CPU-local slab/arena (unless it gets preempted), so this seems like the best of both worlds (i.e. per-CPU slabs/arenas and tcaches).

            Your comment on MADV_FREE is interesting because I just gave up on it a few days ago and went back to MADV_DONTNEED for exactly the reasons you mentioned (the golang runtime had a similar experience). It’s pretty useless if I know that unused pages can be reclaimed at any time, but no memory measurement tools can see that. I don’t know how to fix this problem with the Linux approach, though (as you said, how do you reinstate the RSS charge when you madvise(MADV_FREE) a page and then touch it again before the kernel reclaims it?).

            1. 2

              Yes, I can see why rseq isn’t very compelling for snmalloc, but I thought it might be more relevant to mimalloc.

              Possibly, though I’m not sure it would help; I think they use the same wait-free queue as snmalloc.

              Isn’t it better for # slabs/arenas to scale with cores, not threads? (I know this is the original jemalloc design, but they ended up with tcaches anyway since arena access still requires locking.)

              It might be better for space overhead, but for performance it’s hard to beat uncontended operations with no atomics. For space overhead, sizeclass allocators don’t generally suffer from fragmentation (unless you have really contrived allocation patterns), so the per-thread space overhead is pretty low. Most workloads that actually care about performance end up with one compute thread per core, possibly plus a few threads that spend all of their time blocked waiting for I/O, so the difference between per-thread and per-CPU isn’t that big in the case you want to optimise for, except that per-CPU is more expensive.

              With rseq, a thread almost always has atomic-free access to its CPU-local slab/arena (unless it gets preempted), so this seems like the best of both worlds (i.e. per-CPU slabs/arenas and tcaches).

              I really want to like rseq but I’ve never found a case where it actually helped. It’s a poor man’s transactional memory and it requires you to structure your code such that you have a set of reads followed by a write that can fail, where you’re then notified of the failure. The problem is that this is exactly the structure that load-linked / store-conditional give you on load-store architectures and that you can fake with a 16-byte compare-and-exchange on x86 (compare-and-swap has an intrinsic ABA problem, so you fudge it by using a 64-bit free-running counter of updates to cause the compare to fail in the ABA case). In theory you can do multiple writes in an rseq but if you do then other code has to be able to observe and handle partial updates from you, which probably costs more. Anything I’ve been able to shoehorn into the rseq structure, I could write with CPU primitives more efficiently.
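
              To make the counter idiom concrete, here’s the usual shape of it (a sketch, not snmalloc code): pack a pointer and a free-running update counter into 16 bytes so the compare fails whenever any other update happened in between, even if the pointer has cycled back to the same value.

              ```cpp
              #include <atomic>
              #include <cstdint>

              // Sketch of the 'pointer plus free-running counter' idiom. On x86-64 the
              // 16-byte compare-exchange compiles to cmpxchg16b (usually needs -mcx16).
              // Assumes nodes are never unmapped (type-stable memory), as in a free list.
              struct Node { Node* next; };

              struct alignas(16) VersionedHead {
                Node*    ptr;
                uint64_t version;  // bumped on every successful update
              };

              std::atomic<VersionedHead> head{VersionedHead{nullptr, 0}};

              void push(Node* n) {
                VersionedHead old = head.load(std::memory_order_relaxed);
                for (;;) {
                  n->next = old.ptr;
                  VersionedHead wanted{n, old.version + 1};
                  if (head.compare_exchange_weak(old, wanted,
                                                 std::memory_order_release,
                                                 std::memory_order_relaxed))
                    return;
                }
              }

              Node* pop() {
                VersionedHead old = head.load(std::memory_order_acquire);
                while (old.ptr != nullptr) {
                  // Without the version, another thread could pop ptr, reuse it, and
                  // push it back: the pointer compare alone would wrongly succeed (ABA).
                  VersionedHead wanted{old.ptr->next, old.version + 1};
                  if (head.compare_exchange_weak(old, wanted,
                                                 std::memory_order_acquire,
                                                 std::memory_order_acquire))
                    return old.ptr;
                }
                return nullptr;
              }
              ```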

              Uncontended atomics are really cheap in modern hardware. If you already have the cache line on your core, they’re on the order of 3-5 times more expensive than a non-atomic operation, so it’s very easy for any alternative approach to add more overhead.

              I don’t know how to fix this problem with the Linux approach, though (as you said, how do you reinstate the RSS charge when you madvise(MADV_FREE) a page and then touch it again before the kernel reclaims it?).

              The macOS mechanism is really nice. You say that the pages are not in use and that just sets a flag in the VM data structure (and drops the pages from your RSS count). If the kernel encounters memory pressure then it acquires the lock on those pages and replaces them with a CoW zero page. You then have a converse operation that clears the bit (and increases the RSS count).

              This is cheaper than MADV_FREE because it doesn’t need to track dirty bits and guarantee ordering between writes to the dirty bit and attempts to reclaim the page. You need to update the VM map if you reclaim the page, but that’s on the slow path, and the MADV_FREE_REUSE call acquires the lock on the VM page, so it’s easy to serialise with respect to attempts to reclaim the page.
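
              In allocator terms the protocol looks roughly like this (a sketch with the Darwin-specific calls guarded and error handling omitted):

              ```cpp
              #include <sys/mman.h>
              #include <cstddef>

              // Sketch of the reusable/reuse protocol described above. MADV_FREE_REUSABLE
              // and MADV_FREE_REUSE are Darwin-only; elsewhere this degrades to MADV_FREE,
              // where the RSS accounting problem described above remains.
              void pages_not_in_use(void* p, size_t len) {
              #if defined(__APPLE__)
                madvise(p, len, MADV_FREE_REUSABLE);  // drops from RSS; kernel may reclaim
              #else
                madvise(p, len, MADV_FREE);           // pages stay charged to RSS
              #endif
              }

              void pages_about_to_be_reused(void* p, size_t len) {
              #if defined(__APPLE__)
                madvise(p, len, MADV_FREE_REUSE);     // clears the flag; charged to RSS again
              #else
                (void)p; (void)len;                   // nothing to do: touching the page is enough
              #endif
              }
              ```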

        2. 1

          My favourite one was a mechanism to move a conditional branch off the fast path. When you create a new thread, its TLS contains a pointer to an allocator. We can statically initialise that with a pointer to a global allocator. The global allocator doesn’t own any memory and so you can move the ‘is the TLS pointer initialised to a real allocator’ check to the slow path that you fall into when you can’t allocate any memory (which you always hit in the uninitialised case).

          No thread constructor or fault handler?

          1. 3

            I’m not sure I understand the question.

            1. 1

              You would like to run code on a newly created thread to initialise a thread-local allocator object. This is in the context of a specific problem you want to solve/speed up, so you have control of the environment. It seems to me that there are two fairly straightforward solutions which do not require a branch in the fast or slow paths:

              • you have control of when threads are created, so you simply run such initialisation code whenever you create the thread

              • leave the allocator object pointer initialised with NULL and never explicitly check if it is valid. Register a SIGSEGV handler which will detect such a situation. When it happens, initialise the allocator object and recover

              1. 6

                You would like to run code on a newly created thread to initialise a thread-local allocator object. This is in the context of a specific problem you want to solve/speed up, so you have control of the environment.

                In general, in C, we don’t have control over the environment. Threads are created by something in the C runtime. Thread-local storage is initialised by copying from the .tdata segment, so we can ask the threading library to fill in a thread-local variable with the address of a global.

                In such a context, you have control of when threads are created, so can you not simply run such initialisation code whenever you create the thread?

                There aren’t generic hooks for doing this. If we were integrating snmalloc with a threading library directly then we could, but we need to support the case where snmalloc is linked with your application and replaces the system malloc (or coexists with it, with the snmalloc functions prefixed with something). If we’re in an environment such as Verona where we control thread creation then we can pre-allocate the allocators, but that’s not a general solution for a memory allocator. We do provide APIs that allow consumers to do this if they want to.

                On FreeBSD, there is a specific hook for a malloc implementation to use to be notified of thread teardown, which is invoked after other destructors for thread-local cleanup. We use this when we are being used as malloc. A generic init-malloc hook would be useful but I don’t know of any platforms that provide one.
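
                (For a flavour of what such a hook looks like: the portable approximation a lot of allocators fall back on is a pthread_key destructor, sketched below. This is not the FreeBSD hook and not necessarily what snmalloc does; it also runs at an unspecified point relative to other thread-local destructors, which is exactly why a dedicated hook that runs last is nicer.)

                ```cpp
                #include <pthread.h>

                // Portable approximation of a thread-teardown hook for an allocator
                // (illustrative only; not the FreeBSD mechanism described above).
                static pthread_key_t teardown_key;

                static void on_thread_exit(void* allocator) {
                  // Flush the thread-local allocator's state / return its memory here.
                  (void)allocator;
                }

                static void init_teardown_key() {
                  pthread_key_create(&teardown_key, on_thread_exit);
                }

                // Called once a thread's allocator has been lazily created.
                void register_teardown(void* allocator) {
                  static pthread_once_t once = PTHREAD_ONCE_INIT;
                  pthread_once(&once, init_teardown_key);
                  pthread_setspecific(teardown_key, allocator);
                }
                ```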

                Alternately, can you not leave the allocator object pointer initialised with NULL and never explicitly check if it is valid, and register a SIGSEGV handler which will detect such a situation and, when it happens, initialise the allocator object and recover?

                Signal handlers are global. If we register a SIGSEGV handler then we’re broken by any other code that registers one. It’s also a bootstrapping problem. We need malloc to work quite early in libc initialisation, before any library-specific global constructors run (so we can’t register our handler in a global constructor), and even if we could, we might still suffer from the fact that libc may allocate memory to install signal handlers (I don’t know if any do, but the standard doesn’t prohibit it), so we’d hit the SIGSEGV before our handler was installed.

                We’d also need both platform- and architecture-specific code to handle this because we’d need to pull apart the mcontext structure to find and fix up the register that produced the faulting address. Given that our support matrix includes 10 operating systems on five architectures (many of which have 2-3 sub-architecture variants), that’s not something that sounds maintainable. Even on x86-64 Linux, the way of getting a specific register from the mcontext differs between musl and glibc, so I think we’d need around 50 different implementations of this code. Just installing the pointer in the TLS variable isn’t going to work because the compiler will load the variable into a register and then fault loading from that register.

                1. 1

                  In general, in C, we don’t have control over the environment

                  There aren’t generic hooks for doing this

                  But this isn’t ‘in general’; it’s trying to solve a specific problem for a specific application (or at least, that’s the way you’ve presented it). And if you did want to package it up, ‘you must run this function on threads you create’ is a completely reasonable ask from an allocator library.

                  need malloc to work quite early in libc initialisation

                  Not if it’s a random userspace malloc implementation. If it is the libc malloc, then you can very easily conspire with the thread creation or signal handling paths.

                  platform- and architecture-specific code

                  Fair enough. Though, on ‘maintainable’: HotSpot is more portable than you, and does perform such tricks; but it is completely fair to not want the burden of such code.

                  1. 7

                    But this isn’t ‘in general’; it’s trying to solve a specific problem for a specific application (or at least, that’s the way you’ve presented it). And if you did want to package it up, ‘you must run this function on threads you create’ is a completely reasonable ask from an allocator library.

                    I’m not sure I understand. We are shipping a malloc implementation that you can statically or dynamically link on 10 platforms on 5 architectures.

                    Not if it’s a random userspace malloc implementation. If it is the libc malloc, then you can very easily conspire with the thread creation or signal handling paths.

                    If it is a malloc implementation, then libc needs to be able to use it very early but does not know how it works. If it is a non-malloc memory allocator then libc does not depend on it or care that it exists. We are providing both. If you build the shared library, you can either link to it before libc or LD_PRELOAD it and you’ll use our malloc with your system libc. If you statically link, the same is true.

                    If you include our headers then you can define a customised memory allocator for your use (this is what we do for Verona), either with a malloc-, C++ new-, or Rust-allocator-compatible interface if you build the right files in the override directory, or with your own interface to the underlying allocators.

                    You can also integrate it with libc (my FreeBSD dev box uses it instead of jemalloc), but we can’t make using a different libc a requirement for folks that just want a different malloc, and given that we work with at least 10 different libc implementations, it would be very difficult for us to support that even if it were a sensible constraint.

                    Fair enough. Though, on ‘maintainable’: HotSpot is more portable than you, and does perform such tricks; but it is completely fair to not want the burden of such code.

                    Hotspot is a few orders of magnitude more code and is not generally used as a library and so is able to take control over the entire process, which means that it can do these tricks. The Android runtime does as well and explicitly prevents NDK code from installing signal handlers to make this possible. A malloc implementation can’t just say ‘sorry, these bits of the C spec aren’t for you’. It’s not really a meaningful comparison. There are a lot of things a program can do that a library can’t.

                    Oh, and I’m not sure if I’d be willing to claim that HotSpot is more portable. It doesn’t yet run on any CHERI architectures, for example, though I’m looking forward to someone doing the OpenJDK Morello port next year. We don’t (yet) support s390 or AIX, but that’s because no one has cared (patches welcome!). I believe AIX would work with the POSIX PAL and someone familiar with s390 should be able to write an AAL in a couple of minutes.

          2. 1

            Verona, a language that (once we finally write a working compiler) will do a lot of allocation and deallocation on different cores (any producer-consumer workload where the producer allocates buffers and the consumer deallocates)

            Seems like an alternate approach would be appropriate. E.g.: the consumer doesn’t free the message; instead it sets a flag on it. When the producer needs to allocate a new message, it looks through everything it sent out until it finds something with the flag set. (Shuffle everything without the flag to the back to avoid looking at the same messages repeatedly. Small messages should just be copied or traced so no problem w/disproportionate work.) An OOM condition can look for threads that are currently blocked, make sure they don’t wake up unexpectedly, and steal messages from them.

            Malloc is a garbage API.

            1. 15

              Seems like you’re falling into the common-in-engineers logical fallacy of believing that problems you haven’t worked on are shallow. You haven’t shared your background with us, but you’re arguing with someone who works on a pretty deep problem. You’re presenting your position as off-the-cuff thoughts (no references, or mentions of your prior work), but assuming you’ve thought of things David hasn’t in months/years of work.

              That comes off as quite arrogant, and I know from experience that it can be super aggravating to the person on the receiving end, as there’s a (usually unintentional) implication that they must be stupid for not having thought of these things immediately like you did. Thus this can easily turn into a flame war.

              This happens a lot in any tech forum, and I still struggle with it myself.

              1. 5

                OT: This is pretty much how epoll came about. Linus sketched a generic polling interface off the top of his head, in a throwaway post on LKML, and someone (Davide Libenzi) just implemented that more or less verbatim, apparently without even trying to consult prior work (/dev/poll, kqueue). If you’re wondering how such a broken design could have ever made it into the kernel, well, Linux doesn’t do design reviews, on principle. Patches or GTFO!

              2. 7

                Your proposed solution is O(n) in terms of in-flight messages and has terrible cache interaction (scanning the flags will pull a cache line back to the producer’s core, potentially in the middle of the consumer using it, which can incur a 300 cycle penalty), whereas snmalloc is constant time and plays very nicely with the cache coherency policy. A lot of designs do work around this limitation in mainstream allocators. The most common is to provide a producer-consumer ring buffer where you pull the responses from the consumer and reuse buffers. This doesn’t let you dynamically scale your workload though and forces you to combine your back-pressure and memory-allocation mechanisms.
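
                As a sketch of that ring-buffer workaround (invented names, single producer and single consumer assumed): the consumer hands finished buffers back through a fixed-capacity ring and the producer reuses them. Note how the ring’s capacity is simultaneously the back-pressure limit and the memory budget, which is exactly the coupling I mean.

                ```cpp
                #include <array>
                #include <atomic>
                #include <cstddef>

                // Sketch of a fixed-capacity SPSC ring used to recycle buffers from the
                // consumer back to the producer (an illustration of the workaround, not a
                // recommendation).
                template <typename T, size_t N>
                class RecycleRing {
                  static_assert((N & (N - 1)) == 0, "N must be a power of two");
                  std::array<T*, N> slots{};
                  std::atomic<size_t> put{0};   // written only by the consumer (returns buffers)
                  std::atomic<size_t> take{0};  // written only by the producer (reuses buffers)
                public:
                  // Consumer: hand a finished buffer back. Fails if the ring is full.
                  bool give_back(T* buf) {
                    size_t p = put.load(std::memory_order_relaxed);
                    if (p - take.load(std::memory_order_acquire) == N)
                      return false;
                    slots[p % N] = buf;
                    put.store(p + 1, std::memory_order_release);
                    return true;
                  }

                  // Producer: grab a returned buffer, or nullptr if none has come back
                  // yet, in which case it must allocate fresh or stall -- allocation and
                  // back-pressure are tied together.
                  T* reuse() {
                    size_t t = take.load(std::memory_order_relaxed);
                    if (put.load(std::memory_order_acquire) == t)
                      return nullptr;
                    T* buf = slots[t % N];
                    take.store(t + 1, std::memory_order_release);
                    return buf;
                  }
                };
                ```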

                In the best case, you are going to end up building a bespoke mechanism for returning memory to the producer that performs as well as the mechanism in snmalloc. I don’t see an advantage in making every consumer of an allocator solve a problem that we can solve generically.

                Oh, and snmalloc’s approach also benefits programs that do a lot of thread-local allocation because all of our allocator data structures aside from the message queue are entirely thread local and so we don’t need any atomic operations on the fast path. You can see from our paper how much we outperformed other allocators two years ago, and the implementation is now faster than it was when the paper was written.

                1. 2

                  When I was designing an allocator I was really intrigued by snmalloc’s message-passing design, but after looking at benchmarks vs. mimalloc it didn’t seem to have a decisive advantage over simple lock-free remote deallocation, so I concluded the complexity probably wasn’t worth it. Maybe the representative (or motivating) workloads are just different? Message-passing does seem more scalable than direct remote deallocation, so maybe the benchmarks I saw just didn’t have enough concurrency.

                  1. 3

                    I think we’re beating mimalloc in benchmarks now (the snmalloc2 branch is where all of the work is happening and I’m not sure if anyone has run the benchmarks again recently), though it’s a bit workload-dependent. Both snmalloc and mimalloc have a message-passing design; the key difference is the target of the messages. In mimalloc, messages are sent directly to the chunk that owns the allocation (and message queues are combined with free lists); in snmalloc they’re sent to the allocator that did the allocation.

                    There are a few advantages to snmalloc’s design that aren’t really to do with performance. The big one for me is that it provides a single funnel for messages when you want to allocate across trust domains. One of the use cases for snmalloc is being able to allocate within a sandbox, from outside. There’s a proof of concept here. You can create a sandbox with a per-sandbox instance of snmalloc’s Allocator class used from outside to allocate memory inside the sandbox, while code inside the sandbox allocates from its own snmalloc instance via the normal malloc interface. You can call free from either side on memory allocated by either, and there’s a single place where the outside allocator needs to do security checks. In a CHERI context, where we can create sandboxes just by reserving a bit of virtual address space, this is a really nice property and it’s going to be useful with SFI use cases as well (for example, WASI sandboxes).

                    We’ve had a report from one Rust codebase switching to snmalloc and getting almost a doubling in overall transactions-per-second counts. I’m not sure what their baseline allocator was though. They’re highly parallel with a lot of work stealing and balancing, which is exactly the model we have for Verona, so I expect that the advantage will be similar.

                    In the absence of (dynamic) concurrency, snmalloc’s design has the advantage that it doesn’t need any atomic operations on the fast path. The chunks are always exclusively owned by a single allocator, which is owned by a single thread. This means that getting things from a free list or putting them back is a purely core-local operation.

                  2. 1

                    O(n) in terms of in-flight messages

                    Only if the producing thread increases its sending rate without bound, which obviously cannot happen. Generally the rate of message production will be similar to the rate of message consumption (duh); so given a modestly overallocated heap, you will almost always hit a message which is already free.

                    cache

                    Again, you will usually hit a message that another thread is already done with, first thing or very close to it. And hardware permitting (so no x86), you can even use relaxed orderings which will incur no penalty.

                    1. 3

                      Only if the producing thread increases its sending rate without bound, which obviously cannot happen. Generally the rate of message production will be similar to the rate of message consumption (duh); so given a modestly overallocated heap, you will almost always hit a message which is already free.

                      That depends on the order in which you visit messages and the rate of progress. If you have a single consumer that processes messages in order (snmalloc’s model supports multi-producer, multi-consumer, so you can have arbitrary fan-out and fan-in, which is very important for building scalable systems) then you only ever need to visit the first message because if it isn’t finished then none of the later ones will be.

                      If you send one message and then immediately check whether it’s done, you’re going to be pulling the cache line back into your core’s L1 in shared state, which will then have to be invalidated before the consumer can move it to exclusive state to modify it. If the only write to the message is setting the flag, that probably doesn’t matter too much to the consumer because that store can just sit in a store buffer while the core gets on with useful work, but if it’s doing any in-place modification then now you’ve got some painful false sharing.

                      As you have more in-flight messages, you may get into steady state where the oldest message is usually done by the time that the producer is done.

                      Again, you will usually hit a message that another thread is already done with, first thing.

                      In that case, it will be in exclusive state on the other core’s cache, so you will pay up to around a 300 cycle penalty to bring it back. This will cause a pipeline bubble because all of your useful work depends on the result of that load. Oh, and now you have a data-dependent conditional branch on your fast path, so your branch predictor will hate you.

                      All of this gets more exciting once you have even modest fan-out (e.g. producer dispatching work to 8 workers) because now you don’t have any guaranteed order of completion, only guaranteed order of starting, so you have to scan more things in the list to find a completed one and all of these problems are multiplied.

                      Or you can just use snmalloc and see very high transaction throughput rates without having to build a bespoke solution. From the perf reports that we’re getting from downstream consumers with scalable transaction-processing workloads, I’d be pretty confident that this would be faster.

                      And hardware permitting (so no x86), you can even use relaxed orderings which will incur no penalty.

                      You can’t use a relaxed ordering because you need to establish a happens-before relationship with the thread that is setting the flag. A relaxed ordering does not do this; it guarantees only that you will see either the ‘before’ or ‘after’ value of the variable (so if you initialised the variable to 0, then that value can be store-forwarded to your load without checking if another thread modified it).
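
                      In code, the difference is just which ordering you put on the flag (a minimal sketch, not snmalloc code):

                      ```cpp
                      #include <atomic>

                      struct Message {
                        int payload;
                        std::atomic<bool> done{false};
                      };

                      // Consumer, when it has finished with the message:
                      void finish(Message& m) {
                        // ... last use of m.payload ...
                        m.done.store(true, std::memory_order_release);  // publishes all earlier writes
                      }

                      // Producer, probing whether it can reuse the message:
                      bool can_reuse(Message& m) {
                        // Pairs with the release store above to create a happens-before edge.
                        // With memory_order_relaxed there is no such edge: the flag might read
                        // as true while the consumer's other writes are not yet guaranteed to
                        // be visible.
                        return m.done.load(std::memory_order_acquire);
                      }
                      ```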

              3. 3

                People might be missing some context that reviewing the June 2019 research article for this allocator is particularly timely because in October 2021, this allocator was used as a replacement for “pymalloc” in a proposed changeset written by Sam Gross against the CPython interpreter which removes the GIL (Global Interpreter Lock), potentially enabling multi-core parallelism for many workloads for Python code.

                This is described briefly on LWN here. I also x-posted a nice breakdown of this CPython proposal from CPython Developer-in-Residence Łukasz Langa, which can be found here on lobste.rs.

                1. 1

                  Both mimalloc and snmalloc are much simpler than tcmalloc or jemalloc, primarily (AFAICT) because they don’t use thread-local freelists, which seems to be the primary source of complexity in other high-performance allocators. I feel like “you need thread-local freelists for performance” is perhaps the biggest contemporary myth in allocator design.

                  1. 2

                    I feel like “you need thread-local freelists for performance” is perhaps the biggest contemporary myth in allocator design.

                    I don’t think that’s quite an accurate characterisation. Snmalloc does have thread-local freelists. There’s a philosophical difference between something like jemalloc and snmalloc. Jemalloc is a single-threaded global allocator that then does local caching to try to reduce contention. Snmalloc is a single-threaded local allocator that uses an actor-model design to allow multiple instantiation. Every thread owns an snmalloc allocator object. This object owns a load of chunks, which it uses to perform allocation. Each chunk is used to allocate slabs, which are fixed size allocations. On the slow path, we construct a freelist of a particular size after creating a slab and then that freelist is exclusively accessed by a single allocator. If you free from another thread, the allocation is batched and then sent as a message to the owning allocator (if you have more than 64 allocators, it may be sent indirectly, which adds some latency to memory recycling but improves throughput), which pops it into the local freelist.

                    The hypothesis for snmalloc was that designing an actor-model memory allocator with a fast shared-memory message-passing scheme would give better performance than trying to do lock-based concurrency. So far, this hypothesis has been supported by the evidence.

                    1. 1

                      Yeah, I wasn’t clear in my terminology, sorry. I was referring to the common design where the consumer can directly allocate from a cache of objects freed by that thread but allocated by other threads. That’s what I think is commonly referred to as a “tcache” in tcmalloc and (later) jemalloc. This of course leads to false sharing and requires complex GC mechanisms to eventually return cached objects to their allocating heap or to a global heap when the cache gets too large. The key difference with modern allocators like snmalloc and mimalloc is that they never allocate from memory originally allocated by another thread, right?

                      1. 2

                        The key difference with modern allocators like snmalloc and mimalloc is that they never allocate from memory originally allocated by another thread, right?

                        That’s more or less what I said, but I oversimplified very slightly. It’s true for all small and medium allocations but for large allocations we allocate entire pages with mmap or a platform equivalent. When you free one of these, it’s cached locally in the freeing allocator and then returned to a central pool (snmalloc never returns address space, only physical memory), where it can then be used to satisfy later large allocations or can be split up to provide new chunks.

                        Aside from that corner case, I agree that this is the correct characterisation.