1. 85

Hi Lobsters,

I know people here may be tired of seeing too many Zig stories, but I think this is a legitimately cool discovery that transcends programming language turf wars.

This is a memory allocator that outcompetes glibc malloc with an incredibly simple implementation, one that can be read and understood in a matter of minutes, even if you have never used the programming language it is written in, and even if this is your first time considering a memory allocator implementation.

I am the author of this code, which is fairly new and not yet battle-tested. Maybe I’m missing something crucial, or maybe it’s going to become a specimen for computer science students to learn about memory allocation from. Either way, I hope you will find it interesting and on-topic here.

I added the “zig” tag out of courtesy, but it’s really not the point of the story; the point is that it’s possible to implement a good memory allocator that is also conceptually simple.

-Andrew

    1. 27

      A standard “worst case” for allocators is producer-consumer: in a loop, one thread allocates, sends the pointer to the other thread, which then deallocates. I think for this scenario the presented allocator will give infinite growth of memory usage?

      If I am reading the code correctly (which is not given!), in the producer-consumer scenario, the producer will fill its slab, while the consumer will fill its free list. Then, the producer will rotate to the next thread, and fill its slab. Eventually, the producer will “steal” the consumer’s thread (once both have the same thread index, they’ll contend for the mutex and one will be bumped to the next id), and re-use all entries from the free list.

      But then this repeats! Once the stolen free list is exhausted, we allocate one more slab, and do a merry-go-round again!
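
      For concreteness, a minimal sketch of that scenario (assuming Zig 0.14’s std.heap.smp_allocator; the mutex-guarded list standing in for a channel, and all names, are illustrative):

        const std = @import("std");

        // Toy reproduction: one thread allocates, the other frees, with a
        // mutex-guarded list standing in for a real channel.
        const iterations = 1_000_000;

        var mutex: std.Thread.Mutex = .{};
        var inbox: std.ArrayListUnmanaged(*u64) = .empty;

        fn producer(gpa: std.mem.Allocator) !void {
            for (0..iterations) |_| {
                const p = try gpa.create(u64);
                p.* = 42; // touch the memory
                mutex.lock();
                defer mutex.unlock();
                try inbox.append(gpa, p);
            }
        }

        fn consumer(gpa: std.mem.Allocator) void {
            var freed: usize = 0;
            while (freed < iterations) { // busy-waits; fine for a sketch
                var popped: ?*u64 = null;
                mutex.lock();
                if (inbox.items.len > 0) popped = inbox.swapRemove(inbox.items.len - 1);
                mutex.unlock();
                if (popped) |p| {
                    gpa.destroy(p); // freed on a different thread than it was allocated
                    freed += 1;
                }
            }
        }

        pub fn main() !void {
            const gpa = std.heap.smp_allocator;
            const thread = try std.Thread.spawn(.{}, consumer, .{gpa});
            try producer(gpa);
            thread.join();
            inbox.deinit(gpa);
        }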

      1. 20

        Yup, that’s the case that snmalloc and mimalloc tune for because improving it can double the throughput in a lot of transaction-processing workloads. Thread caching approaches are the worst here because the allocating thread always misses in its cache (no freed objects) and the freeing thread ends up having to walk its cache list periodically and properly free everything.

        It’s also worth noting that benchmarking allocators is incredibly hard. For example, we found that snmalloc was much worse than everything else on one microbenchmark. We eventually tracked the issue down to cache contention. Our allocation pattern meant that the metadata ended up in a small handful of cache associativity sets and so we were using a small fraction of the total cache. We added some extra logic to add some jitter to the displacements to avoid this. The benchmark got faster, but everything else got slower.

        What made the benchmark special? It wasn’t ever touching any of the memory that it allocated and so almost 100% of the cache was available for the allocator to use. Making the benchmark touch even 20% of each allocation eliminated the gap between snmalloc and the other fast allocators.

        In practice, if you’re allocating a load of memory and then not touching it, the right fix is not to make the allocator faster for that workload, it’s to not allocate the memory (outside a few niche use cases such as kernel memory that you need to preallocate so an interrupt handler can use it in a situation where it can’t acquire locks).

        1. 12

          This example, right?

          It’s a great example because I didn’t think of it until someone pointed it out to me.

          There is a performance data point of that example program in the readme file in that repository.

          Your analysis is almost spot-on, but consider that it first tries to rotate before allocating another slab. So it is more likely that the rotation finds another freelist, and continues reusing memory from there.

          At first I thought that adding rotation to the free function would improve the situation. When I tried it, I found that in the linked example it indeed reduced peak memory usage in exchange for performance, but in real world scenarios, it actually performed worse than the simple free implementation on all metrics.

          1. 6

            but consider that it first tries to rotate before allocating another slab

            Rrright! I got confused by this likely:

            https://github.com/ziglang/zig/blob/5ad91a646a753cc3eecd8751e61cf458dadd9ac4/lib/std/heap/SmpAllocator.zig#L151

            Feels like it should be unlikely, as we enter there at most once.

            So, yeah, I think this actually fixes it! Because we rotate first, I think what’ll happen is the producer thread “chasing” after the consumer thread! Nifty!

            Still, I’d adjust that example to say while (true) and leave it running overnight, to make sure it doesn’t OOM eventually.

            1. 5

              Hm, thinking about this for another five minutes, I think this still breaks if you have, say, 4 producer threads and one consumer thread? In this situation, a producer will be more likely to rotate onto another producer. While sometimes we will steal the consumer’s free list, we should still fall back to slab allocation infinitely often? I’d run a benchmark, but I have a rule against laptops in bed :P

              1. 19

                Yup, there it is, O(∞) total memory usage for a program that needs only O(1) memory at any given moment in time: https://github.com/andrewrk/CarmensPlayground/pull/1

                1. 4

                  Great find. That prompts the question, is the next step to try to mitigate such problems in SmpAllocator, or keep it how it is, document its performance characteristics under different workloads, and create another allocator that is designed for mpsc? 🤔

                  I’m leaning towards the latter, since choosing an allocator tailored to the application is something Zig is uniquely positioned to offer.

                  1. 4

                    My gut-feeling answer is to just wait for some community member to write a cool pure-Zig allocator that can become the default recommendation but, at this point, I’ve been waiting for seven years for that to happen in Rust, and it hasn’t quite materialized (to be fair, for a long time writing allocators in Rust was hamstrung by slow thread-locals, but that was fixed years ago as well).

                    With my TigerBeetle hat on: never call free, problem solved :P

                    1. 2

                      I think the default allocator should perform tolerably well in all situations. SmpAllocator performs very badly in the benchmark that matklad presented, so I don’t think it should be the default. I think it would be fine to document the performance characteristics of SmpAllocator and leave it as an option.

              2. 4

                I guess this can be said more simply: in a producer-consumer scenario, if the producer and consumer have different thread_index values, then slab allocation happens eventually. If they have the same index, contention happens eventually and the indexes become different.

              3. 21

                The allocator design space is huge and nuanced, so at a cursory glance I’m taking the strong claims by the blog post and related PRs/issues with a little skepticism. Definitely looking forward to more rigorous comparisons with other widely used/SOTA allocators!

                EDIT: Forgot to mention that the code is readable and easily understood even though I don’t write Zig, really recommend at least perusing the source!

                1. 21

                  Yeah, one obvious thing is that I don’t see any measure of fragmentation. If you use more space, you can save time.

                  https://ziglang.org/devlog/2025/#2025-02-07

                  I only see 2 workloads there – a synthetic benchmark, and a Zig compiler workload. Allocator performance is VERY multi-dimensional, so you need more workloads to draw any conclusions.

                  e.g. test on say Firefox, Redis, and SPEC benchmarks, like this paper:

                  https://people.cs.umass.edu/~mcgregor/papers/19-pldi.pdf

                  and measure both space and time. Different programs have very different allocation patterns.

                  Otherwise I’d just say “outcompetes glibc malloc on the things I tried”, not “outcompetes glibc malloc”.


                  I wouldn’t say that Oils has an allocator that beats glibc either, even though here it runs in 74.5% of the time of glibc (IOW, more than 25% faster), as measured by cachegrind:

                  47.7 	_bin/cxx-opt/osh 	mut+alloc+free+gc
                  64.0 	_bin/cxx-opt+nopool/osh 	mut+alloc+free+gc 
                  

                  and on another workload it runs in 69.3% of the time of glibc:

                  25.4 	_bin/cxx-opt/osh 	mut+alloc+free+gc
                  36.6 	_bin/cxx-opt+nopool/osh 	mut+alloc+free+gc 
                  

                  https://oils.pub/release/0.27.0/benchmarks.wwz/gc-cachegrind/ (these measurements include GC time, but there is also a mode to exclude GC, and I know it’s faster in both modes)

                  Looking at the code, it’s 124 lines of C++.

                  Critical thinking exercise: why is this NOT surprising? (and arguably, not impressive)

                  1. 11

                    Also, while doing that work, I observed that glibc free() is often slower than malloc().

                    Periodically, there would be glibc functions called from free() that showed up at the TOP of the perf profiles.

                    Which I now know to be “coalescing”, i.e. merging adjacent free blocks to reduce fragmentation.

                    So you can make a faster allocator than glibc simply by removing the coalescing, at the cost of space.

                    The Zig allocator is not doing any coalescing.

                    https://sourceware.org/glibc/wiki/MallocInternals - some details on glibc internals

                    Claude AI claims that glibc uses Knuth’s boundary tag coalescing, described in 1973 in TAOCP. The wiki page doesn’t say that (it seems plausible from what I read?), but boundary tags are a real thing:

                    https://www.cs.cmu.edu/afs/cs/academic/class/15213-s12/www/lectures/18-allocation-basic-1up.pdf
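
                    As a toy illustration of why coalescing makes free() do extra work (my rough sketch of boundary tags, not glibc’s actual layout):

                      // Toy boundary tags: a block's size lives in its header, and a
                      // *free* block repeats it in its last word (the "footer"), so
                      // free() can reach both neighbors in O(1) and merge with
                      // whichever is free. Heap-boundary checks are omitted.
                      const Header = struct {
                          size: usize, // total block size in bytes, tags included
                          free: bool,
                          prev_free: bool, // real allocators pack this into a bit of `size`
                      };

                      fn next(h: *Header) *Header {
                          return @ptrFromInt(@intFromPtr(h) + h.size);
                      }

                      fn prev(h: *Header) *Header {
                          // the previous (free) block's footer sits right before our header
                          const prev_size: *const usize = @ptrFromInt(@intFromPtr(h) - @sizeOf(usize));
                          return @ptrFromInt(@intFromPtr(h) - prev_size.*);
                      }

                      fn writeFooter(h: *Header) void {
                          const footer: *usize = @ptrFromInt(@intFromPtr(h) + h.size - @sizeOf(usize));
                          footer.* = h.size;
                      }

                      // This neighbor-walking on every free() is exactly the work that a
                      // non-coalescing allocator skips.
                      fn freeAndCoalesce(h: *Header) void {
                          var block = h;
                          block.free = true;
                          if (block.prev_free) { // merge backward
                              const p = prev(block);
                              p.size += block.size;
                              block = p;
                          }
                          if (next(block).free) block.size += next(block).size; // merge forward
                          next(block).prev_free = true; // the (possibly new) next block sees a free neighbor
                          writeFooter(block);
                      }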


                    But this is actually separate from the point I was trying to make, which is that we “just” wrote a pool allocator in 124 lines of code (done by Chris Watkins)

                    And based on knowledge of the workload, we have a dirt-simple policy: there are 2 pools, one of 24 bytes and one of 48 bytes. That’s a lot faster than glibc for our workload.
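
                    The core of such a scheme is tiny. A hypothetical Zig rendering (the real Oils code is C++, and all names here are mine):

                      const std = @import("std");

                      // Hypothetical two-pool sketch: free() pushes onto an intrusive
                      // list, alloc() pops, and nothing is ever coalesced or returned
                      // to the backing allocator (that is the point).
                      const Node = struct { next: ?*Node };

                      fn Pool(comptime object_size: usize) type {
                          std.debug.assert(object_size % 8 == 0 and object_size >= @sizeOf(Node));
                          return struct {
                              free_list: ?*Node = null,

                              fn alloc(self: *@This(), backing: std.mem.Allocator) !*align(8) anyopaque {
                                  if (self.free_list) |node| { // fast path: pop a recycled object
                                      self.free_list = node.next;
                                      return @ptrCast(node);
                                  }
                                  // slow path: carve a fresh object out of the backing allocator
                                  const words = try backing.alloc(u64, object_size / 8);
                                  return @ptrCast(words.ptr);
                              }

                              fn free(self: *@This(), ptr: *align(8) anyopaque) void {
                                  const node: *Node = @ptrCast(ptr);
                                  node.next = self.free_list;
                                  self.free_list = node;
                              }
                          };
                      }

                      var pool24: Pool(24) = .{};
                      var pool48: Pool(48) = .{};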

                    So it’s easy to beat glibc on particular workloads.

                    mimalloc was also slower for us than this scheme - https://github.com/microsoft/mimalloc

                    I’ve heard the claim that “there is no such thing as general-purpose memory allocation”, and I tend to believe it … malloc/free is a bare-bones interface that can be used in a bazillion ways! There are many dimensions.


                    All that said, I’m not arguing this isn’t a good allocator for certain programs (“idiomatic Zig”?). And I’d bet that glibc was written / tuned in the days when hardware was much different, e.g. when cache misses were 10x-100x less expensive, relatively speaking.

                2. 12

                  the point is that it’s possible to implement a good memory allocator that is also conceptually simple.

                  All memory allocators start conceptually simple. And over time they become more and more heavy-weight as performance (both in latency and memory utilisation) on average and in the worst case is gradually improved.

                  glibc’s allocator design is ancient (it is a dlmalloc descendant, after all) and it prioritizes stability over performance. That is a sensible choice for something that is used by default on almost every Linux distribution. A modern general-purpose allocator (tcmalloc or jemalloc) almost always performs much better, so it would make sense to compare to those. Google and Facebook have spent tens of man-years tuning them, because every single program allocates memory, so improving such a base facility even a little pays off. (You may say that only poorly written software allocates memory left and right, but it is a fact of life that most software does; only things on the absolute hot path usually get optimized.)

                  For heavy multi-threaded use cases there are specialized allocators, often with different tradeoffs. Arguably, there are more “good” allocators targeting multi-threaded use than general-purpose use, because people usually write them once general-purpose solutions fail to solve a concrete problem they have, and on at least one real use case those allocators shine.

                  1. 7

                    One question that comes to mind is how this allocator can be hardened against attacks. Zig makes buffer overflow and use-after-free vulnerabilities less common than in C, but they can still happen, and when they do, allocator exploitation can provide a route to much more damaging attacks.

                    For example, since SmpAllocator uses a linked free-list, it looks pretty trivial to have it return any address you want, if you can modify a freed slot. Pointer encryption would be pretty easy to add to SmpAllocator, and can help with this, though it’s not foolproof.
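
                    A sketch of that pointer-encryption idea, in the spirit of glibc’s “safe-linking” (glibc uses (slot_addr >> 12) ^ next with no secret, relying on ASLR; the random secret here is my addition):

                      const std = @import("std");

                      // Free-list next pointers are stored XOR-mangled, so an attacker
                      // who can write a freed slot can no longer aim the next
                      // allocation at a chosen address.
                      var secret: usize = undefined; // fill once at startup: std.crypto.random.int(usize)

                      fn mangle(slot_addr: usize, next_addr: usize) usize {
                          return next_addr ^ (slot_addr >> 12) ^ secret;
                      }

                      // Demangling is the same XOR; a corrupted freed slot now yields a
                      // wild, unpredictable pointer rather than an attacker-chosen one,
                      // and can be sanity-checked (e.g. for alignment) before use.
                      fn demangle(slot_addr: usize, stored: usize) usize {
                          return stored ^ (slot_addr >> 12) ^ secret;
                      }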

                    1. 2

                      One example of an allocator with a lot of hardening would be OpenBSD’s malloc. It has canaries, some double-free and use-after-free protections, guard pages, and more.

                    2. 5

                      The power-of-two size classes remind me of phkmalloc. Newer allocators, jemalloc and so on, typically use log-linear size classes. I like Paul Khuong’s branchless log-linear bucketing https://pvk.ca/Blog/2015/06/27/linear-log-bucketing-fast-versatile-simple/
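
                      For illustration, a rough Zig sketch of log-linear bucketing (constants are made up, and this is not Khuong’s exact branchless formulation):

                        const std = @import("std");

                        // Log-linear bucketing sketch: 16-byte buckets up to 64 bytes,
                        // then four linear subclasses per power-of-two range.
                        fn bucketIndex(size: usize) usize {
                            std.debug.assert(size > 0);
                            if (size <= 64) return (size - 1) / 16; // classes 16, 32, 48, 64
                            const log: usize = std.math.log2_int(usize, size - 1); // size is in (2^log, 2^(log+1)]
                            const step = @as(usize, 1) << @intCast(log - 2); // the range, split into four
                            const within = (size - 1 - (@as(usize, 1) << @intCast(log))) / step;
                            return 4 + (log - 6) * 4 + within; // e.g. 65..80 -> 4, 81..96 -> 5
                        }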

                      Bucketing has some weird side-effects. There’s the wastage from rounding up allocation sizes. There are more subtle effects that occur with use-after-free errors in buggy programs: the more precise the allocator’s bucketing, the more type-stable (by accident) the memory.

                      I get the impression that glibc’s allocator makes its own life harder because it isn’t based on segregated size classes. So it has more problems with fragmentation, and is much more vulnerable to heap corruption from buggy programs.

                      1. 7

                        Both snmalloc and mimalloc copy the bucket strategy from SuperMalloc, which uses all values in the top three bits of the size, with all of the low bits then zero. This is also very quick to calculate from the size, but we found that on modern microarchitectures the fastest approach was a lookup table, not the bit twiddling we started with. This makes the worst-case overhead 12.5%. In practice it’s much lower, for two reasons:

                        For smallish allocations, the alignment requirement dominates and so you don’t see much overhead (from some old data, 48 bytes was the most common allocation size and we have zero fragmentation for it).

                        For large allocations, we are burning address space, not memory. The maximum wasted memory is one page minus one byte. Anything else will be demand paged by the kernel. We don’t use size classes for very large allocations (defined as anything larger than a chunk). We just allocate a set of chunks, but even if we did then the overhead would be address space not memory.
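
                        One way to realize the stated 12.5% worst case, as a sketch rather than snmalloc’s actual code, is to round up to the next multiple of one-eighth of the size’s power-of-two range:

                          const std = @import("std");

                          // Rounding sketch: each power-of-two range is split into 8
                          // steps, so the worst-case rounding overhead is 1/8 = 12.5%.
                          fn roundToSizeClass(size: usize) usize {
                              if (size <= 16) return 16; // illustrative minimum class
                              const bits = @bitSizeOf(usize) - @clz(size - 1); // size fits in `bits` bits
                              const step = @as(usize, 1) << @intCast(bits - 4); // one-eighth of the range
                              return (size + step - 1) & ~(step - 1);
                          }

                          test roundToSizeClass {
                              try std.testing.expectEqual(@as(usize, 48), roundToSizeClass(48)); // common size: exact fit
                              try std.testing.expectEqual(@as(usize, 72), roundToSizeClass(65)); // near the worst case
                          }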

                      2. 5

                        How much of this is due to Zig’s superior allocator interface? As I understand it, Zig’s free is given the allocated size, but C’s free is not (which is why free_sized was introduced in C23).

                        1. 9

                          In snmalloc, looking up the size from a pointer is sufficiently fast that we can do it in every memcpy with a negligible slowdown (and eliminate around 10% of vulnerabilities that lead to arbitrary code execution). It’s basically noise on the free path. We have an API if you know the size of the object but it makes very little difference to performance in practice if you use it.

                          1. 1

                            That’s quite interesting. Does it work even if the pointer is not the base pointer of the allocation?

                            I.e. memcpy(dest + 1, src + 1, len - 1)

                            1. 5

                              Yes. Given any address, we can quickly tell whether it’s a heap allocation and, if so, where the start and end are.

                              In my FreeBSD branch, I’m using snmalloc as the libc allocator and doing bounds checks on every memcpy. Some MS folks are porting it over to glibc for the distro Azure uses.

                              I had an old branch that also added these checks in things like sprintf, read, and a bunch of other libc places.

                          2. 8

                            The allocator interface is certainly load-bearing:

                            1. Both resize and remap are allowed to refuse (requiring the caller to make a fresh allocation, copy, and then free the original).
                            2. Allocator API users must report the requested size and alignment when freeing memory.
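
                            Point 1 shows up directly in caller code; a hypothetical helper against Zig 0.14’s std.mem.Allocator:

                              const std = @import("std");

                              // Hypothetical grow-in-place helper: resize() may refuse, and
                              // then the caller allocates, copies, and frees the original.
                              // free() gets the full slice, so the allocator is told the size.
                              fn grow(gpa: std.mem.Allocator, buf: *[]u8, new_len: usize) !void {
                                  std.debug.assert(new_len >= buf.*.len);
                                  if (gpa.resize(buf.*, new_len)) {
                                      buf.*.len = new_len;
                                      return;
                                  }
                                  // refused: fresh allocation, copy, free the original
                                  const new_buf = try gpa.alloc(u8, new_len);
                                  @memcpy(new_buf[0..buf.*.len], buf.*);
                                  gpa.free(buf.*); // the slice carries the old length
                                  buf.* = new_buf;
                              }
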
                            1. [Comment removed by author]

                              1. 1

                                Not really. It will copy the data inside the allocator rather than returning NULL. Also, it does not allow the caller to forbid the pointer location from changing.

                                If realloc returns NULL the correct failure path is to handle an out of memory error, not try to make a fresh allocation.

                                1. [Comment removed by author]

                                  1. 1

                                    Edited my combative message to the Socratic method instead:

                                    Are you saying that realloc returns NULL to indicate refusal to resize, as opposed to failure to resize, followed by failure to make a fresh allocation and copy the data over?

                          3. 2

                            From the module doc comment:

                            This works because the implementation refuses resizes that would move an allocation from small category to large category or vice versa.

                            Couldn’t it just copy the memory’s contents to move it? Or does Zig assume resizing memory won’t copy it?

                            Ooooh, looks like that is correct: Zig’s resize() function doesn’t allow the memory to move, and there is a separate remap() function that does. Interesting detail!

                            1. 3

                              Slightly more helpful link

                              I can give you an example of each:

                              • resize - used by ArenaAllocator to grow, where existing allocations must not be moved. If it cannot grow this way, ArenaAllocator falls back to allocating a new, bigger slab and linking back to the previous one for the purpose of freeing the arena.
                              • remap - used by ArrayList to grow, where it’s OK for the pointer address to change, but it would be better to do the memcpy in ArrayList than inside the Allocator implementation, since there’s no point in copying the unused capacity bytes. On the other hand, if the remap can be done with e.g. mremap, then it is worth it (see the sketch below).
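
                              A rough sketch of that ArrayList-style growth path (hypothetical helper, using Zig 0.14’s Allocator.remap):

                                const std = @import("std");

                                // Try remap() first: the allocator may move the block (e.g.
                                // via mremap) or refuse; on refusal, alloc + copy only the
                                // bytes actually in use.
                                fn growCapacity(gpa: std.mem.Allocator, len: usize, buf: *[]u8, new_cap: usize) !void {
                                    std.debug.assert(len <= buf.*.len and buf.*.len <= new_cap);
                                    if (gpa.remap(buf.*, new_cap)) |moved| {
                                        buf.* = moved;
                                        return;
                                    }
                                    const new_buf = try gpa.alloc(u8, new_cap);
                                    @memcpy(new_buf[0..len], buf.*[0..len]); // skip the unused capacity bytes
                                    gpa.free(buf.*);
                                    buf.* = new_buf;
                                }
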
                              1. 1

                                Makes a lot of sense, thank you.

                            2. 1

                              Very nice. How could this be adapted for a malloc-like API where “free” isn’t given the size of the block? My first guess is that you’d store the size class in the first few bytes of a slab, and then “free” would look it up by masking the block address. Or if the block address is already at the start of a slab, it must have been allocated by the slab allocator.
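
                              That guess might look roughly like this (entirely hypothetical; the slab_size and header layout are made up):

                                // Slabs are slab_size-aligned and store the size class at
                                // their base, so free() recovers it by masking the address.
                                const slab_size: usize = 64 * 1024;

                                const SlabHeader = struct {
                                    size_class: usize,
                                };

                                fn sizeClassOf(ptr: *const anyopaque) usize {
                                    const base = @intFromPtr(ptr) & ~(slab_size - 1);
                                    const header: *const SlabHeader = @ptrFromInt(base);
                                    return header.size_class;
                                }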

                              1. -10

                                So it is okay to write such long annotations to one’s own postings? Okay, I will do so next time :)

                                1. 18

                                  I think it’s not about yours/other’s, but rather the fact that this is a link to raw source code, not to a blog post about the source code.

                                  That is, you shouldn’t contextualize something which already explains itself, but if you are linking to a primary artifact, some explanation is in order!

                                  1. 7

                                    I think this is mistagged; it should be in the “show” category, where it is common to actually write a contextualising comment.