1. 8

  2. 12

    Rust's allocator API passes the size explicitly. This is not painful because in most cases it is not called directly by users.

    There is another implication of the C allocator API: not only does the allocator need to save the size, it also needs to look it up. Size lookup is expensive: https://github.com/openssl/openssl/issues/13787 suggests it accounts for approximately 30% of deallocation cost.
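    To make that cost concrete, here is a naive sketch (in C++, with made-up names) of what an unsized free() forces on an allocator: stash the size in a header on allocation, then load it back on every deallocation.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdlib>

    // Naive header-based scheme: every allocation pays a header to save the
    // size, and every deallocation pays a load to look it up again.
    // kHeader keeps the returned pointer maximally aligned.
    constexpr std::size_t kHeader = alignof(std::max_align_t);

    void* my_malloc(std::size_t n) {
        auto* raw = static_cast<unsigned char*>(std::malloc(kHeader + n));
        if (raw == nullptr) return nullptr;
        *reinterpret_cast<std::size_t*>(raw) = n;  // save the size
        return raw + kHeader;
    }

    std::size_t my_usable_size(void* p) {
        // the per-free lookup that an unsized API forces on the allocator
        return *reinterpret_cast<std::size_t*>(static_cast<unsigned char*>(p) - kHeader);
    }

    void my_free(void* p) {
        if (p != nullptr) std::free(static_cast<unsigned char*>(p) - kHeader);
    }
    ```

    Real allocators use cleverer schemes than a per-allocation header, but the save-then-look-up shape is the same.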

    One way Rust can be faster than C in the future is to exploit its better API that can be better optimized. This potential has not been realized yet.

    1. 3

      With snmalloc, we expose the Rust allocator interfaces and have had reports of some fairly significant speedups from Rust codebases. The only thing that we use the size parameter for is error checking: we ensure that it conforms to the size associated with the allocation (optionally, we don’t in the builds with the go-faster stripes). We had initially hypothesised that knowing the size would help but this turned out not to be the case.

      The first reason that it isn’t useful is that the caller doesn’t actually know the size of the allocation. The caller knows the size that it requested but the real size may have been rounded up. For example, if you ask for a 46-byte allocation, we’ll give you a 48-byte one. If you tell us that you have a 46-byte allocation then it’s not much easier for us to round that up to 48 than it is for us to just calculate the size (I think it may be slightly more costly).

      In snmalloc, the first thing that you need to do on the deallocation path is discover which allocator instance owns the object. If it’s the one being invoked, then it will handle the deallocation directly, otherwise it will send a message back to the originating allocator. We do this lookup by looking in a flat array indexed by the significant bits in the address (by default, we use 16 KiB chunks, and have one entry per chunk. On a system with a 48-bit address space this means that we need 2^34 entries, so we have a very large array where most of the pages are not allocated). When we complete that lookup, we know three things:

      • The location of the metadata for this chunk.
      • The location of the message queue for the owning allocator.
      • The sizeclass of the chunk.

      All of our freelists are indexed by sizeclass, not by size, and so if you give us a size then we first have to round it and then we have to map it to a sizeclass (this is a more complex calculation than mapping from sizeclass to size, but still fast enough that we can do it on our very-fast allocation hot path).
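      The asymmetry described here can be sketched like this (with a made-up size-class table, not snmalloc’s real one): mapping a size to a sizeclass needs rounding and a search, while mapping a sizeclass back to a size is just a table load.

      ```cpp
      #include <cassert>
      #include <cstddef>

      // A made-up size-class table (not snmalloc's real one).
      constexpr std::size_t kClasses[] = {16, 24, 32, 48, 64, 96, 128};
      constexpr std::size_t kNumClasses = sizeof(kClasses) / sizeof(kClasses[0]);

      // size -> sizeclass: round up and search (e.g. a 46-byte request lands
      // in the 48-byte class). Real allocators use bit tricks, not a loop.
      std::size_t size_to_class(std::size_t size) {
          for (std::size_t c = 0; c < kNumClasses; ++c)
              if (size <= kClasses[c]) return c;
          return kNumClasses;  // too large for the small size classes
      }

      // sizeclass -> size: the cheaper direction, a single table load
      std::size_t class_to_size(std::size_t c) { return kClasses[c]; }
      ```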

    2. 3

      On the Amiga the original memory allocation routine, AllocMem, did not save any metadata and so you had to pass a size to FreeMem.

      Later the OS included AllocVec and FreeVec, which stored metadata and thus didn’t require passing a size to FreeVec.

      1. 2

        It’s too bad that C++’s operator delete doesn’t take a size parameter, since it’s generally called in a context where the allocation size is known — when I delete a foo* the allocation is assumed to be the size of a foo, since that’s how much new foo allocated. The exceptions are

        • Deleting a pointer typed as a superclass; but this is illegal/UB unless the base class has a virtual destructor, and in that case an overridden destructor can make the delete call with the correct size.
        • Deleting an array, i.e. operator delete[]. Here I don’t think there’s a way to pass the size because it may not be known at the call site. Fortunately the corresponding operator new[] could stash the size before the pointer.
        1. 1

          an overridden destructor can make the delete call with the correct size

          I don’t know the first thing about c++, but I don’t think this works. Say we have A <: B <: C; C, the base class, has a virtual destructor; B implements it, and A does not override it. Then say we have an object of static type C and dynamic type A; A inherits B’s destructor, but it is not B-sized.

          1. 1

            Yeah, the compiler would have to emit a destructor for A to make this work. Another option is to store the instance size as an entry in the vtable.
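            For what it’s worth, this is observable with C++14 sized deallocation: compilers do emit a per-class deleting destructor, so a class-scope sized operator delete receives the size of the dynamic type. A small experiment (class names mirror the A <: B <: C example above; the recording global is mine):

            ```cpp
            #include <cassert>
            #include <cstddef>
            #include <new>

            static std::size_t last_deleted_size = 0;  // records what the compiler passed

            struct C {
                virtual ~C() = default;
                // C++14 sized deallocation: with only this overload declared,
                // `delete` calls it with the size of the object's dynamic type.
                static void operator delete(void* p, std::size_t sz) {
                    last_deleted_size = sz;
                    ::operator delete(p);
                }
                long c_member;
            };
            struct B : C { ~B() override {} long b_member; };
            struct A : B { long a_member; };  // inherits B's destructor body
            ```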

            1. 1

              It doesn’t, but it’s a common misconception that destructors are responsible for releasing memory, which is the error here.

            2. 1

              for array allocations in C++, a new ...[...] must be matched by delete [] .... The C++ runtime is expected to ensure that a call to new ...[...] includes an element count as a prefix at ((size_t*)the_array)[-1]. So the runtime knows at least the element count (needed for destructors), as well as the allocation size.

              Calling free() or delete on the output of new[] instead of delete [] is UB and will likely corrupt the heap. Similarly calling delete [] on the result of malloc/mmap/… is UB and will likely corrupt the heap. The fact that pointers and arrays are conflated in C and C++ is an endless source of misery and bugs.
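              The element-count prefix is observable from the destructor side: delete [] must run one destructor per element, so the runtime has to recover the count even though the call site passes only a pointer. A small check (Counter and the global counter are made up for the test):

              ```cpp
              #include <cassert>

              static int destroyed = 0;

              // delete [] runs one destructor per element, so the runtime must
              // recover the element count -- typically from the hidden prefix
              // described above -- even though delete [] receives only a pointer.
              struct Counter {
                  ~Counter() { ++destroyed; }
              };
              ```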

              For deletion of an object, the site of the delete is responsible for freeing the object, even though that site does not know what the concrete object is. An object’s destructor does not know how to free itself, as the destructor does not know where or how the object was allocated (e.g. global, stack, custom allocator, etc). delete is doing something very close to

              somePointer->~TypeName();
              free(somePointer);
              

              That’s why operator delete and operator delete[] are overridden separately from destructors.
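              Spelling those two steps out by hand (using ::operator delete in place of free(), since the storage came from ::operator new; Widget and manual_delete are made-up names):

              ```cpp
              #include <cassert>
              #include <new>

              static bool destructor_ran = false;

              struct Widget {
                  ~Widget() { destructor_ran = true; }  // releases resources, not memory
              };

              // The two separable steps that `delete p` performs:
              void manual_delete(Widget* p) {
                  p->~Widget();          // step 1: run the destructor
                  ::operator delete(p);  // step 2: free the storage (chosen separately)
              }
              ```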

            3. 1

              You do not necessarily need to save metadata.

              For example, consider the following allocator design: a large region of address space is reserved for small allocations. (Large allocations can be mmapped and have their metadata in a hash table or something; whatever.) The small allocations space is divided into chunks of, say, 64k each—some medium-ish power of two—and each chunk is carved up into a freelist for allocations of some size class. The ‘next’ pointer for each freelist can either be at the base of its chunk, or in some separate array indexed by the high bits of the chunk base pointer (latter is more space-efficient and protects integrity of heap internal data structures from user buffer overflows, but whatever). Then ‘free’ can begin with a cheap check that the pointer freed is in the small region; if it is, then it can do all it needs to simply by chucking the low bits of the pointer. It is then unnecessary to save metadata for every allocation.
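              A scaled-down sketch of that design (all sizes and names are illustrative, and a real allocator would handle exhaustion, thread safety, and large allocations): free() recovers the size class from the pointer’s bits alone, with no per-allocation metadata.

              ```cpp
              #include <cassert>
              #include <cstddef>

              // Toy region: a few 64 KiB chunks, each dedicated to one size class.
              constexpr std::size_t kChunkSize = 64 * 1024;
              constexpr std::size_t kNumChunks = 4;
              constexpr std::size_t kClassSizes[kNumChunks] = {16, 32, 64, 128};

              alignas(std::max_align_t) static unsigned char region[kNumChunks * kChunkSize];
              static void* free_heads[kNumChunks];  // one intrusive freelist per chunk
              static bool initialized = false;

              static void init() {
                  for (std::size_t c = 0; c < kNumChunks; ++c) {
                      unsigned char* base = region + c * kChunkSize;
                      const std::size_t sz = kClassSizes[c];
                      void* head = nullptr;
                      // carve the chunk into sz-byte slots, threading a freelist through them
                      for (std::size_t off = kChunkSize; off >= sz;) {
                          off -= sz;
                          *reinterpret_cast<void**>(base + off) = head;
                          head = base + off;
                      }
                      free_heads[c] = head;
                  }
                  initialized = true;
              }

              void* small_alloc(std::size_t size) {
                  if (!initialized) init();
                  for (std::size_t c = 0; c < kNumChunks; ++c) {
                      if (size <= kClassSizes[c] && free_heads[c] != nullptr) {
                          void* p = free_heads[c];
                          free_heads[c] = *static_cast<void**>(p);
                          return p;
                      }
                  }
                  return nullptr;  // too big or exhausted; a real allocator falls back
              }

              static std::size_t chunk_of(const void* p) {
                  auto off = static_cast<const unsigned char*>(p) - region;
                  return static_cast<std::size_t>(off) / kChunkSize;
              }

              void small_free(void* p) {
                  // cheap range check, then the size class comes from pointer bits alone
                  auto* b = static_cast<unsigned char*>(p);
                  assert(b >= region && b < region + sizeof(region));
                  *static_cast<void**>(p) = free_heads[chunk_of(p)];
                  free_heads[chunk_of(p)] = p;
              }

              std::size_t small_usable_size(const void* p) {
                  return kClassSizes[chunk_of(p)];
              }
              ```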

              1. 5

                The article already says this:

                A slab allocator that uses size classes doesn’t need size metadata for individual allocations that fall into size classes, because the size of an individual allocation is implicit in being allocated in a particular size class’s arena.

                The trouble is you have to work out which arena the pointer falls within, which is I believe not as trivial as you’re implying.

                1. 3

                  The trouble is you have to work out which arena the pointer falls within, which is I believe not as trivial as you’re implying.

                  For snmalloc, this is trivial: it is a mask, a shift, and a load. It will be slowest if that load misses in the cache.

                  Note that we don’t just do this for small allocations. Large allocations are allocated as power-of-two-sized chunks and have their size in the same location in the pagemap (we will reserve a power-of-two-sized range of VA space, we may not commit all of the pages for that range), so we can look up the size very cheaply for any allocation size.

                  This check is sufficiently fast that we can do it on every memcpy and check that the destination is either not a heap allocation, or is an in-bounds heap allocation, and still beat jemalloc in the small amount of benchmarking that I’ve done in that mode.
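                  A toy version of that pagemap lookup (scaled down to a tiny demo region; real snmalloc indexes the whole address space with a mostly-uncommitted flat array, and the field names here are made up):

                  ```cpp
                  #include <cassert>
                  #include <cstddef>
                  #include <cstdint>

                  constexpr std::size_t kChunkShift = 14;  // 16 KiB chunks
                  constexpr std::size_t kChunkSize = std::size_t{1} << kChunkShift;
                  constexpr std::size_t kDemoChunks = 8;

                  // One metadata entry per chunk (fields are illustrative).
                  struct ChunkMeta {
                      std::uint8_t sizeclass;  // recovered on every free
                      std::uint8_t owner_id;   // which allocator instance owns the chunk
                  };

                  static unsigned char demo_region[kDemoChunks * kChunkSize];
                  static ChunkMeta pagemap[kDemoChunks];

                  // The whole lookup: subtract the base (a mask, in the
                  // full-address-space version), shift, and one load.
                  ChunkMeta lookup(const void* p) {
                      auto addr = reinterpret_cast<std::uintptr_t>(p);
                      auto base = reinterpret_cast<std::uintptr_t>(demo_region);
                      return pagemap[(addr - base) >> kChunkShift];
                  }
                  ```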

                  1. 2

                    I believe not as trivial as you’re implying.

                    I don’t think this is a very constructive thing to say. I described a scheme, and why I think it’s cheap. If you think that it is in fact not cheap, or that the scheme doesn’t work, then you should say why.

                    Here’s something empirical: in mimalloc (which is a high-performance, general-purpose allocator, and which iirc uses a scheme which is not dissimilar to the one I described), there is a free function which takes an explicit size. The size is used only as a debug check; it is not used to speed up the freeing process at all.
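                    Abstractly, the pattern in that snippet looks something like this (toy code, with a map standing in for the allocator’s internal size recovery):

                    ```cpp
                    #include <cassert>
                    #include <cstddef>
                    #include <cstdlib>
                    #include <unordered_map>

                    // The map stands in for the allocator's internal size recovery
                    // (which it needs for plain free() anyway).
                    static std::unordered_map<void*, std::size_t> usable;

                    void* toy_alloc(std::size_t n) {
                        std::size_t rounded = (n + 15) & ~std::size_t{15};  // round up to 16
                        void* p = std::malloc(rounded);
                        usable[p] = rounded;
                        return p;
                    }

                    void toy_free_sized(void* p, std::size_t n) {
                        // the caller's size is only a sanity check; it is never
                        // used to find the allocation
                        assert(n <= usable.at(p));
                        usable.erase(p);
                        std::free(p);
                    }
                    ```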

                    1. 2

                      I don’t think this is a very constructive thing to say. I described a scheme, and why I think it’s cheap. If you think that it is in fact not cheap, or that the scheme doesn’t work, then you should say why.

                      I also agree, and it’s due to past experimental evidence, not something easily summarized in a few sentences.

                      1. 1

                        To be clear, I didn’t mean to imply that it is necessarily cheap to do a size lookup. Only that it may be. There may be good reasons to write a memory allocator which must do a size lookup if the user does not provide a size, and in which that lookup is quite costly. My point was simply that there can be performant schemes which require neither a user-provided size, nor metadata, nor a costly lookup. I think the mimalloc snippet I linked demonstrates this; mimalloc is quite well regarded.

                  2. 2

                    Yes, but the small-region check is not cheap at all. The 30% CPU overhead I quoted above is for exactly such a small-region-check design. The fast paths of modern mallocs are so fast that the cost of the check is not trivial.

                    1. 3

                      The actual quote is that

                      For jemalloc, sized deallocation is approximately ~30% cheaper in terms of CPU compared to unsized deallocations.

                      That doesn’t imply that the system malloc spends 30% of its time looking up sizes. My guess is jemalloc is heavily optimized for having the caller pass the size, leading to an expensive workaround when it’s not passed in. But a regular malloc is heavily optimized for not having the size passed.

                      (And one can’t generalize about malloc performance in general, because there are big differences between the Linux, Darwin and Windows implementations, plus I’m sure BSD and other OSs.)

                      1. 1

                        I expect you must be talking about something different than what I am talking about. The mechanism I described is extremely cheap; passing in a size would not be cheaper.

                        (Beyond which, performance is a bit irrelevant; my point was simply that it is not necessary to keep such metadata, not that it is necessarily desirable. Allocator design is complex and subtle, and mired with application-level concerns. My goal was not to give a sample design for a high-performance, general-purpose allocator.)

                        1. 1

                          Looking up size classes is not particularly expensive, but you have to understand that when numbers like 30% are thrown around you are talking about the general case release path looking like this

                          auto ownerHeap = lookupHeap(ptr);         // find the owning heap/slab
                          auto ptrAsLLNode = (LinkedListNode*)ptr;  // reuse the freed memory as a list node
                          ptrAsLLNode->next = ownerHeap->freeList;  // push onto the freelist
                          ownerHeap->freeList = ptrAsLLNode;
                          

                          That’s pretty much what they all look like, so 30% more expensive does not need lookupHeap to do much at all before it’s proportionally significant. Something can be proportionally expensive without being actually expensive.

                          I can’t recall how FastMalloc/TCMalloc worked (my experience was mostly beating them into some semblance of “secure”), but JSC’s GC uses freelists, and looking up the size class generally meant (at the time; I don’t know what happens now) masking off the low bits of the pointer to get to the base of the page allocation containing the relevant pointer, and thus that segment’s metadata. A big reason for needing this is looking up pointer-looking things on the stack, as JSC’s GC is conservative w.r.t. pointers on the stack.