If you asked for 1 byte, you are probably getting a larger chunk of usable memory: maybe between 16 bytes and 24 bytes. Indeed, most memory allocations are aligned (rounded up) and there is a minimum size that you may get.
I was reading about memory alignment recently. As I understand it, C23 clarified that there’s no requirement to max_align_t-align everything (not implying that the article says otherwise).
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2293.htm, “Alignment requirements for memory management functions”:
The alignment requirements of malloc, calloc, and realloc are somewhat confusingly phrased, in a way that affects small allocations (sizes less than _Alignof(max_align_t)). Some readers (and implementations) interpret them to demand _Alignof(max_align_t)-alignment even for allocation sizes that could not hold an object with that alignment. We call this the “strong-alignment” reading. […]
Many implementations only provide weak-alignment guarantees […]. Therefore, we propose clarifying the wording such that the weak-alignment implementation is unambiguously allowed.
What’s interesting is that the documentation for https://en.cppreference.com/w/c/types/max_align_t contradicts this:
Pointers returned by allocation functions such as malloc are suitably aligned for any object, which means they are aligned at least as strictly as max_align_t.
Here’s a discussion about this (which led me to the standards document): https://reviews.llvm.org/D118804
As the article mentions, there’s typically a practical concern before the standard one, that limits the smallest chunk that you can hand out: you need to track freelist metadata somehow.
The two memory allocators I’m most familiar with are fairly representative of the two common approaches. Snmalloc is a modern sizeclass-based allocator that is intended for high performance on large (i.e. mobile phone up to server) systems. CHERIoT RTOS’ allocator is a dlmalloc-based range allocator, optimised for memory consumption on small devices. In both cases, the smallest size that can be allocated is >1 byte.
Snmalloc allocates objects of the same size class from chunks. By default, I think (we’ve changed it a few times), chunks are 64 KiB. When you allocate a new chunk (which happens the first time you allocate an object of a particular size class), it constructs a linked list of objects of that size in the chunk and then each subsequent allocation is just popping from the head of that list. Allocation is around 14 x86 instructions on the fast path. To be able to build the freelists, you need to be able to store a pointer (or, at least, a chunk offset) in each object and, for some security, there’s an option to store these with some error checking, which means that you can’t really scale the size classes down to less than 16 bytes very easily.
This approach is nice because there’s no per-object metadata. Wasted memory comes from two kinds of fragmentation. If a slab is partially used, the unused objects take space. New allocations can be made from these chunks, so that overhead tends downwards as more memory is allocated; the worst-case (relative) overhead is when you have very low total memory usage (which is why we don’t use this allocator in CHERIoT RTOS, but why I do prefer to use it everywhere else). The number of size classes is also finite. Snmalloc uses the ripple pattern that, I think, was first proposed by SuperMalloc, where each size starts with 0b100, 0b101, 0b110, or 0b111. This ensures that, in the worst case, the allocated size is 12.5% larger than the requested size.
In the dlmalloc-style allocator, the allocator stores a header before each chunk. Ranges are created by splitting other ranges. If you want to allocate 512 bytes, it will find a chunk that is at least 512 bytes (there are lots of different approaches here as to how it finds the chunk) and then split off a piece that is 512 bytes plus a header. If you allocate an entire heap full of 8-byte objects and then free every other one, you fragment the heap and no subsequent allocation can succeed, but in general this approach works pretty well. The downside is that you need to be able to have the header adjacent to the object. For non-CHERI systems, this is a massive security risk because a one-byte underflow can corrupt allocator state and give an attacker a huge amount of control. For CHERI systems, it’s fine because the malloc consumer doesn’t have access to the header. You typically keep the headers aligned to simplify combining ranges on free, so that any combined range is always a max_align_t-aligned thing at the start and end. In CHERIoT we have the additional constraint that our hardware temporal safety mechanism works in 8-byte chunks and so there’s no way to free anything smaller than 8 bytes.
The one bit I’d disagree with in the article is the last line:
Finally, you should make use of realloc if you can as you can often extend a memory region, at least by a few bytes, for free
On a range-based allocator (on non-CHERI systems), realloc may be able to extend. You’ll have some surprising performance when it doesn’t. You also need to be very careful with realloc because the API is horrible:
If it fails, it does not free the source and so you have to make sure you keep a copy.
Comparing the copy that you’ve kept against the newly returned value is really hard to do without introducing undefined behaviour.
Unless there’s a huge performance win from using realloc, don’t. It isn’t worth the code complexity. And measure it every week because small changes to your allocator (or other allocation patterns) may make it go away.
For a sizeclass-based allocator, realloc is always either a no-op (the new size is the same sizeclass) or it is a malloc, memcpy, free sequence. In the latter case, it’s much clearer to do this explicitly and APIs such as malloc_usable_size will tell you whether you need to do it.
In short, realloc should be regarded as an experts-only function. Any use of it is likely to either be wrong now, or wrong after some unrelated changes to the program (including libraries that it depends on, including libc). Friends don’t let friends call realloc.
I’ve previously moaned about how terrible realloc() is. This week I learned that there are even more problems!
There’s a pointer provenance footgun with realloc(): you cannot compare the pointer you passed in with the pointer it returned to find out whether the allocation moved, because after a pointer has been free()d it cannot be used for any purpose. (Maybe you can get away with a uintptr_t circumlocution?)
There is an issue related to dynamic size checks (FORTIFY_SOURCE) and malloc_usable_size(). It isn’t entirely clear what the semantics are, or even what they should be. If you want to avoid trouble, you must never stray outside the size you passed to malloc(), regardless of what malloc_usable_size() says; if you want to use the usable size, you must explicitly realloc() to grow the allocation to the usable size, and (per the previous paragraph’s footgun) you must remember not to use the original pointer after calling realloc() because its provenance is different even though you know its bit pattern is the same.
Good grief!
you cannot compare the pointer you passed in with the pointer it returned to find out whether the allocation moved, because after a pointer has been free()d it cannot be used for any purpose.
Are you 100% sure the standard is intended to be interpreted that way? The last time I investigated this issue I remember finding sufficient evidence that comparing freed pointers was well defined. It’s only undefined to dereference the freed pointer. The same semantics apply to null pointers, it’s perfectly fine to compare any pointer to a null pointer but undefined to dereference them.
Comparison is problematic in a provenance-carrying model (there’s a WG14 task group trying to define what the provenance-related semantics of C actually are). If I allocate an object, keep a pointer to it, free the object, and allocate a new object that happens to be at the same address, what should the result of comparing the two pointers be? Valid options are:
True, they are the same address, pointers are a thin veneer around BCPL addresses, they compare equal.
False, a newly allocated object is guaranteed not to alias anything. The implementation may assume this property, optimise based on it, and must maintain sufficient ghost state to track it.
Undefined, the implementation may track ghost state and assume that the new pointer does not alias any existing ones but may fall back to dynamic address comparison if there is insufficient static ghost state.
Typically, the third interpretation is used by compilers and folks in WG14.
This is even more interesting on platforms like CHERIoT where the hardware maintains the required ghost state and so we can differentiate between a pointer to a live object and a pointer to a freed object at the same address (the former will work, the latter will trap if you try to use it). Even there, it’s fun because we can’t differentiate between two pointers to different freed objects of the same size and at the same location (neither pointer is usable and both are the same bit pattern in memory).
The intention of the standard writers matters less than how the standard is interpreted by the compiler writers 🙃 But I think in this case it’s clear that the standard requires that free() has spooky side-effects on its argument. And it’s possible to imagine how such spooky side-effects could realistically happen: a compiler might mark a variable as dead when it is passed to free(), which would cause any subsequent uses of the variable to be compiled as if it were uninitialized, so a subsequent test for equality could use any arbitrary value instead of its value before free().
The following quotes are from n3220, which is C23 with one trivial editorial change in a footnote in Annex K. Note that in C23 non-value representation is the new terminology for what was called a trap representation in earlier editions of the standard.
7.24.3 Memory management functions
… The lifetime of an allocated object extends from the allocation until the deallocation. …
6.2.4 Storage durations of objects
… The representation of a pointer object becomes indeterminate when the object the pointer points to (or just past) reaches the end of its lifetime.
3.23 indeterminate representation
object representation that either represents an unspecified value or is a non-value representation
3.22.2 unspecified value
valid value of the relevant type where this document imposes no requirements on which value is chosen in any instance
3.24 non-value representation
an object representation that does not represent a value of the object type
6.2.6 Representations of types
6.2.6.1 General
Certain object representations do not represent a value of the object type. If such a representation is read by an lvalue expression that does not have character type, the behavior is undefined. …
When I looked at this issue I was looking at the C89 spec. For various reasons the text of the C23 standard is irrelevant to me, but point well taken and appreciated.
As an aside, reflecting on my past usage of realloc(), I never had to check whether the allocation moved. For my own curiosity and edification, do you have a representative use case where detecting allocation movements was necessary?
I have a dead tree copy of C89 which says it is undefined behaviour much more straightforwardly than C23. (quotes below)
I don’t remember ever wanting to check if a realloc() didn’t move, but I imagine if it contained interior pointers, someone might want to skip fixing them up when possible.
7.10.3 Memory management functions
… The value of a pointer that refers to freed space is indeterminate.
3.16 undefined behaviour: behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements. …
There is an issue related to dynamic size checks (FORTIFY_SOURCE) and malloc_usable_size(). It isn’t entirely clear what the semantics are, or even what they should be. If you want to avoid trouble, you must never stray outside the size you passed to malloc(), regardless of what malloc_usable_size() says; if you want to use the usable size, you must explicitly realloc() to grow the allocation to the usable size, and (per the previous paragraph’s footgun) you must remember not to use the original pointer after calling realloc() because its provenance is different even though you know its bit pattern is the same.
For CHERI C, we make these guarantees a bit more explicit: The size of the object returned by a successful malloc must be at least the size requested and the caller may use all of that object.
This makes a bit more sense on a CHERI system, where any read or write out of the bounds of the pointer returned by malloc will trap and the bounds of the pointer can be checked. It’s fine to call malloc and then pass the pointer to something else that then writes data into the entire buffer using its size, which means that malloc must either:
Return the exact requested bounds, or
Return a larger bounds but not allocate anything else into the extra space for the lifetime of the object.
It’s probably okay for it to be UB to write past the end of the requested size without checking the size of the returned value because the allocator may return something with precise bounds, but I’d almost want that to be IB because it’s fine to write past the end if you know the rounding rules that the allocator is using (or the representable bounds of your capability encoding, which gives the lower bound on the amount of padding the allocator must provide).
Yeah, you can do that when you are best buddies with your compiler and libc, and you know that you will never tickle __builtin_dynamic_object_size in a way it doesn’t like :-) Code that also runs in Red Hat land needs to be more circumspect.
Don’t some size-class allocators use a bitmap in the chunk header to determine which blocks are free? With that, you can allocate individual bytes with only 1 bit of overhead.
I’ve seen this done, but for a one-byte size class you’d have 1/8 overhead, for a very rare use case. You probably allocate so few individual bytes that you waste more memory by allocating a chunk to hold one allocation and not combining it with other small allocations. The bitmap approach is much better for large size classes where you can fit the bitmap in one or two registers and use clz instructions to find the one to pop.
The problem in general with the bitmap is that it’s O(1). I think SuperMalloc (or one of its rough contemporaries: I read a lot of malloc papers at the same time and they blur together a bit) did a hybrid where it had a free list as a linked list of 64-bit bitmaps, so you did clz to find the offset in the small chunk and then masked out that bit. If the result is zero, you popped the front from the list.
The problem in general with the bitmap is that it’s O(1).
Why is O(1) a problem? Surely that’s what you strive for no?
Also you might be able to build some of the smaller size classes on demand? e.g. start at 8 or 16 bytes, but if you have a request that’s lower than that then you can build an ad-hoc arena out of an allocation into your lowest size class. That way you don’t have the upfront overhead of allocating, say, 72 bytes for 64 bytes of capacity of which you might only use 3; however, those 3 bytes would only take a single slot in your smallest arena.
Why is O(1) a problem? Surely that’s what you strive for no?
When it’s an iPad-induced typo and I meant O(n).
If you have a large bitmap, you need to scan the entire size. It’s fine as a space / cache optimisation for fixed-sized chunks that fit in registers (e.g. a 64-bit bitmap) but if it grows to multiple KiBs it’s far slower than a linked list, where popping the head is O(1), the head is always in cache, and you want to pull the next node into cache so that it can be used immediately anyway.
Also you might be able to build some of the smaller size classes on demand?
The mapping from size to size class is one of the most performance critical bits of a sizeclass allocator, so you absolutely don’t want to do anything dynamic there. You can do something on the slow path where you find that you don’t have something for a given size class, but given that one-byte allocations are so rare you may as well just map everything under 8 or 16 bytes to the smallest non-zero size class.
The Windows allocator uses an interesting hybrid approach. It starts looking like a range-based allocator, but tracks the number of allocations it’s seen in a given sizeclass and then moves to a sizeclass approach for common sizes.
Now you need to track that arena somehow, and every free into that sizeclass needs to check “was that in an arena?”
Yes, I’ve seen this done, but each bit represented 8 bytes or 16 bytes (I can’t remember which).
i.e. >10%, and who allocates individual bytes?
Which is less than the 800 or 1600% you get if your smallest effective allocation is 8 or 16 bytes.
People who need them? e.g. if you need an atomic boolean or byte and it can’t be static/global, you’ll have to allocate a byte.
Although, to be fair, in that case an arena might be counter-productive due to false sharing.
when do you need an atomic byte and that’s the only thing you need and you need it to be shared?
A flag or a small semaphore are pretty reasonable things.
An atomic only makes sense if you’re sharing it, if it’s not shared it does not need to be atomic.
Typically, a lock or semaphore is protecting something and so you can just embed the lock and the thing in the same allocation.
More practically, you almost never want a spin lock, you want something more futex-like where the spin lock is the fast path and your thread can sleep and be woken when the lock is available. Most futex-like APIs work on 32- or 64-bit values. I think NT is the only common kernel that has a 1-byte futex-like operation.
typically the flag/semaphore will be embedded in and associated with some larger data structure, which larger structure will be what’s shared; not the flag specifically. then you don’t have a 1-byte allocation
OK smartypants, how about “…you can allocate arbitrarily small sizes down to a single byte with only 1 bit of overhead.”
you could allocate even smaller quantities that way if you had use for them, and you could also do lower expected space overhead (assuming workloads which genuinely need such tiny allocations are low-entropy in the same way as workloads that actually exist)