Fortunately, heap allocation is not necessary to use C++ coroutines. However, C++20 coroutines do require dynamic allocation (memory that is allocated at runtime). Pigweed’s pw::async2::Coro API allocates memory using a pw::Allocator.
I don’t see a meaningful distinction between “heap allocation” and “dynamic allocation”, other than the scope of the allocator: whether it’s OS-provided and global, or custom and specific to some subsystem. Either way, it has the same effect.
What’s the benefit to using Pigweed’s custom allocator over the global operator new / malloc? If anything it’s a drawback, because in the case that multiple subsystems implement their own allocators, memory is wasted because each one is holding on to free space that the others can’t access.
Nevertheless, there’s a lot of interesting info here on the innards of C++ coroutines!
I don’t see a meaningful distinction between “heap allocation” and “dynamic allocation”, other than the scope of the allocator: whether it’s OS-provided and global, or custom and specific to some subsystem. Either way, it has the same effect.
A very meaningful distinction is that heap allocation, i.e. dynamic storage, will at some point call ::operator new, and has some chance of ending up doing a system call, which has latency implications: for instance, it increases the chances of the thread getting scheduled out. So far this has entirely precluded the use of coroutines in e.g. real-time audio processing loops, unless you could assert at compile time that HALO (the heap allocation elision optimization) went into effect and every memory allocation got optimized out.
If anything it’s a drawback, because in the case that multiple subsystems implement their own allocators, memory is wasted because each one is holding on to free space that the others can’t access.
It’s definitely the only sane way to design some systems when you really want to make sure you aren’t ever going to take a lock.
There are a few reasons why someone might prefer a custom allocator over a global new or malloc.
In some cases, it’s a hard requirement: not all platforms provide a global allocator. This is fairly common in embedded systems.
Custom allocators can be beneficial for reliability, too– the exact amount of memory available for the purpose of allocating a particular coroutine or set of coroutines can be fixed, and failure to allocate from that dedicated pool of memory can be handled gracefully, rather than allowing the coroutine(s) to gradually expand into memory that is dedicated for other tasks.
Finally, using a custom allocator can sometimes improve performance by taking advantage of cache locality or certain allocation patterns. For example, bump allocators and arena allocators can provide more efficient allocation because they only deallocate at the end of certain scopes or lifetimes. This makes allocation just a simple pointer increment and makes deallocation a total no-op until the end of the arena’s lifetime.
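For illustration, here is a minimal sketch of the bump/arena pattern described above. It is not Pigweed’s allocator; the class and method names are made up, and alignment is assumed to be a power of two.

```cpp
#include <cstddef>

// Minimal bump ("arena") allocator sketch: allocating is a pointer bump,
// freeing individual objects is a no-op, and everything is reclaimed at
// once when Reset() is called at the end of the arena's lifetime.
class BumpArena {
 public:
  BumpArena(std::byte* buffer, std::size_t size) : buffer_(buffer), size_(size) {}

  void* Allocate(std::size_t bytes, std::size_t align) {
    std::size_t offset = (offset_ + align - 1) & ~(align - 1);  // align must be a power of two
    if (offset + bytes > size_) {
      return nullptr;  // out of arena space; the caller must handle this
    }
    offset_ = offset + bytes;
    return buffer_ + offset;
  }

  void Deallocate(void*) {}  // intentionally a no-op

  void Reset() { offset_ = 0; }  // "frees" every allocation at once

 private:
  std::byte* buffer_;
  std::size_t size_;
  std::size_t offset_ = 0;
};
```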
You’re preaching to the choir! I agree custom heap allocators can be useful. But they’re still heaps. I found your article a bit misleading because you imply you’ve found a way to run C++ coroutines without heap allocation, which you haven’t. You’ve just built your own heap.
Sorry for the misdirection! I didn’t intend to be misleading– this is an unfortunately common terminology mismatch. I’m using “heap allocation” to mean specifically the provided malloc/free/new/delete, and “dynamic allocation” to refer to the more general case of runtime memory allocation.
I do also discuss in this post why dynamic allocation is needed and how I hope its use can be avoided in the future, so I think it is still relevant to those hoping to avoid dynamic allocation completely (though I’m sad that it isn’t currently possible today).
In some cases, it’s a hard requirement: not all platforms provide a global allocator. This is fairly common in embedded systems.
But this requires you to bring along an allocator. And if you are going to bring along an allocator, why not just implement the global new and delete overloads?
The reason that I would want to avoid global new / delete on embedded platforms is that allocation is more likely to fail (because memory is tightly constrained), so I need the caller to handle the failure case; and since I’m on an embedded platform that doesn’t support exceptions, it needs to handle allocation failure via the return value.
Your article mentions this about half way down:
Allow recovering from allocation failure without using exceptions.
From the API surface, it looks as if you provide a hook in the promise object that means that allocation failure can return a constant. This is really nice: you can have a coroutine call return something like an ErrorOr<T> and then return the error value if it fails to allocate space for the coroutine.
It would probably be nice to lead with that. I had to read a lot to figure out that you’ve actually solved the bit of the problem that I cared about.
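For readers who have not seen the hook in question: the standard mechanism is a non-throwing operator new on the promise type plus a static get_return_object_on_allocation_failure(). Below is a minimal sketch, using a made-up Task type rather than Pigweed’s actual Coro/Result API.

```cpp
#include <coroutine>
#include <cstddef>
#include <new>

// Hypothetical coroutine return type: a null handle means "failed to start".
struct Task {
  struct promise_type {
    // Non-throwing frame allocation; returning nullptr triggers the failure path.
    static void* operator new(std::size_t size) noexcept {
      return ::operator new(size, std::nothrow);
    }
    static void operator delete(void* p) noexcept { ::operator delete(p); }

    // Called instead of get_return_object() when operator new returns nullptr,
    // so the caller receives a constant "error" Task rather than an exception.
    static Task get_return_object_on_allocation_failure() { return Task{nullptr}; }

    Task get_return_object() {
      return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
    }
    std::suspend_always initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };

  explicit Task(std::coroutine_handle<promise_type> h) : handle(h) {}
  explicit Task(std::nullptr_t) : handle(nullptr) {}

  bool failed_to_allocate() const { return !handle; }

  std::coroutine_handle<promise_type> handle;  // destruction elided for brevity
};
```

With this shape, a coroutine whose frame allocation fails hands back the sentinel Task instead of throwing std::bad_alloc, so the caller can check for failure by return value.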
Yes, handling the failure case gracefully is super important! Re ErrorOr<T>: Pigweed provides a pw::Result<T>, which is exactly this– either a T or a pw::Status (our error code).
Coro<T> requires that T be convertible from pw::Status so that it can produce a value of T from pw::Status::Internal() in the case of allocation failure.
Sounds nice. I might have a look and see if it can be ported to CHERIoT.
Custom allocators can be beneficial for reliability, too– the exact amount of memory available for the purpose of allocating a particular coroutine or set of coroutines can be fixed, and failure to allocate from that dedicated pool of memory can be handled gracefully, rather than allowing the coroutine(s) to gradually expand into memory that is dedicated for other tasks.
I’m going to flip this statement on its head for a sec. Using a fixed-sized arena allocator or something like that is great for having guarantees that you’re not going to blow up your heap. But… do we know at compile-time how big the allocations are going to be? Instead of using a fixed-sized arena, would it be possible to use something like a freelist allocator for each type that needs to get allocated? Like I could declare in my code that I’m going to assign a static allocation for 10 Coro objects and 30 frames; the allocator could return those from the freelist and return them to the freelist upon deallocation (both O(1) without needing to ever end the arena’s lifetime).
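For concreteness, a sketch of the kind of fixed-capacity freelist pool being proposed here (the capacity, slot size, and names are illustrative, not part of Pigweed):

```cpp
#include <array>
#include <cstddef>
#include <new>

// Fixed-capacity freelist pool for objects of at most MaxSize bytes.
// Allocation and deallocation are both O(1) and never touch the system heap.
template <std::size_t MaxSize, std::size_t Capacity>
class FreelistPool {
 public:
  FreelistPool() {
    // Thread every slot onto the free list.
    for (auto& slot : slots_) {
      slot.next = free_;
      free_ = &slot;
    }
  }

  void* Allocate(std::size_t size) {
    if (size > MaxSize || free_ == nullptr) {
      return nullptr;  // caller must handle exhaustion gracefully
    }
    Slot* slot = free_;
    free_ = slot->next;
    return slot->storage;
  }

  void Deallocate(void* p) {
    Slot* slot = reinterpret_cast<Slot*>(p);  // same address as the slot's storage
    slot->next = free_;
    free_ = slot;
  }

 private:
  union Slot {
    Slot* next;  // used while the slot sits on the free list
    alignas(std::max_align_t) unsigned char storage[MaxSize];
  };
  std::array<Slot, Capacity> slots_{};
  Slot* free_ = nullptr;
};

// e.g. static storage for a handful of coroutine frames of up to 256 bytes each:
// FreelistPool<256, 10> coro_frames;
```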
The size of the coroutine state will depend on the captured state (automatic variables at suspension points) so much like a lambda (closure type), its size is known by the compiler at compile time, but as far as I’m aware there’s no clear way to obtain it in your code at compile time, for example to be used as a template argument.
Yeah, this isn’t possible today, sadly– I discuss this a bit at the end of the post. If C++ offered a way to inspect the size of the coroutine of a specific function implementation, there would be no need for dynamic memory allocation: we could pre-reserve exactly the required amount of space. This would have huge benefits both for reliability and efficiency– no more need to guess at and adjust the right size for your allocators.
Little late getting back to this but I’ve had a little time to dig into some of the WGxx documents and I understand quite well why we can’t do this (yet anyway, hopefully someday!) For using it on an embedded system though, the HALO thing scares me a whole lot more than having to pre-reserve some space for an allocator (which also scares me less than having hidden/implicit malloc behind the scenes). On most of the smaller-sized embedded systems I work on, running out of space in a bounded arena and returning an error is a manageable thing. Changing the function that sets up a coroutine a little bit, resulting in HALO being available and significantly changing how much stack space is required… that’s spooky! At least with running out of arena space there’s well-defined error semantics. Walking off the end of a fixed-sized stack usually just means corruption.
I’m still digging a little more… it seems like there is a way to ensure HALO is disabled, but having it potentially on by default seems like a footgun for embedded, and we’ve got lots of footguns already!
(I’m still super intrigued though. This looks super cool!)
AFAIR the issue is that the size of the coroutine is only known by the compiler after optimization passes (as it depends on how local variables are stored, which lifetimes overlap or not, and what can be reused), but the sizeof(...) operator is typically implemented in the frontend. The compiler writers said they would not be able to implement a coroutine proposal with a frontend-known size without heroic efforts. This was discussed extensively during the design of C++ coroutines, including a competing “core coroutine” proposal appearing and ending up dying on this same issue. Some alternatives were explored, such as types that did not support sizeof(...) but could still be stored on the stack, but in the end storing the coroutine on the heap with an escape hatch for elision was deemed the best compromise.
Ahhhhh interesting, that makes sense. I’m going to put it on my list of things to play with :). Cool project!
.. is it though? how does that work if you have
It’s the upper bound that’s of interest, as is common when you’re pre-allocating. So compute both and take the largest.
A more general case of the example above would be if `a` were a variable-length array with `x` as the size. I would absolutely never write such a thing in embedded code (and typically not in normal C++ if there’s a chance an attacker can influence the value of `x`), but it is permitted.
Exactly: after all, you can have recursive calls or alloca, so there’s no way to precompute the entire required stack size statically.
Yeah I wasn’t thinking about the general case too much, just the specific example, and assuming the statically sized array was the point that seemed problematic. You’re right it’s not always possible… in most languages.
Rust does know the size of each Future, but it doesn’t let you have dynamically sized values on the stack (on stable). And using that unstable feature, it’s not possible to hold a dynamically sized value across await points (#61335) as await points are where the size needs to be known to be stored in the generated Future’s enum variant.
Another reason is to support safe memory reclamation for a concurrent data structure.
Not sure what you mean by this? Just about all system allocators are thread-safe.
“Safe memory reclamation” is special jargon in this context.
If you are using RCU, hazard pointers, or other similar lock-free techniques, then a write to a data structure cannot immediately free() any memory that was removed, because the memory can still be in use by concurrent readers. You have to wait for a grace period or quiescent period to pass before it can be freed. You don’t want to do this as part of the write transaction because that will ruin your write throughput: grace periods are relatively long and can accommodate multiple write transactions.
So you need to hang on to the to-be-freed memory somehow and reclaim it later. Sometimes the whole concurrent algorithm is easier if you guarantee type-stable memory by using a region of some kind. Sometimes you can save significant amounts of memory (a pointer per node) by using a GC instead of a to-be-freed list.
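A tiny sketch of the “hang on to the to-be-freed memory and reclaim it later” idea. Grace-period detection is assumed to happen elsewhere, and a real implementation would typically use per-thread retire lists rather than a mutex; the names here are made up.

```cpp
#include <mutex>
#include <vector>

// Writers remove nodes from the shared structure but do not free them
// immediately; they retire them here. Once a grace period has elapsed
// (no reader can still hold a pointer), everything retired before that
// point is reclaimed in one batch.
class RetireList {
 public:
  void Retire(void* node, void (*deleter)(void*)) {
    std::lock_guard<std::mutex> lock(mutex_);
    retired_.push_back({node, deleter});
  }

  // Called by whatever detects the end of a grace period.
  void ReclaimAll() {
    std::vector<Retired> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch.swap(retired_);
    }
    for (auto& r : batch) {
      r.deleter(r.node);  // destruction + deallocation deferred until now
    }
  }

 private:
  struct Retired {
    void* node;
    void (*deleter)(void*);
  };
  std::mutex mutex_;
  std::vector<Retired> retired_;
};
```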
I’m not sure how custom allocators help here. They are not invoked until after the destructor has run, so prolonging the lifetime of the allocation using a custom allocator doesn’t help because the object is gone (you may get an arbitrary bit pattern if you try to read it). Things like RCU work fine without a custom allocator because they defer destruction, not deletion.
They are not invoked until after the destructor has run,
Note that C++20 adds “destroying delete”, which lets you define allocators that take care of the destruction themselves. But as the linked article says, it’s not standardized for the allocator methods on coroutine promise types >_<
https://stackoverflow.com/questions/67595789/what-is-destroying-operator-delete-in-c20
Interesting. The Solaris VMEM allocator gets some really nice performance benefits by resetting deallocated objects to a partially initialised state (allocation often happens on hot paths, deallocation is easy to defer. You can defer not needing an object far more easily than you can defer needing one). I believe that could be represented in C++20 now, because delete would not call the destructor and would just add the object to the pool, leaving it as a valid object of the correct type.
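A rough sketch of how that pattern might look with C++20’s destroying delete. The pool hooks are hypothetical, and this is not how VMEM itself is implemented; it just shows delete returning a still-constructed object to a pool.

```cpp
#include <new>

struct Connection {
  // ... members that stay valid between uses ...

  // C++20 destroying delete: `delete p;` calls this INSTEAD of running
  // ~Connection() first, so we can skip destruction entirely and hand the
  // still-constructed object back to a pool in a partially initialised state.
  void operator delete(Connection* p, std::destroying_delete_t) {
    p->ResetForReuse();   // cheap partial re-initialisation, not destruction
    ReturnToPool(p);      // hypothetical pool hook; memory is not released
  }

  void ResetForReuse() { /* clear per-use state only */ }
  static void ReturnToPool(Connection*) { /* push onto a freelist */ }
};

// Usage: Connection* c = /* obtained from the pool */;
// delete c;  // does not run the destructor; the object goes back to the pool
```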
I don’t know how you can do things like data-structure-specialized GC or type-stable memory without a custom allocator. You don’t need those techniques for RCU (RCU was an example to set the scene) but they are sometimes useful for concurrent data structures. I also dunno how this would fit with the C++ facilities for custom allocators — I guess you have to avoid destruction until the memory is about to be freed (especially for type-stable memory)
Type-stable memory can be done with custom allocators, but it’s often better to do it with a class overriding operators new and delete. You can use CRTP to implement a superclass that provides a custom memory pool for each type.
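A sketch of the CRTP shape just described. The per-type pool is reduced to a stub so the structure is visible; a real pool would be a fixed-capacity freelist over static storage.

```cpp
#include <cstddef>
#include <new>

// CRTP base: every class Derived that inherits PoolAllocated<Derived> gets
// its own static pool, and its operator new/delete route through that pool,
// giving per-type (and, with a real pool, type-stable) storage.
template <typename Derived>
class PoolAllocated {
 public:
  static void* operator new(std::size_t size) {
    return Pool().Allocate(size);
  }
  static void operator delete(void* p) noexcept {
    Pool().Deallocate(p);
  }

 private:
  // Stub pool for illustration only; it just forwards to the global heap.
  struct TypePool {
    void* Allocate(std::size_t size) { return ::operator new(size); }
    void Deallocate(void* p) noexcept { ::operator delete(p); }
  };
  static TypePool& Pool() {
    static TypePool pool;  // one pool per Derived instantiation
    return pool;
  }
};

// Usage: struct Node : PoolAllocated<Node> { int value; };
// `new Node` and `delete node` now go through Node's dedicated pool.
```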
I could be wrong but my recollection is that they use alloca which allocates arbitrary stack space instead of using actual allocators (which typically involve a system call in the worst case).