Use-after-free bugs are one of the hardest problems to solve on the road to C++ memory safety. The solution provided by MiraclePtr is a good compromise between safety, compatibility, and efficiency. It’s encouraging to see evidence that it is effective, but it would be nice to do even better here.
MiraclePtr is limited to class members of non-trivial classes. This is because classes that can be trivially constructed or destructed can be memcpy’d, and that sidesteps the ref counting that MiraclePtr does. In general, this is one of the biggest issues that makes pointer invalidation hard to do efficiently. The way most UAF mitigations work is by quarantining freed memory so that when a dangling pointer is used, that memory will not yet have been reused by the allocator; if it had been, attacker-controlled data could have been placed there. The question then becomes: how long should freed memory be quarantined? It can’t be forever, because that would cause memory use to grow indefinitely. MiraclePtr solves this by ref counting on construction and destruction. Once the ref count drops to 0, the memory is allowed to be reused. Unfortunately, this limits the range of its uses.
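To make that concrete, here is roughly how I picture the scheme (a hedged sketch only: the names are invented, and Chromium’s actual raw_ptr/BackupRefPtr keeps the count in PartitionAlloc slot metadata rather than in a side struct):

```cpp
#include <cstddef>

struct SlotMetadata {
    std::size_t ref_count = 0;  // how many guarded pointers reference this slot
    bool freed = false;         // the owner has already called delete/free
};

template <typename T>
class guarded_ptr {  // hypothetical stand-in for Chromium's raw_ptr<T>
 public:
    guarded_ptr(T* p, SlotMetadata* m) : ptr_(p), meta_(m) { acquire(); }
    guarded_ptr(const guarded_ptr& o) : ptr_(o.ptr_), meta_(o.meta_) { acquire(); }
    guarded_ptr& operator=(const guarded_ptr&) = delete;  // omitted to keep the sketch short
    ~guarded_ptr() { release(); }

    T* operator->() const { return ptr_; }

 private:
    void acquire() { if (meta_) ++meta_->ref_count; }
    void release() {
        // The allocator only reuses the slot once it is both freed and
        // unreferenced; until then the freed memory sits in quarantine.
        if (meta_ && --meta_->ref_count == 0 && meta_->freed) {
            /* return the slot to the allocator here */
        }
    }

    // memcpy'ing an object that contains a guarded_ptr copies these two words
    // without running acquire(), which is exactly the loophole described above.
    T* ptr_;
    SlotMetadata* meta_;
};
```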
As far as I can tell, CHERI solves this by quarantining freed memory and then, when the amount of quarantined memory passes a heuristic threshold, running a scan over the entire process to see if any capabilities still reference the quarantined memory. Any that do are marked invalid, and the memory is then properly freed. This is not unlike a garbage collection, and it comes with the issues that plague transparent garbage collection. On the other hand, this approach makes CHERI 100% effective against UAF exploits. @david_chisnall, please correct me if I’m wrong in the previous summary.
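For what it’s worth, my mental model of that quarantine-and-sweep loop is something like the sketch below (purely illustrative; the threshold, names, and helper functions are invented, not CHERI’s actual machinery):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-ins for the allocator/OS machinery; not real APIs.
void scan_process_and_invalidate_capabilities_into(
    const std::vector<std::pair<void*, std::size_t>>& quarantined);
void return_to_allocator(void* p);

struct Quarantine {
    std::vector<std::pair<void*, std::size_t>> slots;  // freed but not yet reusable
    std::size_t bytes = 0;
    std::size_t threshold = 16 * 1024 * 1024;           // heuristic trigger (invented)

    void on_free(void* p, std::size_t size) {
        slots.emplace_back(p, size);
        bytes += size;
        if (bytes >= threshold) revoke();
    }

    void revoke() {
        // Walk every capability-bearing word in the process; any capability
        // that points into a quarantined slot gets its tag cleared, so a later
        // dereference traps. Tag bits make the scan precise: plain integers
        // that merely look like addresses are ignored.
        scan_process_and_invalidate_capabilities_into(slots);
        for (auto& slot : slots) return_to_allocator(slot.first);  // now safe to reuse
        slots.clear();
        bytes = 0;
    }
};
```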
I don’t think garbage collection is an acceptable trade-off for many C++ use cases. If a solution like MiraclePtr is going to be developed further, I think it will require a subset of C++ that disallows pointer variables in any context that allows memcpy() through a char* alias, enforced by a linter or the compiler. In the abstract, this seems like a good direction in which to move. Allowing the programmer a side channel to alias and copy data makes code harder to statically analyze for other reasons as well.
There are others working on more comprehensive reference models for a memory-safe subset of C++, but my hunch is that they will be met with apathy if they require programming C++ in a much different way, even if they require fewer runtime checks. My guess is that the solution that will resonate most with the wider C++ community is one that minimizes the restrictions in the memory-safe subset at the cost of a small amount of additional runtime overhead, and one that can be turned on or off at will without changing any code: something like AddressSanitizer, but much more efficient thanks to the subset restrictions.
> As far as I can tell, CHERI solves this by quarantining freed memory and then when the amount of quarantined memory passes a heuristic threshold, running a scan throughout the entire process to see if any capabilities still reference the quarantined memory. If any do, they are marked invalid and the memory is properly freed. This is not unlike a garbage collection and it comes with all the issues that plague transparent garbage collections. On the other hand, this approach makes CHERI 100% effective against UAF exploits. @david_chisnall, please correct me if I’m wrong in the previous summary.
That’s basically right. We have a paper coming out at ASPLOS describing how we make it low latency. On Morello, there’s a generation bit in the page table and in a per-CPU register that lets us take a trap on the first load from every page, so that we can scan that page on demand and (future work) pages easily reachable from the loaded address. We also scan in the background, so most loads don’t trap. This ensures that you can’t load pointers that were dangling at the start of a revocation epoch into the register file, so we never need to scan a page more than once. My team did a lot of work on how to make this fast on 128-core machines with >1 TiB of RAM, using various hardware offloads.
The abstract model is the same as Microsoft’s Project Snowflake, which added manual memory management to .NET. The guarantee that you get is that a dangling pointer will either point to the old object or will trap when you access it. This makes most UAFs non-exploitable, but there’s one case that still worries me slightly: if you have run a destructor or equivalent, the object will be in an undefined state, so using it after it’s been destroyed may give the attacker some power. We have not yet found any examples where this is a useful attack primitive, but I suspect that is mostly because there’s a lot of lower-hanging fruit on existing systems.
On CHERIoT, we put the revocation bitmap (the thing that marks memory as freed) in hardware. We check this when we load pointers and so we guarantee that you cannot load a dangling pointer. We also do the revocation scan entirely in hardware. This approach works because our RAM is very small and fast.
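In software terms, the load-side check behaves roughly like the model below. This is only an illustration of the behaviour (the real check is done in hardware on the capability load path, and the granule size and data layout here are assumptions):

```cpp
#include <cstdint>

// One revocation bit per heap granule, set by free(). Granule size and the
// word-addressed bitmap layout here are assumptions for the sake of the example.
extern std::uint64_t revocation_bitmap[];
constexpr std::uint32_t kGranuleShift = 3;  // 8-byte granules (assumed)

struct Capability {
    std::uint32_t base;  // start address of the allocation the capability covers
    bool tag;            // valid bit; once cleared, the capability cannot be used
};

// Model of what happens on every capability load: if the word being loaded is
// a valid capability whose base lies in freed memory, strip the tag so the
// dangling pointer can never reach the register file intact.
inline Capability load_capability(Capability in_memory) {
    std::uint32_t granule = in_memory.base >> kGranuleShift;
    bool freed = (revocation_bitmap[granule / 64] >> (granule % 64)) & 1u;
    if (in_memory.tag && freed) {
        in_memory.tag = false;
    }
    return in_memory;
}
```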
For larger systems, we’ve looked at composing with MTE. CHERI adds spatial safety and an authorisation model, so you can’t recolour memory without a capability that explicitly authorises this (rather than using a permission bit, we reserve one colour as allocator-owned). This removes the need for adjacent allocations to be differently coloured and removes the need for colours to be unpredictable. It also lets us get deterministic UAF protection and do the revocation sweeps 1/15th as often for the same memory overhead. Oh, and it lets us store freelist data in band, which is nice. In this model, you allocate memory with colour 15, count down until it reaches 0, then stick objects in a quarantine list.

We can also have a mode that is easier for the microarchitecture to implement, where we trap on loads with colour mismatches (which is cheap because tags and data flow together in the cache and the load can’t retire until the data arrives in the cache) and silently ignore mismatches on store (we pass the colour to the L1 cache and it discards the result after loading the tags). This keeps things in the store-forwarding queue a bit longer but has little other impact. With a good microarchitecture, we estimate this will have a 1-2% overhead relative to the 2% that a good CHERI implementation should be able to provide.
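A rough sketch of the allocation-side countdown as I’ve described it (the data layout and allocator hooks here are invented for illustration):

```cpp
#include <cstdint>

constexpr std::uint8_t kAllocatorColour = 0;   // reserved colour: memory owned by the allocator
constexpr std::uint8_t kFreshColour     = 15;  // colour of a slot's first incarnation

// Stand-ins for allocator/hardware hooks; not real APIs.
void set_memory_colour(void* p, std::uint8_t colour);
void push_to_quarantine(void* p);

struct Slot {
    void* memory;
    std::uint8_t next_colour = kFreshColour;
};

void* slot_allocate(Slot& slot) {
    // Hand the memory out with the slot's current colour. Pointers into any
    // previous incarnation carry a higher colour, so a use-after-free is a
    // deterministic colour mismatch rather than a probabilistic one.
    set_memory_colour(slot.memory, slot.next_colour);
    return slot.memory;
}

void slot_free(Slot& slot) {
    set_memory_colour(slot.memory, kAllocatorColour);  // dangling accesses now mismatch
    if (--slot.next_colour == kAllocatorColour) {
        // All fifteen colours are used up; the slot can only be recycled after
        // a revocation sweep, which is why sweeps are needed roughly 1/15th as
        // often for the same memory overhead.
        push_to_quarantine(slot.memory);
    }
}
```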
If the revocation runs asynchronously in the background, does that mean there is a (short?) window of time where UAF is still possible? Or would that trap due to e.g. the aforementioned generation count on Morello?
That traps. There’s a (very) short rendezvous phase where every running thread that shares a memory map has the revocation bit toggled for the core it’s running on. The trap happens if you load a tagged capability via a page where the PTE bit does not match the MSR bit[1]. After the rendezvous, every core’s MSR bit will have been toggled, but no PTE bits will. This means any load of a pointer will trap[2]. When this happens, we take a trap into the kernel, scan the page, clear the tag on any pointers to freed memory, and then update the PTE bit and resume. At the same time, a background kernel thread walks the address map[3], invalidates any dangling pointers, and updates PTEs.
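In pseudocode, the synchronous half of that looks roughly like the following (structure and helper names are invented; the real logic lives in the kernel’s fault handler):

```cpp
struct PageTableEntry {
    bool revocation_generation;  // the per-page generation bit
};

struct Cpu {
    bool revocation_generation;  // the per-CPU (MSR) generation bit
};

struct Capability {
    void* target;  // address the capability points at
    bool tag;      // valid bit; clearing it revokes the capability
};

// Stand-ins for kernel machinery; not real APIs.
bool freed_before_current_epoch(void* target);
Capability* capabilities_in_page(void* page, unsigned* count);

// Entered when a tagged capability is loaded through a page whose PTE bit
// does not match the current CPU's generation bit.
void on_capability_generation_fault(Cpu& cpu, PageTableEntry& pte, void* page) {
    unsigned n = 0;
    Capability* caps = capabilities_in_page(page, &n);
    for (unsigned i = 0; i < n; i++) {
        // Invalidate every capability on this page that still refers to memory
        // freed before the current revocation epoch started.
        if (caps[i].tag && freed_before_current_epoch(caps[i].target)) {
            caps[i].tag = false;
        }
    }
    pte.revocation_generation = cpu.revocation_generation;  // page is now clean
    // Returning from the trap restarts the faulting load, which now succeeds.
}
```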
The thing I forgot to mention in my first post: this shares a lot of implementation techniques with concurrent GC, but revocation and GC are logical duals. GC ensures that objects remain live (at least) until the last pointer to them goes away; revocation ensures that pointers do not outlive objects. Revocation starts with knowing which objects have been freed; GC has to discover this. GC deletes objects once they’re no longer reachable; revocation deletes pointers once their objects have been deallocated. This means that revocation does not have to do a graph walk: it can do a much simpler scan of all live pages that have had pointers stored to them.
[1] Future work will explore how much we can relax this. I think we can actually get away with a lot more spurious traps without hurting performance.
[2] This is also not ideal; we should probably run the revocation at least on the top page of the stack and on pages directly reachable from the register file when we do the initial rendezvous, because scanning one page is a lot cheaper than taking a fault.
[3] Currently, it does this in address order. One of the things I’d like us to add is a small queue of high-probability pages found in the synchronous trap that it should try scanning out of order.