Hopefully this will start to convince Apple to invest in CHERI, since it moves the narrative from the overhead of CHERI (pointers get twice as big) to something that’s a net win on every dimension!
Pointers get a third smaller.
Pointer checks are an order of magnitude cheaper.
Code size gets smaller.
Pointers are unforgeable (type confusion can’t allow an attacker to inject pointers via I/O).
Pointers carry bounds.
Pointers are distinguishable from integers in memory, and so temporal safety is possible.
The compiler is out of the TCB and so memory safety can be used as a building block for compartmentalising untrusted or mutually distrusting components.
When framed like this, adding a successor to Morello in Apple Silicon is such an obvious win that I want to meet the VP who can convincingly argue against it.
The whole point here is that pointers don’t increase in size in this model.
While I recognize that the code size will be reduced somewhat, I’m not really sold on CHERI: I feel the ABI is expensive, and the checks required seem to unavoidably burn resources (silicon, predictor state, I guess retire buffers, etc.). On the other hand, things like pointer auth, MTE, etc. all burn resources as well - I especially love the “low level” programmers insisting on not having bounds or type safety for performance, while ignoring the cost of all the things trying to make the system secure in the face of the inevitable errors :-/
The whole point here is that pointers don’t increase in size in this model.
I misunderstood, there was a picture of a pointer expanded to base, top, and address. I guess that’s only within a region, not across API boundaries?
While I recognize that the code size will be reduced somewhat, I’m not really sold on CHERI: I feel the ABI is expensive, and the checks required seem to unavoidably burn resources (silicon, predictor state, I guess retire buffers, etc.)
Silicon is a bit broad, but the area in the core for CHERI is largely the extra size of the rename registers. You need a couple of adders and some comparators in the load-store unit, but that’s basically noise for a modern architecture. I’m not sure what you mean by predictor state; you can just predict addresses from the branch predictor, though for side-channel mitigation you might want to do a bit more. Morello’s performance suffers because it doesn’t predict on any operation that can change the bounds: they didn’t want to change the floorplan of the N1, and widening the path from the branch predictor to the instruction fetch unit was impossible. There is no change to retire buffers except for some of the experimental stuff for temporal safety. The other big source of overhead on Morello is that they didn’t widen the store queues, so a store-pair of capabilities consumes two store-queue entries. Widening the queues is usually much cheaper than lengthening them (control-logic complexity scales with length; widening the entries is just more flip-flops).
On modern superscalar architectures, register rename is the most annoying component because it is complex and must be powered any time that instructions are executing. CHERI reduces register rename pressure relative to most software mitigations (bounds and permissions are carried with addresses in registers) and so may cause a power reduction in some workloads.
On the other hand, things like pointer auth, MTE, etc. all burn resources as well
Pointer auth requires a lot more logic to implement the checks than CHERI does. MTE requires a lot more invasive changes to the pipeline than CHERI does. MTE also increases the memory subsystem complexity by more than CHERI, though ideally we’d have both.
If you compose CHERI and MTE as proposed by my team, you get some nice properties. MTE doesn’t need to enforce spatial safety, and MTE colour values are not secrets (it’s trivial to leak MTE tags via speculative side channels, so MTE provides no protection against an attacker who can inject pointers and has access to a single Spectre gadget). This means that your allocator can assign MTE tags predictably.
You can also avoid a lot of the overheads of MTE (when combined with CHERI) by having a mode where loads trap and stores fizzle. Loads trapping slightly increases retire-buffer size, but you can’t fully retire a load until the data is available anyway, so this is negligible. When you store, you send the tags through the store queue and simply discard the store once you have loaded the tags from memory (this benefits a lot from the Arm memory model). You can’t do that without CHERI, because fizzling stores prevents MTE from catching large linear overflows that go past the next object into another one; but linear overflows are a local property, and so using a global mechanism to check them is a massive waste of your power and complexity budget.
Microarchitects building high-performance chips that I’ve talked to believe that the total overhead of MTE in this model would be around 2%. From the perspective of the C abstract machine, loading from a deallocated object traps; storing to one stores to the old object, and you have no mechanism to load that result, so you can’t observe that the store didn’t occur. This would let us reduce the frequency of the revocation sweeps for temporal safety by a factor of 15, which should make the total cost of complete spatial and temporal safety 5-10% for most workloads. You’re paying more than that for weak probabilistic mitigations today. For a lot of short-lived programs (e.g. clang), this means that the interval between revocation sweeps would be longer than the run time of the program, and so you’d be able to just use MTE for temporal safety.
I especially love the “low level” programmers insisting on not having bounds or type safety for performance, while ignoring the cost of all the things trying to make the system secure in the face of the inevitable errors :-/
I’d take that further. Systems programmers often build complex things to work around the fact that they don’t have memory safety. The dances that iOS does with process isolation and (buggy, insecure) serialisation and deserialisation routines are a good example of this. They cost orders of magnitude more than doing the same thing with CHERI.
With CHERIoT, we’ve shown that you can build interesting things if the lowest-level code in the system can rely on memory safety. The TCB for confidentiality and availability in our system is <350 instructions (i.e. smaller than the unverified bit of seL4, and we have some folks starting to formally verify it). The sealing mechanism means that the scheduler never has access to the state of the threads that it’s managing, so is in the TCB for availability but not confidentiality or integrity. The (shared) heap allocator is in the TCB for confidentiality and integrity for heap objects. The overhead of adding a compartment is on the order of 40 bytes, so adding compartments is cheap.
The public demo that we did a few weeks ago connected to the Azure IoT hub, fetched some JavaScript bytecode, and executed it, with the TCP/IP, TLS, and MQTT libraries all in separate compartments and the JavaScript running in its own compartment. Everything had complete, non-bypassable spatial and temporal memory safety, in 256 KiB of SRAM.
You can’t build something like this if memory safety is a best-effort thing enforced on cooperating code. Personally, I consider the fact that we deterministically mitigate around 70% of vulnerabilities in C/C++ code a nice side benefit. The thing I really care about is that we provide a path to using C/C++ libraries in safe-language programs without bugs in those libraries being able to compromise the high-level abstract machines. Web browsers would have evolved very differently if we’d had CHERI in the ’90s and things like ActiveX could have been made secure with minimal overhead.
I misunderstood, there was a picture of a pointer expanded to base, top, and address. I guess that’s only within a region, not across API boundaries?
Correct - there are explicit annotations to get wide pointers (search indexable, bidi_indexable). Things that aren’t exposed to the ABI (locals, etc) are logically wide pointers.
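To illustrate (a sketch, assuming the attribute spellings from the RFC: __bidi_indexable carries base, upper bound, and current address, while __indexable can only be indexed upwards from its address):
// Hypothetical example: an explicitly wide pointer in a function's ABI;
// locals get wide-pointer semantics without any annotation.
void consume(int * __bidi_indexable p); /* p carries bounds in the ABI */
void example(void) {
    int buf[8];
    int *cursor = buf; /* a local: logically wide, no annotation needed */
    consume(cursor);   /* the bounds travel with the argument */
}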
As for your CHERI comments: I don’t do silicon design, but I’ll assume your position is correct in terms of design complexity.
What has been your experience compiling existing C/C++ projects? How much code churn has been required to deal with the change in pointer size? My experience implementing pointer auth has made me somewhat cynical about how such code treats pointers and data as interchangeable.
What has been your experience compiling existing C/C++ projects? How much code churn has been required to deal with the change in pointer size?
I did a load of work to minimise this and then Alex and Jess did more. In particular, the semantics of things like intptr_t and unions of those and pointers just work as you’d expect (which is not the case with pointer signing). The porting difficulty changes with how low level the code is. For application code, it’s typically under 0.02% of lines. For systems code (kernels, language runtimes), it’s around 0.2%. For JIT compilers it’s around 2%.
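As a sketch of the kind of code that keeps working (hypothetical snippet, not from a real port): round-tripping a pointer through intptr_t, or through a union of the two, preserves the capability’s validity tag under CHERI:
#include <stdint.h>
union slot {
    void *ptr;
    intptr_t bits;
};
void *tag_roundtrip(void *p) {
    union slot s;
    s.ptr = p;
    s.bits |= 1;            /* stash a flag in the low bit */
    s.bits &= ~(intptr_t)1; /* clear it again */
    return s.ptr;           /* on CHERI, still a valid capability */
}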
On Morello, we’re able to run FreeBSD (kernel and userland), Weston (including GPU drivers), KDE, and a load of applications, all in pure-capability mode (every pointer is a CHERI capability). A lot of the KDE libraries and applications just worked with no source-code modification. A lot of the changes were easy to upstream because they are just using the correct C types for things. The Chromium port is nearly finished (WebKit was ported with two versions of JavaScriptCore, one that used capabilities for JavaScript pointers and one that used integer offsets within the heap: this was the 2% change).
On CHERIoT, we run the TPM reference stack, mBedTLS, the FreeRTOS network stack and MQTT library, and the Microvium JavaScript interpreter. None of these required any changes to get full spatial and temporal memory safety. We did need a few small tweaks to make some things run on RISC-V (per-architecture defines) and a few more to compartmentalise some of the components.
To add to that, the CHERI port of FreeBSD currently ships around 8,000 packages with the pure-cap ABI and 20,000 with the default AArch64 ABI. This is out of a total of 30,000 that build on x86. Getting to this level was a few (under 10) person years of work. A lot of the missing ones are blocked on dependencies and often fixing one then causes a few hundred to build. I believe we’ll have approximate parity with AArch64 on Morello in a year with the current investment in people. The Chromium port is being done by a single person. He’s spent less than six months on it and (I think) has everything except v8 finished (I expect v8 to be the most work).
If someone decided to ship CHERI hardware for a *NIX ecosystem today then it is quite feasible to get everything ported to the new ABI before the silicon ships. The difficulty for Apple would be squeezing the third party ecosystem into the gap between announcing and shipping the hardware, but I think that’s feasible. When you need someone to lead that effort across your hardware/software teams, your recruiters have my contact details…
Can’t we just use existing syntax? Something similar to
void foo(int ptr[static N], size_t N);
Perhaps replace static with some other keyword or combination of keywords.
That’s invalid under current C syntax, as N has to be declared prior to use - this issue is specifically addressed in the RFC:
array syntax with a specified bound is supported, e.g. void foo(size_t N, int ptr[N]); works as expected and is equivalent to int *__counted_by(N) ptr
the need for out-of-lexical-order access to bounds is recognized and supported (search for lazy parsing; see the sketch below)
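For example, a sketch of what that allows (hypothetical declaration, assuming the lazy-parsing behaviour the RFC describes):
// The bound expression references len before it is declared; the RFC
// handles this by parsing __counted_by arguments lazily.
void read_into(char * __counted_by(len) buf, size_t len);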
But it’s also important to recognize that that syntax is insufficient. It doesn’t support anything other than referencing another value by name, e.g. it can’t do:
void foo(int N, int M, int buffer[N * M])
void foo(struct SomeThing S, int buffer[S.count])
or any other non-trivial expressions that are necessary to accurately reflect real world APIs.
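The attributes can express those directly - a sketch (assuming, per the RFC, that the count may be a more or less arbitrary expression):
// Hypothetical equivalents of the examples above, using the RFC's
// attribute spelling rather than array syntax:
void foo(int N, int M, int * __counted_by(N * M) buffer);
void bar(struct SomeThing S, int * __counted_by(S.count) buffer);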
That syntax also only works for parameters, and cannot express bounds of return values, fields, globals, locals, etc.
This RFC is a single system that applies consistently to all pointers in all cases and, most importantly, handles all the cases we see in real-world C APIs.
I understand the current limitations on the semantics. In order to make it widely adoptable, it would be better to adopt the currently available syntax with a new keyword, instead of introducing a new syntax. Something like
void foo(size_t N, int ptr[__Bound N]);
fits in the current syntax.
Why are we adding __Bound here? It’s unnecessary.
If you have
void foo(size_t N, int ptr[N]);
It does not need an additional annotation; that just works as you would expect. The only time you need to explicitly use these attributes is for the cases where that syntax does not work.
For example, you can’t make
void foo(size_t N, size_t M, int ptr[N*M])
work with the existing syntax, because you’d break existing compilers in a way that can’t easily be opted out of via a macro.
The same happens with the gcc extension to pre-declare parameters:
void foo(size_t N; int ptr[N], size_t N)
You can’t easily make that work via macros in compilers that don’t support gcc’s extension.
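By contrast, the attribute spelling can be compiled away on toolchains that don’t support it - a sketch (the feature-test name here is assumed for illustration, not taken from the RFC):
// Hypothetical portability shim: on compilers without the extension the
// annotation expands to nothing, and the declaration remains plain C.
#ifndef __has_feature
#define __has_feature(x) 0
#endif
#if !__has_feature(bounds_safety) /* feature name assumed */
#define __counted_by(expr)
#endif
void foo(size_t N, size_t M, int * __counted_by(N * M) ptr);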
But you’re also focusing on just parameters, whereas this set of attributes applies to the type and can be placed more or less anywhere a type is used - for example, the fairly critical case of a return value:
void* __sized_by(N) malloc(size_t N);
or fields in structs, where
struct S {
int * __counted_by(10) buffer;
};
is not the same as
struct S {
int buffer[10];
};
and other things just aren’t representable:
struct S {
int N;
T* buffer __counted_by(N);
};
Arguably it shouldn’t be a stretch to make flexible struct arrays a real syntax, e.g.
struct ArrayThing {
int N;
Thing things[N]; // Where currently you would just have []
};
But that kind of change still wouldn’t allow
struct Matrix {
int width;
int height;
float values[width * height]; // or an out-of-band pointer, etc.
};
This RFC is designed to be applicable to existing C codebases and, more importantly, the existing C ABI, so it has to be general and support the many absurd things that happen in existing code; the bounding expression is therefore more or less arbitrary, and can reference members, etc.
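So the Matrix example above becomes expressible - a sketch using the RFC’s attribute on a member whose bound references other members:
struct Matrix {
    int width;
    int height;
    float * __counted_by(width * height) values; /* bound refers to members */
};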
Cool! This looks helpful and pretty lightweight to use.
They talk about inserting checks when you dereference an annotated pointer, and when you initialize one. (I might have missed this but) does it do the same checks when calling a function with annotated parameters? Is that implied because it’s a form of initialization?
I’m not sure what you’re asking, but if you’re asking what happens with
// using this syntax because people keep suggesting it instead of the annotations,
// but the RFC interprets it in the obvious way
int f(int N, int buffer[N]);
int g(int M, int buffer[M], int * __indexable widePointer) {
f(M, buffer); // in principle this will emit a check that M <= M before calling
f(5, buffer); // would check 5 <= M
f(5, widePointer); // would check that widePointer.end - widePointer.start >= 5
// ...
}
Basically, no function can check that the bounds and values passed to it are correct, only that it itself never exceeds those bounds, hence the caller is necessarily responsible for checking.
Ah ok: I was wondering whether it’s the compiler, or the programmer, who’s responsible for putting in those checks at the call site. And the answer is: the compiler.
The compiler checks both sides of f’s interface:
the callee can’t access out-of-bounds elements
the caller can’t pass too-long sizes or too-short arrays
Thanks for the example!
In the interests of disclosure, while I’m not the author of this RFC, I do work with them. My comments are my own.
Cool. Is Apple funding this?
It’s been done by Apple :D
Towards the end of the document there’s a section on the proposed upstreaming.