This reminds me of a comment by Cass Everitt, one of the authors of the original azdo presentation:
> More parts are likely to become programmable over time, but for both power and performance, there’s also a desire to consolidate common idioms back into fixed-function hardware. The trend isn’t monotonic.
I will add: while it’s true that doing things in hardware profits mainly by eliminating dispatch, it is important not to minimise that. Dispatch is probably second only to memory as the primary bottleneck for most programs.
It’s definitely not monotonic, historically it’s been a pendulum:
1. Thing X becomes possible in software but too slow.
2. Dedicated hardware appears that makes X fast.
3. General-purpose cores become fast enough that you can do X in software, with more flexibility.
4. Thing X becomes a large fraction of the total work done on a CPU.
5. Dedicated hardware appears to do X more efficiently.
This often then repeats from step 3. A few variations include ‘thing X is superseded by thing Y, dedicated hardware goes away’ and ‘thing X is sufficiently similar to things Y and Z that you can have instructions that accelerate all three and are cheaper than the dedicated hardware for X’. Crypto is a particularly interesting case. Most modern ISAs have instructions for AES not because they’re particularly fast (vectorised software AES is already very fast) but because they are not vulnerable to timing side channels and it’s hard to write a software implementation with this property (at least, a fast one).
There’s one other reason for doing things in hardware: atomicity. CHERI provides bounds checks, which are simply a case of removing dispatch overhead (every load / store does bounds checks, no extra instructions, no extra rename registers, just some ALUs off to the side in the load-store pipeline). You can do that part purely in software (Apple’s Firebloom does). It also gives pointer provenance, which requires associating a metadata bit with the data to differentiate between capabilities (pointers) and other data. You can do that part fairly easily in software for a single-threaded environment. Once you add multiple threads, it’s hard. You need to ensure that every data store atomically zeroes the tag bit and any capability store propagates the tag bit from the in-register value. This is something that’s very difficult to assemble from simpler building blocks.
Trust is another axis. Everything above the hardware trusts the hardware. Some of it doesn’t trust anything else. You can use software fault isolation to build a process-like abstraction but that puts (at least part of) the compiler in your TCB and requires that everything that you run be built with a trusted compiler. Or you can provide an MMU in the hardware. The MMU is a bit faster than SFI, but not that much (a few percent), and it’s not clear what a fair comparison of an MMU versus a CPU with no MMU running SFI code would look like (with the die area used for the TLB and page-table walker given over to extra execution units) and which would win. Providing a simple security mechanism (with configurable policies) in the hardware and being able to trust that it is unconditionally enforced irrespective of what software is running is very useful.