I’ve recently wondered whether there is merit to the idea of taking the crypto acceleration instructions found in most normal CPUs and turning them into a dedicated co-processor, maybe also with some dedicated RAM. Not really for performance reasons, but rather for isolation; I’m imagining a single, in-order core with basically a single algorithm programmed into it and no bus, clock, or anything else shared with the main CPU. You could then prove that the co-processor always runs at a fixed rate for a given input and know for a fact that nothing else can tamper with its performance (since it’s a single in-order execution thread dedicated to a particular calculation), and it would (hopefully) be easier to control the side-channels one can observe from it. The main processor would not be able to ask the co-processor about its clock rate, load/store latency, power usage, or anything else that could leak info about what it’s doing. Maybe you could even have multiple co-processors to aid throughput, multiplexed by the OS; another process might be able to tell that you’re using a crypto co-processor, but nothing else apart from “started using it” and “stopped using it”.
Then the rest of our programs could go off and use whatever optimizations hardware wants to implement, while the time-sensitive parts get their own dedicated sandbox.
I think an “I’m doing crypto now” mode could be more useful (fixed clocks, fixed memory access times, fixed operation times). It might slow down some algorithms, but it would be guaranteed not to have any timing side-channels (and maybe no power side-channels?). It would be annoying to handle in the kernel though: what do you do if a process puts a core into this mode and gets preempted? Do you keep the mode on? Do you turn it off and turn it back on when the process is scheduled again? Do you only allow the mode to be changed by the kernel, through a syscall?
A lot of these assurances would be provided by executing your crypto with the help of a trusted execution environment or straight up a hardware security module.
Tempting, but may be costly.
If you want something generic capable of implementing many primitives, you may require quite a bit of silicon, even if the pathologically straight-line code we see in crypto does not benefit from out-of-order execution or even a cache hierarchy. You’ll still need sizeable ALUs, efficient multiplication (an array of 64×64→128-bit multipliers would be awfully nice), and a form of SIMD to feed all those units.
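To make the multiplier remark concrete, here is a rough sketch of what a single 64×64→128-bit multiply costs when the hardware only gives you 32×32→64 products. The function name `mul_64x64_to_128` and the `(hi, lo)` return convention are made up for illustration, and Python is used only to show the arithmetic; it says nothing about timing behaviour.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mul_64x64_to_128(a, b):
    """Return (hi, lo), the two 64-bit halves of the 128-bit product a * b.

    Hypothetical helper: emulates a widening multiply with four
    32x32->64 partial products, all in straight-line code (no branches).
    """
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32

    # Four partial products, each fits in 64 bits.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi

    # Middle column: carry out of ll plus the low halves of the cross terms.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)

    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```

That is four multiplies plus a handful of shifts and adds for something a dedicated 64×64→128 unit does in one operation, and bignum arithmetic for prime-field curves is built out of long chains of exactly this step.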
If, however, you want something cheap yet fast, you’ll probably need to settle on some hardware-friendly primitive like Keccak. Asymmetric crypto may still be a problem though, except perhaps if we use a binary-field elliptic curve (about which I’ve heard security is not as settled as it is for prime-field curves). The biggest problem is that this approach is quite inflexible.
Unless the world comes crashing down, I don’t see hardware vendors proposing either alternative. Except for some niches, but then we already have FPGA or even ASIC implementations in specific places.