Morello was an amazing piece of engineering. The Arm folks took an existing performance-optimised core (Neoverse N1) and, with an unbelievably short timeline, retrofitted CHERI to it. This had a couple of unfortunate effects:
- There were some things that could not be changed without affecting the floorplan of the N1.
- They didn’t do the normal step of performance tuning where they run some representative workloads and tune queue sizes and so on.
This report shows their analysis of where the perf overhead on Morello comes from. Most of it is from two places. The branch predictor doesn’t predict new bounds (doing so would have required widening the path from the branch predictor to the fetch unit, which affected the floorplan) and so capability jumps are expensive. Store-pair instructions are split into two store-queue entries when they’re capabilities, which means that you often stall on the store queue filling up when spilling registers on recursive calls. This is easy to fix by widening the queue entries so that each holds 32 bytes of data rather than 16.
The real take-home here is that Arm’s architects believe that a 2-3% perf overhead from CHERI is attainable with a first- or second-generation production-quality implementation.