I worked with Saar and Nico on this; happy to answer any CHERI-related questions.
IBM has been using the ECC syndrome to sneak in a memory tag bit for every 16 bytes of RAM for decades, with no overhead (see my article here). Have you given any thought to using ECC in this way?
Outstanding work, by the way.
We have considered it. There are three downsides:
The last is my main reason for not liking that approach. With the temporal safety work, for example, it’s useful to be able to skim quickly past cache lines that don’t contain tags. You can do this if the lines are not yet in cache and you have a hierarchical tag cache design that stores the tags off to one side and can pull in the tags for an entire page (or at least half a page) in a single DRAM read. You can then prefetch only the lines that contain capabilities. Similarly, in the Morello mode that we use for a read-pointer barrier in concurrent revocation, if the tag cache can quickly reply with ‘no tag here’ even while waiting for the data from DRAM, then you can potentially move a load further along the pipeline: you don’t know what the data is, but you know it won’t trap (unless you have ECC with precise traps turned on).
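A rough software model of the skimming idea, assuming 16-byte tag granules (one 128-bit capability each), 64-byte cache lines and 4 KiB pages. The names `lines_with_tags` and `scan_page` are hypothetical, and the real mechanism lives in the memory hierarchy rather than in code like this; the sketch only shows why having a whole page's tag bits in one read lets a scanner skip untagged pages outright and fetch just the lines that hold capabilities.

```python
GRANULE = 16   # bytes covered by one tag bit (one 128-bit capability)
LINE = 64      # assumed cache-line size in bytes
PAGE = 4096    # assumed page size in bytes
GRANULES_PER_LINE = LINE // GRANULE  # 4 tag bits per line
LINES_PER_PAGE = PAGE // LINE        # 64 lines per page


def lines_with_tags(page_tags: int) -> list[int]:
    """page_tags holds a page's 256 tag bits (bit i = granule i).
    Return the indices of cache lines containing at least one tagged granule."""
    mask = (1 << GRANULES_PER_LINE) - 1
    return [line for line in range(LINES_PER_PAGE)
            if (page_tags >> (line * GRANULES_PER_LINE)) & mask]


def scan_page(page_tags: int, visit_line) -> int:
    """Visit only the lines that may hold capabilities; return how many."""
    if page_tags == 0:
        return 0  # no tags anywhere in the page: skip it without touching data
    hits = lines_with_tags(page_tags)
    for line in hits:
        visit_line(line)  # stand-in for prefetching and scanning that line
    return len(hits)


# Tags in granules 0, 5 and 255 -> only lines 0, 1 and 63 need fetching.
print(lines_with_tags((1 << 0) | (1 << 5) | (1 << 255)))  # [0, 1, 63]
```

In a typical page with few or no capabilities, almost every line is skipped, which is where the temporal-safety sweep saves its DRAM traffic.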
ECC bits are also not free: you’re still consuming die area and power for them. We’ve found that roughly 80% of pages in a pure-capability system have no tag bits set. That’s 32 bytes of ECC bits per 4 KiB page (one bit per 16-byte granule) that you’d need to power, even for pages that aren’t using tags. With a hierarchical tag cache, you can avoid allocating tag storage space entirely for pages (or some other granule) that don’t store capabilities (or MTE colours, or anything else that wants to use physically indexed metadata).
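To make the storage argument concrete, here is a toy two-level tag store, assuming 4 KiB pages and one tag bit per 16-byte granule (hence 32 bytes of tag bits per page). `HierarchicalTags` is a hypothetical name and this is a software analogy for the hardware design, not a description of it: the top level records which pages hold any tags, and a leaf bitmap is allocated only for those pages, so lookups in untagged pages are answered from the top level alone.

```python
PAGE = 4096                          # assumed page size in bytes
GRANULE = 16                         # bytes per tag bit
LEAF_BYTES = PAGE // GRANULE // 8    # 32 bytes of tag bits per page


class HierarchicalTags:
    """Toy two-level tag store. The top level (dict keys) records which pages
    hold any tags; a 32-byte leaf bitmap exists only for those pages."""

    def __init__(self) -> None:
        self.leaves: dict[int, bytearray] = {}  # page number -> tag bitmap

    def _locate(self, addr: int) -> tuple[int, int, int]:
        page, offset = divmod(addr, PAGE)
        byte, bit = divmod(offset // GRANULE, 8)
        return page, byte, bit

    def set_tag(self, addr: int, tag: bool) -> None:
        page, byte, bit = self._locate(addr)
        leaf = self.leaves.get(page)
        if leaf is None:
            if not tag:
                return  # clearing a tag on an untagged page: nothing to store
            leaf = self.leaves[page] = bytearray(LEAF_BYTES)
        if tag:
            leaf[byte] |= 1 << bit
        else:
            leaf[byte] &= ~(1 << bit) & 0xFF
            if not any(leaf):
                del self.leaves[page]  # last tag gone: release the leaf

    def get_tag(self, addr: int) -> bool:
        page, byte, bit = self._locate(addr)
        leaf = self.leaves.get(page)
        if leaf is None:
            return False  # answered from the top level alone: 'no tag here'
        return bool((leaf[byte] >> bit) & 1)

    def leaf_bytes(self) -> int:
        return len(self.leaves) * LEAF_BYTES


# 1,000 pages, one in five holding a capability (roughly the 80% figure).
tags = HierarchicalTags()
for page in range(0, 1000, 5):
    tags.set_tag(page * PAGE + 32, True)
print(tags.leaf_bytes(), 1000 * LEAF_BYTES)  # 6400 32000
```

With 80% of pages untagged, leaf storage drops to a fifth of the flat 32-bytes-per-page cost, plus whatever the top level costs.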
As an aside, IBM’s POWER9 user manual states that its memory controller, which uses standard ECC DIMMs, does “64-byte memory ECC” and supports “correction of up to one symbol in a known location plus up to two unknown symbol errors” (page 186 of this). Reading between the lines, I interpret “in a known location” as a reference to erasure coding as opposed to error correction coding. The idea seems to be that, from an information-theory perspective, recovering a bit whose value you know you don’t know (i.e., a tag bit) is less of an ask than correcting bits when you don’t know which bit might have been corrupted. That said, my understanding of information theory here is nonexistent, so I could be wrong.
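That intuition matches a standard coding-theory result. Assuming (the manual excerpt doesn't say so) that the code is maximum-distance-separable, such as Reed-Solomon, with n symbols of which k are data, a decoder can handle e errors at unknown positions plus f erasures at known positions whenever the inequality below holds. Roughly, an unknown error costs two check symbols (one to find it, one to fix it) while an erasure costs one. Under that assumption, the quoted guarantee would need at least five check symbols per codeword:

```latex
% MDS code (e.g. Reed-Solomon): r = n - k check symbols,
% e = errors at unknown positions, f = erasures at known positions.
2e + f \le r = n - k
% "one symbol in a known location plus up to two unknown symbol errors":
2 \cdot 2 + 1 = 5 \le r
```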
I can’t really see IBM compromising on RAS (reliability, availability, and serviceability), since it’s a particular emphasis of their platform, so they presumably have some way to do this without weakening what ECC offers. I could see it involving larger read sizes, though.
What you write about tag scanning is very interesting, though. Being able to grab a whole page’s worth of tag bits for “GC” purposes certainly sounds like a worthwhile tradeoff.
The ECC scheme is very tightly coupled to the memory technology. Memory ECC schemes are biased towards the failure modes that they expect. In the simple case, bit flips from charged to discharged are more likely than the opposite, but now that memory cells are so small there’s a lot more subtlety in the specifics of individual fabrication techniques. It’s quite possible to design an ECC scheme that is incredibly robust in the presence of random errors but does incredibly badly with one vendor’s memory technology, because that technology’s most common failure mode hits the weakest point in the ECC scheme’s space.
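A toy illustration of that mismatch, using a Hamming(7,4) code: far weaker than real memory ECC, and chosen only because it fits in a few lines. It corrects every single-bit flip, but if a hypothetical memory technology's dominant failure mode flipped two bits in the same codeword, the decoder would "correct" a third bit and silently return the wrong data. The helper names are illustrative only.

```python
from itertools import combinations

DATA_POSITIONS = (3, 5, 6, 7)  # 1-indexed positions that hold the 4 data bits
PARITY_POSITIONS = (1, 2, 4)   # powers of two hold the parity bits


def syndrome(word: list[int]) -> int:
    """XOR of the 1-indexed positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


def encode(nibble: int) -> list[int]:
    """Encode a 4-bit value as a Hamming(7,4) codeword (index 0 is unused)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    s = syndrome(word)
    for p in PARITY_POSITIONS:  # set parity bits so the syndrome becomes 0
        if s & p:
            word[p] = 1
    return word


def decode(word: list[int]) -> int:
    """Assume at most one flipped bit, correct it, and return the data bits."""
    word = word.copy()
    s = syndrome(word)
    if s:
        word[s] ^= 1  # the syndrome names the position of a single-bit error
    return sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))


def flip(word: list[int], *positions: int) -> list[int]:
    """Return a copy of word with the given bit positions inverted."""
    out = word.copy()
    for pos in positions:
        out[pos] ^= 1
    return out


# Random-looking single-bit upsets: always corrected.
single_ok = all(decode(flip(encode(d), p)) == d
                for d in range(16) for p in range(1, 8))

# A failure mode that flips two bits together: always miscorrected.
double_wrong = all(decode(flip(encode(d), a, b)) != d
                   for d in range(16) for a, b in combinations(range(1, 8), 2))

print(single_ok, double_wrong)  # True True
```

Real SECDED and chipkill-style codes are far stronger, but the same principle applies: each code has error patterns it handles badly, and what matters is whether those coincide with the memory's actual failure modes.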
I don’t know anything specific about IBM, but given their mainframe background, I suspect that they tightly couple their memory controller design to a specific memory technology for any given system generation and so can bias their ECC scheme aggressively. I also wouldn’t be surprised if ECC at 64-byte granularity is just the first tier in their memory integrity scheme. I know that on some systems they do RAID-5-like striping for memory; they may also keep some coarser-grained error correction metadata that they can consult on a slow path if ECC reports an uncorrectable error.
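IBM's actual scheme isn't described here, so this is only a sketch of generic RAID-5-style XOR parity across memory channels (ignoring RAID-5's parity rotation), with hypothetical function names and made-up channel contents. It shows why such striping works as a slow-path backstop: if ECC flags one channel's data as uncorrectable, XORing the parity with the surviving channels rebuilds it.

```python
from functools import reduce
from typing import Optional


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_block(channels: list[bytes]) -> bytes:
    """XOR parity stored on an extra channel, as in RAID-5."""
    return reduce(xor_blocks, channels)


def rebuild(channels: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Recover the one channel marked None (say, its ECC reported an
    uncorrectable error) by XORing the parity with the surviving channels."""
    missing = [i for i, c in enumerate(channels) if c is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can rebuild exactly one lost channel")
    survivors = [c for c in channels if c is not None]
    restored = list(channels)
    restored[missing[0]] = reduce(xor_blocks, survivors, parity)
    return restored


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
parity = parity_block(data)
damaged = [data[0], data[1], None, data[3]]
print(rebuild(damaged, parity) == data)  # True
```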
Note that revocation isn’t quite the same as GC; it’s the logical dual. GC guarantees that deallocation doesn’t happen until all pointers have gone away. Revocation ensures that all pointers go away as a result of deallocation. We can do this accurately with CHERI because the tag bit lets us identify pointers precisely.
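A minimal sketch of why tags make revocation precise, under a deliberately simplified model: memory is a list of granules, each a (tag, value) pair, and a set of freed allocation base addresses stands in for the revocation metadata (real implementations use more compact structures, such as a shadow bitmap). `Capability` and `revoke` are hypothetical names, and the model assumes each capability's base is the start of its allocation. The key point is that an untagged integer whose bits happen to equal a freed address is left alone, whereas a conservative scanner without tags would have to treat it as a possible pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    base: int    # start of the allocation this capability was derived from
    length: int


def revoke(memory: list[tuple[bool, object]], freed: set[int]) -> int:
    """Sweep memory granules, clearing the tag of every capability whose base
    is a freed allocation. Returns the number of capabilities revoked."""
    revoked = 0
    for i, (tag, value) in enumerate(memory):
        if not tag:
            continue  # untagged data is never a pointer, whatever its bits say
        if value.base in freed:
            memory[i] = (False, value)  # tag cleared: no longer dereferenceable
            revoked += 1
    return revoked


memory = [
    (True, Capability(0x1000, 64)),   # pointer into the freed allocation
    (False, 0x1000),                  # integer that merely looks like an address
    (True, Capability(0x2000, 32)),   # pointer into a live allocation
]
print(revoke(memory, {0x1000}))  # 1
```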