1. 4

Here is a summarized version of the abstract.

Main memory is a leading hardware cause of machine crashes in today’s datacenters. Designing, evaluating and modeling systems that are resilient against memory errors requires a good understanding of DRAM errors in the field. The few studies of DRAM errors in production systems have been too limited in data set size or granularity to answer many open questions on DRAM errors, e.g. the prevalence of soft errors compared to hard errors, or analysis of hard error patterns.
In this paper, we study data on DRAM errors collected on a diverse range of production systems covering nearly 300 terabyte-years of main memory. We provide a detailed analytical study of DRAM error characteristics, including hard and soft errors. We identify many promising directions for designing more resilient systems and evaluate the potential of different protection mechanisms in the light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.


  2. 1

    I upvoted this because many people still haven’t learned its lessons. If you read up on the topic, cosmic rays flipping single bits are pretty much the only thing that comes up in most evaluations of this risk.

    “The most aggressive policies (retiring a page immediately after just one error, or retiring whole rows and columns) are able to avoid up to 96% of all errors.”

    Obviously, anything that avoids 96% of all errors without a serious penalty is a rare find. Let’s use or build on it!
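
    To make the retirement idea concrete, here is a rough sketch of how one might replay an error log against a "retire after N errors" policy. It is not the paper's methodology: the function name, the 4 KiB page size, and the toy trace are all assumptions for illustration, and it only counts repeat errors on a page that has already been retired (the hard-error case). One-off soft errors would not be helped by this.

```python
from collections import Counter

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the paper


def simulate_page_retirement(error_addresses, threshold=1):
    """Replay a stream of error addresses against a simple retirement policy.

    A page is retired once it has logged `threshold` errors. Any later error
    that would have landed on a retired page is counted as avoided.
    Returns (errors_observed, errors_avoided, retired_pages).
    """
    errors_per_page = Counter()
    retired = set()
    observed = 0
    avoided = 0
    for addr in error_addresses:
        page = addr // PAGE_SIZE
        if page in retired:
            # The page is no longer in use, so this error never happens.
            avoided += 1
            continue
        observed += 1
        errors_per_page[page] += 1
        if errors_per_page[page] >= threshold:
            retired.add(page)
    return observed, avoided, retired


# Hypothetical trace: one page that keeps failing, plus a few others.
trace = [0x1000, 0x1008, 0x1010, 0x5000, 0x1020, 0x9000, 0x5004]
print(simulate_page_retirement(trace, threshold=1))
print(simulate_page_retirement(trace, threshold=2))
```

    With a trace like this, the aggressive policy (threshold of 1) retires more pages but sees fewer errors, which is the trade-off the paper evaluates on real data.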

    I’ll chime in to add that approaches like NonStop still address issues like this well. They treat the internal components as black boxes that should produce the same output for a given input. They do voting. If a component becomes defective, there are automated fixes that can get it back in shape. They also have redundancy and snapshots. Such designs knock out lots of problems people would never see coming.
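
    The black-box-plus-voting idea can be sketched in a few lines. This is a generic majority vote over redundant replicas, not how NonStop is actually implemented; `vote`, `run_redundant`, and the toy replicas are made-up names, and the sketch assumes outputs are hashable and directly comparable.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_dissenting_replicas).

    Raises RuntimeError when no value is produced by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


def run_redundant(replicas, x):
    """Feed the same input to every replica and vote on the results."""
    outputs = [replica(x) for replica in replicas]
    return vote(outputs)


# Hypothetical replicas: two healthy components and one faulty one.
healthy = lambda x: x * 2
faulty = lambda x: x * 2 + 1
result, suspects = run_redundant([healthy, healthy, faulty], 21)
print(result, suspects)
```

    The list of dissenters is what a supervisor would use to decide which component to restart or repair.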

    I still think we need more modern designs that follow the route of older ones like NonStop to architecturally block many issues, with tactics like those in this paper boosting the reliability of individual components. We might also get them cheap if we use RISC boards for the individual components. On top of that, there’s lots of diversity in ecosystems such as ARM, where the boards might have different implementations, process nodes, etc. The differences might reduce the odds that two components fail at the same time.