Here is a summarized version of the abstract.
Main memory is a leading hardware cause of machine crashes in today’s datacenters. Designing, evaluating, and modeling systems that are resilient against memory errors requires a good understanding of DRAM errors in the field. The few studies of DRAM errors in production systems have been too limited in data set size or granularity to answer many open questions about DRAM errors, such as the prevalence of soft errors compared to hard errors or the patterns that hard errors follow.
In this paper, we study data on DRAM errors collected from a diverse range of production systems covering nearly 300 terabyte-years of main memory. We provide a detailed analytical study of DRAM error characteristics, including both hard and soft errors. We identify many promising directions for designing more resilient systems and evaluate the potential of different protection mechanisms in light of realistic error patterns. One of our findings is that simple page retirement policies might be able to mask a large number of DRAM errors in production systems while sacrificing only a negligible fraction of the total DRAM in the system.
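
To make the page retirement idea concrete, the following is a minimal sketch of a threshold-based policy replayed over a stream of error addresses. It is not the policy evaluated in the paper: the function name `simulate_page_retirement`, the 4 KiB page size, the `threshold` parameter, and the sample addresses are all assumptions chosen for illustration.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the paper


def simulate_page_retirement(error_addresses, total_bytes, threshold=1):
    """Replay error addresses under a retire-after-`threshold`-errors policy.

    Hypothetical helper for illustration only.
    Returns (masked_errors, retired_pages, fraction_of_memory_retired).
    """
    errors_per_page = defaultdict(int)
    retired = set()
    masked = 0
    for addr in error_addresses:
        page = addr // PAGE_SIZE
        if page in retired:
            # The page is no longer in use, so this error would not be observed.
            masked += 1
            continue
        errors_per_page[page] += 1
        if errors_per_page[page] >= threshold:
            retired.add(page)
    fraction = len(retired) * PAGE_SIZE / total_bytes
    return masked, len(retired), fraction


# Synthetic example: four errors on one page, one on another, in 1 GiB of DRAM.
addresses = [0x1000, 0x1008, 0x1010, 0x5000, 0x1020]
masked, retired, fraction = simulate_page_retirement(addresses, total_bytes=1 << 30)
```

In this synthetic run, repeated errors on the same page are masked once that page is retired, while the retired pages make up a very small share of total memory. Whether real error traces behave this way depends on how strongly errors cluster on previously affected pages, which is the kind of question the field data is meant to answer.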