Sure, but almost all of your DIMMs will never experience any errors:
Across the entire fleet, 8.2% of all DIMMs are affected by correctable errors
And even within the subset of DIMMs that do error, a few bad DIMMs account for almost all the errors (figure 2 in the linked paper). They also find that the number of errors increases over time up to a plateau (fig 10).
So I think the take-away for most people should be: you should test your RAM, and maybe you should test it again after a year or two.
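To make the "average vs. typical DIMM" point concrete, here is a rough sketch. It combines the 8.2% figure quoted above with the "over 10 per DIMM per day" average mentioned elsewhere in the thread. The fleet size of 1000 and the even spread of errors across affected DIMMs are my own assumptions for illustration; the paper says the real distribution is even more skewed than this.

```python
import statistics

# Figures quoted in this thread; combining them like this is a simplification.
FLEET_MEAN_PER_DAY = 10.0   # "over 10 memory corruptions per DIMM per day"
AFFECTED_FRACTION = 0.082   # "8.2% of all DIMMs are affected by correctable errors"

# Hypothetical fleet size, chosen only so the counts come out as whole DIMMs.
FLEET_SIZE = 1000

affected = round(FLEET_SIZE * AFFECTED_FRACTION)   # DIMMs that see any errors
unaffected = FLEET_SIZE - affected                 # DIMMs that see none

# Assumption: spread the fleet's total daily errors evenly over affected DIMMs.
total_errors_per_day = FLEET_MEAN_PER_DAY * FLEET_SIZE
per_affected_dimm = total_errors_per_day / affected

fleet = [per_affected_dimm] * affected + [0.0] * unaffected

fleet_mean = statistics.mean(fleet)       # the headline average
fleet_median = statistics.median(fleet)   # what a typical DIMM sees

print(f"affected DIMMs:           {affected} of {FLEET_SIZE}")
print(f"fleet mean per day:       {fleet_mean:.1f}")
print(f"fleet median per day:     {fleet_median:.1f}")
print(f"mean among affected only: {per_affected_dimm:.1f}")
```

Under those assumptions the fleet-wide mean stays at about 10 per day while the median DIMM sees zero, which is why a one-off memory test tells you far more about your particular DIMM than the fleet average does.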
I don’t have any data to hand, but until a bit more than a year ago I worked at a smaller cloud provider and I recall the DDR4 failure rates were even higher than DDR3. We had ECC, so we would at least find out – machines would conk out across the fleet due to uncorrectable failures way more than was acceptable. Nothing about the trajectory of DRAM reliability inspires any confidence at all.
But what is clear is one ought to ignore anybody pushing the idea that ECC isn’t the absolute minimum level of protection needed. Run fast and far away.
That’s over 10 memory corruptions per DIMM per day, all of them uncorrected on non-ECC memory.
Remember this next time you’re considering non-ECC DRAM.
An article from 2009? Here is an update from 2013.
I’m sure there are more recent articles.
https://youtu.be/fE2KDzZaxvE?t=2102 Good talk overall; the timestamp in the URL jumps to the part about DRAM.