1. 24

From the abstract:

We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent” – the only symptom is an erroneous computation.

  1. 8

    This is utterly terrifying, and makes me wish I had followed through instead of botching a slow-running experiment many years ago (mid-to-late 00s) through a cocktail of poor data collection, poor management practices, and losing interest.

    The gist of it was exploring SDC (silent data corruption, which the article considers this problem a symptom of). I had a consumer-grade NAS (RAID-3) holding my backed-up DVDs, after losing one to environmental causes. The files were chunked and checksummed on backup, re-reads were verified, and the checksums were stored on a thumb drive.

    Every so often, a random video file would be selected, copied to another drive, and then removed from its source. The copy was to ensure a trip across the bus and back - no filesystem tricks. Once a year, I pulled out the thumb drive and re-ran the checksums. Every year there were new mismatches.

    Memory is a bit hazy, but over the span of 4 years I saw something around 5% failures. I also lost the thumb drive, and the whole thing turned into an exercise in experiment design failures.
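
For anyone wanting to repeat this kind of check, here is a minimal sketch of the chunk-and-checksum step. The hash (SHA-256), the 1 MiB chunk size, and the function names `chunk_checksums` and `find_mismatches` are all assumptions for illustration; the original setup's choices are not stated.

```python
import hashlib

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; an arbitrary choice for this sketch


def chunk_checksums(path, chunk_size=CHUNK_SIZE):
    """Return one SHA-256 hex digest per fixed-size chunk of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha256(chunk).hexdigest())
    return digests


def find_mismatches(path, expected, chunk_size=CHUNK_SIZE):
    """Re-read the file and return indices of chunks whose digest differs."""
    actual = chunk_checksums(path, chunk_size)
    bad = [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # A change in chunk count means data was lost or added; report those too.
    shorter = min(len(actual), len(expected))
    longer = max(len(actual), len(expected))
    bad.extend(range(shorter, longer))
    return bad
```

Storing the digest list somewhere other than the drive under test (as with the thumb drive above) matters, since a manifest kept on the same device can rot along with the data. Per-chunk digests also narrow a mismatch down to a region of the file rather than just flagging the whole file.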

    1. 3

      This paper is wild.

      0.1% silent CPU failure in production systems is far higher than I would have predicted.

      1. 2

        Should’ve clicked on this earlier. I didn’t expect this to be about faulty CPU cores; I expected some benchmarking.

        A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

        That sounds neat in production
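
To see why a "self-inverting" fault like the one quoted above is so hard to catch, here is a toy model. It is not AES and not the mechanism from the paper (which the quote does not describe); it assumes a hypothetical core that deterministically corrupts the keystream of a simple XOR cipher. The names `keystream` and `xor_cipher` and the `faulty` flag are made up for this sketch.

```python
import hashlib


def keystream(key, length, faulty=False):
    """Derive `length` keystream bytes from `key` using SHA-256 in counter mode."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    out = out[:length]
    if faulty:
        # Hypothetical defect: this "core" always flips the low bit of
        # every keystream byte, the same way on every run.
        out = bytearray(b ^ 0x01 for b in out)
    return bytes(out)


def xor_cipher(data, key, faulty=False):
    """Encrypt or decrypt by XORing data with the keystream (same operation)."""
    ks = keystream(key, len(data), faulty)
    return bytes(d ^ k for d, k in zip(data, ks))


key = b"example key"
message = b"attack at dawn"

# Encrypt on the faulty core.
ciphertext = xor_cipher(message, key, faulty=True)

# Round trip on the same faulty core: the error cancels out.
same_core = xor_cipher(ciphertext, key, faulty=True)

# Decrypt on a healthy core: the error is exposed.
other_core = xor_cipher(ciphertext, key, faulty=False)
```

Here `same_core` equals `message`, while `other_core` does not, so a self-test that encrypts and decrypts on one core passes every time. Only a cross-check on a different core, or against a known-answer test vector, would reveal the defect.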