1. 99
  1. 29

    This reminds me of a time when, as a sales engineer in the mid-2000s, I was flown out to St. Louis to rescue a software contract renewal. Our customer couldn’t get our software to work. It seemed to have broken the ERP software it was integrated into. The account manager and I were given a pleasant welcome in a nice conference room from which we could see the Gateway Arch. Nevertheless, we were given a firm but understandable, “No,” unless I could fix the problem.

    So I left the comfortable conference room for their cramped, windowless IT department downstairs where I, in my obligatory suit and tie, was subjected to stares of incredulity. “Did it work when it was first installed?” I asked. “Yes,” they said. “When did it stop working?” “When we upgraded our computers last month.” Like them, I assumed the problem was some kind of version incompatibility. But the OS was the same version; their ERP version hadn’t changed; and our software version hadn’t been changed. Only the hardware was different.

    On a lark, I plugged an old keyboard into one of the new computers. The forms in the ERP started working again. Then I plugged a new keyboard into an old computer and the forms broke. As it turned out, the new keyboards were entering spurious invisible characters, which explained some other problems they’d been having, too. The purchasing manager frowned at the prospect of having to return a few hundred keyboards, but, to her credit, she didn’t shoot the messenger. She renewed our contract.

    My problem was nowhere near as hard to track down or fix as the hard drive resonance problem in this article. It only struck me because I had to travel a long way to be reminded that, every once in a while, the imperfect physical world shatters the illusion of a perfect abstraction in which only the software can fail.

    1. 22

      This reminds me of Gödel, Escher, Bach, where Hofstadter describes how there can’t be a perfect record player: for each player there is a record that would either destroy the player using resonant frequencies or that the player could not play properly.

      1. 18

        I wish someone who debugged this would give more of an explanation for the crash. Given the era, I presume this meant the dreaded blue screen. But why was the system unable to deal with a faulty hard drive? Did a timed-out read in a VMM path cause a blue screen? I’ve blasted Linux for running too long with a faulty drive that was spewing IO errors into the logs for months without me noticing. Maybe the drive had a second resonant frequency that caused a perpetual delay such that even retries failed? I would love a concrete description from the Windows OS dev perspective.

        1. 8

          It depends on where the failure happens. One key difference between the NT and Windows kernels is that almost all kernel memory in NT is pageable. If you hit a timeout trying to bring a kernel page back into memory that’s needed on an interrupt-handling path or to release a lock, then you may well crash. I suspect that, if it’s caused by resonance, the drive might report a read error because the head jittered during the read, rather than reporting a recoverable error. In this case, you’d be unable to satisfy the page fault and have no option but to die.

        2. 12
          1. 12

            Back in the 90s when I mostly worked with hardware, I used to play Reckoning Day by Megadeth to test sound cards, after discovering that the opening few seconds would quite often blow the onboard amp.

            1. 5

              “What happened to the amp?”

              “I shredded it.”

              1. 0

                Brutal.

              2. 6

                This is the interesting thing about performance and security issues: quite often they break through abstraction boundaries.

                1. 4

                  Reminds me of the long list of acoustic attacks on hard drives with the most amusing demonstration being: https://www.youtube.com/watch?v=tDacjrSCeq4

                  1. 9

                    That was linked in the article, btw.