This is about the Tandem NonStop architecture, which delivered five-nines availability with linear scaling. Think Erlang's benefits at the hardware and OS levels. The neat thing about this paper is how systematically they looked at how failures occurred so they could block as many as possible by design. I keep sharing the old stuff since its patents are probably expired but the techniques are still useful. An open clone of NonStop built from commodity parts could be awesome: port a high-availability microkernel like QNX or MINIX 3 with a runtime like Erlang's and/or Ada/SPARK for possibly higher availability.
One thing to add: an interesting data-collection aspect of this paper, which few other systems papers enjoy, is a very good real-world corpus that arose almost automatically from the nature of the problem. Tandem advertised a failure-tolerant system with a long MTBF and sold it (only) to enterprise customers, with that promise and a strong service agreement. Therefore, if the system failed, customers would normally report it, since system failure was considered a "shouldn't happen" type of event. There were exceptions: the paper estimates that only about half of total failures for any reason were reported, presumably excluding cases where the customer immediately knew the cause (e.g. misconfiguration on their part) and so didn't consider it worth invoking the support agreement. Gray was furthermore given access to the detailed internal failure logs of four large customers, which added an extra level of data against which to check the broad all-customers failure data.
“Tandem advertised a failure-tolerant system with long MTBF, and sold it (only) to enterprise customers, with that promise, and with a strong service agreement. Therefore, if the system failed, customers naturally would normally report it”
Absolutely! Loved that. It's like an effective, paid version of the many-eyeballs argument for debugging. When I read it, I was tossing around ideas about how to create that for a FOSS OS or software. It would start with paid FOSS for businesses w/ support and plenty of logging. Gray's results are sort of a corroboration of the idea of instrumenting the heck out of your apps so they log or report almost anything anomalous, even internally, and it gets back to the developer. Maybe make it easier with assertions left in or tools like Galois's Copilot. It's also one of the hard parts of introducing new, high-integrity systems into enterprises, where they'll be used and abused in who knows what manner. The incumbents, with their years to decades of deployments, have received enough questions and error reports about the operating environments to get their systems as stable as they are in the general case, w/ mods for specific, weird situations. The new player has to learn all that the hard way unless they can design their way out of it somehow. Only partially, I bet.
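The "assertions left in, anomalies reported" idea could look something like this minimal Python sketch. To be clear, this is my own illustration, not anything from Gray's paper or Copilot; `soft_assert` and `apply_discount` are hypothetical names:

```python
import logging
import traceback

# Sketch: assertions that stay enabled in the shipped build but log
# anomalies instead of crashing, so reports can flow back to developers.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("anomaly")

def soft_assert(condition, message):
    """Check an internal invariant; on failure, report it and keep running."""
    if not condition:
        # Capture where the invariant broke, for the anomaly report.
        stack = "".join(traceback.format_stack(limit=3))
        log.warning("invariant violated: %s\n%s", message, stack)
        return False
    return True

def apply_discount(price, discount):
    # Internal sanity checks remain in production; violations get logged,
    # and the code falls back to a clamped, safe value.
    soft_assert(price >= 0, "price should be non-negative")
    soft_assert(0 <= discount <= 1, "discount should be a fraction")
    return price * (1 - max(0.0, min(discount, 1.0)))
```

In a real deployment the `log.warning` call would feed whatever support pipeline gets the report back to the vendor, which is the part Tandem's service agreements gave them for free.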
I’d like to have data this thorough on Linux and the BSDs. We already have theoretically thorough work, like how the Micro-Restarts paper looked at the effect of the technique on problems in every layer of the stack. I want more empirical data, though, on how internal or user errors play out, to see which mitigations need to go where to knock out about every problem. Maybe Red Hat, SUSE, or the proprietary UNIX vendors collect or share such things. I’ve always thought those systems are too complex to fix that thoroughly, though, with better odds on building on something like MINIX 3 or QNX + SQLite-style storage + VMS-style clustering w/ diverse, reliable hardware.
Very interesting paper; thanks for digging it up. Incidentally, its author, Jim Gray, is well known for a variety of other contributions to computer science as well (especially databases), for example as one of the people involved in developing the ACID concept (he proposed the ACD parts).
Yeah, he did some amazing work. The Wikipedia article on Tandem is actually pretty good at summarizing the designs, and some tradeoffs they made over time, with references. Bitsavers still has most of the guides on the architecture, administration, etc. As far as papers that also cover the hardware go, the best other one was by a competitor at Stratus; it summarizes the subject fairly exhaustively:
http://ftp.stratus.com/vos/doc/papers/RobustProgramming.ps
Note: the PDF copy is gone, w/ no Wayback Machine backup. This is the only link I found.