1. 8

  2. 4

    I like the improvements. Replayable logs are probably a good idea in systems with redundancy. Hackers might hit the redundancy part, too. But…

    The premise feels like a strawman. Systems like Google Docs are built with a focus on features, not high reliability. Few systems in commercial or FOSS deployment use proven methods of high-reliability design. They instead use methods proven to accomplish their other goals while reducing customer-losing outages below some threshold. So, we’ll keep seeing such results.

    Anyone really testing their reliability concepts this way should do it against reliability-focused deployments such as OpenVMS, NonStop, and Stratus clusters. Maybe throw in Minix 3 (drivers), SPARK Ada (deterministic components), and FoundationDB (distributed components). After architecture, there are safe languages, static analysis, and test generators. Unlike seL4 or CompCert, these are all either commercially developed systems or methods widely deployed in them. That proves they’re feasible for new projects.

    Systems aiming for high-reliability today should use combinations of such techniques. Maybe even the OP technique, too, since it reduces risk. They’ll get better results. In this case, I see an opportunity right where it says “our failure paths.” Quick note: always specify, analyze, and test them as much as any other code. They’re code, too.

    1. 3

      For an application, ultimately the best guarantee is to have an application-specific remote log that can be replayed to restore corrupted state.

      Another person who apparently thinks the world consists of nothing but servers. Client devices, esp. used by consumers, don’t have a devops team ready to recover state from a remote log. And for embedded devices or those without a sophisticated UI, “back up as often as possible” is unrealistic. (I’ve had some painful data loss on digital cameras, and I’ve got music gear that could lose unrepeatable moments of inspiration if their storage went tits-up.)

      It would be nice to see people actually taking steps to make mass storage more reliable instead of passing the buck.