

    I’m confused: WAS this a Byzantine fault? The post only describes a network partition: every node appears to have followed the Raft protocol correctly. That’s normal behavior for asynchronous networks, not a Byzantine failure. It sounds like the partition interfered with Raft elections, but didn’t actually cause a safety violation.

    In the RAFT protocol, cluster members are assumed to be either available or unavailable, and to provide accurate information or none at all.

    This is sort of… half-true. Raft is designed to preserve safety under arbitrary asynchronous network conditions, including partitions like this one, and the post doesn’t describe any kind of safety violation. As with all consensus systems, though, network partitions can interfere with availability, and during a partial network partition like this one you can wind up with less-than-ideal availability. Now… the Raft paper does say:

    [Nodes] are fully functional (available) as long as any majority of the servers are operational and can communicate with each other and with clients.

    And this claim is violated here! At the same time, we should understand that claim in the context of the paper’s repeated cautions around availability:

    … availability (the ability of the system to respond to clients in a timely manner) must inevitably depend on timing. For example, if message exchanges take longer than the typical time between server crashes, candidates will not stay up long enough to win an election; without a steady leader, Raft cannot make progress.

    Regardless, we do have what looks like a violation of the majority-liveness claim: it sounds like a partially isolated node can rapidly advance its own term through repeated elections, forcing the leader in a well-connected majority component to step down. Curiously, there ARE mechanisms in the Raft paper which are specifically intended to address this:

    To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.

    Does this mechanism fail to address this partial-partition scenario? Did etcd opt not to implement it? Or could there be a bug in etcd? Worth investigating!
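    To make the quoted mechanism concrete, here’s a minimal Go sketch of the leader-stickiness rule: a follower that has heard from a current leader within the minimum election timeout refuses to update its term or grant a vote, even to a candidate with a much higher term. The names (`node`, `shouldGrantVote`, `minElectionTimeout`) are illustrative assumptions, not etcd’s actual implementation.

    ```go
    package main

    import (
    	"fmt"
    	"time"
    )

    // minElectionTimeout is an assumed lower bound on the election timeout.
    const minElectionTimeout = 150 * time.Millisecond

    type node struct {
    	currentTerm   uint64
    	lastHeartbeat time.Time // last time we heard from a current leader
    }

    // shouldGrantVote applies the paper's rule: if a RequestVote arrives within
    // minElectionTimeout of the last leader contact, refuse to update our term
    // or grant the vote, even if the candidate's term is higher.
    func (n *node) shouldGrantVote(candidateTerm uint64, now time.Time) bool {
    	if now.Sub(n.lastHeartbeat) < minElectionTimeout {
    		return false // a current leader exists; ignore the disruptive candidate
    	}
    	if candidateTerm <= n.currentTerm {
    		return false // stale candidate
    	}
    	n.currentTerm = candidateTerm
    	return true
    }

    func main() {
    	now := time.Now()
    	n := &node{currentTerm: 5, lastHeartbeat: now.Add(-50 * time.Millisecond)}

    	// A partially isolated node at term 100 tries to force an election,
    	// but we heard from our leader 50ms ago: the vote is refused and our
    	// term is untouched.
    	fmt.Println(n.shouldGrantVote(100, now), n.currentTerm) // false 5

    	// After a full election timeout with no leader contact, the same
    	// request succeeds and advances our term.
    	later := now.Add(200 * time.Millisecond)
    	fmt.Println(n.shouldGrantVote(100, later), n.currentTerm) // true 100
    }
    ```

    If etcd applied a check like this, the partially isolated node’s inflated terms shouldn’t be able to depose a leader that is still heartbeating to a majority — which is exactly what makes the reported outage worth digging into.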


      Amazing article, for many reasons. Failure scenarios in the real world are much more complicated than consensus algorithms assume. For the ongoing research efforts in this domain, I think it would be extremely valuable to build a modeling framework and reproduce the outages that are documented online, like this one, as test cases. FoundationDB did something similar, and their database is one of the most reliable out there.