I love these posts. Thanks for writing it and posting it here!
A common theme in aphyr’s excellent jepsen posts+tweets is him gently asking “why don’t you use a proven consensus algorithm”?
Does anyone have any perspective on why this advice isn’t taken more seriously? Not-invented-here? Implementation complexity/too much effort? Lack of confidence/belief in the benefits and/or academia?
I’ve been guilty of similar behaviour in the past I guess (not on a distributed system, but using ordered msync() on a custom index file to emulate transactions to avoid the use of a db).
I think I thought at the time my solution was “mostly good enough” with the failure modes being unlikely, but it’s a while ago now :-)
Are there actually any good reasons to not use Paxos or Raft (and are there alternatives to these two?)
I’m not an expert, so get some corroboration here but.
But finally, the most important point is just that trade-offs are everywhere! Paxos/Raft aren’t a silver bullets—they’re slow, unstable, and only work under some assumptions that are honestly a little difficult to pull off in practice. When analyzing them it may be easy to feel that you can do better for some specific circumstance by relaxing some of their assumptions and just building your own system.
I think a lot of what Kyle’s posts have emphasized is that Paxos/Raft aren’t just a set of strange algorithms solving a theoretically interesting problem—they’re battle-hardened weapons-of-war against a remarkably difficult problem. Databases who just don’t include them think that they can go into battle with a new strategy and Jepsen does a marvelous job of throwing just the right kind of response to cower them.
So maybe what people should be realizing is that they didn’t think hard enough about their strategy to realize just exactly how difficult claims in distributed systems genuinely are. The CAP theorem talks about hard limits which are much lower than human experience (which is decidedly distributed, but we essentially spend our whole lives ignoring that) would have us believe. Paxos and Raft are slower and more fragile than you might think—but what makes you really, really think that you have a trick to getting the speed and robustness that you want?
But again, it’s all tradeoffs. Usually the big wins that Jepsen takes home are when a product’s marketing team likes to throw around technical terms it has no right to use. Actual use is probably a lot stickier—and it’s often the case that actual needs of a product may not be ACID or linearizability. But to make that tradeoff is hard and to market that tradeoff seems to be universally considered not worth the effort.
There are several good reasons. Three that come to mind from my perspective as somebody who has built several systems on these primitives:
Now, that doesn’t mean that you shouldn’t use them, just that they have high engineering costs.
are there alternatives to these two?
ZAB, Viewstamped Replication, and the many variants of Paxos (vertical, cheap, etc). There are also other design approaches, like virtual synchrony. One way to look at all of these is as different algorithms. Another way is that they are all different ways of leveraging the same core algorithm (Paxos’s synod algorithm, VR’s viewchange, etc) into real systems.
This is a huge part of it. These things are extremely difficult to reason about clearly, and it’s very easy to invent a consensus protocol that you can’t find a counter-example for. Reality, however, is much more creative.
[…] and are there alternatives to these two?
There is also Viewstamped Replication (original paper and updated) and Zab (e.g. Zookeeper, paper). And here is a comparison of Paxos / VR / Zab.
Are there actually any good reasons to not use Paxos or Raft
To your first question, I don’t think most people lack confidence in the academic side, or think these algorithms are formally incorrect. I feel they are well proven at this point.
I think the advice is taken seriously (at least I hope so) … but I think the slow progress boils down to real-world constraints, tradeoffs and implementation challenges. For example:
External systems such as Zookeeper are tricky to configure correctly, and add additional complexity to a system. Instead of deploying DatastoreX, you have to deploy DatastoreX + Zookeeper…and then keep both clusters healthy. This applies to any “external” consensus server, like Zookeeper, etcd, Logcabin, etc
These consensus algos aren’t really “drop in place”; they need to be fully integrated into the system, otherwise all kinds of weird stuff can happen. Which can make it tricky to integrate an existing consensus library into your own code (because of bog-standard, mundane reasons).
Even very clear, explicit algos like Raft have some room for interpretation, which can cause strange or undefined behavior. And parameters can be fiddly, leading to poor performance. And of course, legit bugs are always possible too, and many of these libraries are either relatively new, or implementing hairy algos like Paxos
Most of the simpler variants of consensus algos do not tolerate dynamic changes to the node list (e.g if you want to resize your cluster, you have to bounce the entire cluster). You typically need a more complex implementation to change node counts at runtime (e.g. Section 2.3, “Dynamic Paxos”, Zookeeper reconfiguration, Section 6 of Raft)
Similarly, more complicated implementations are usually needed for better performance (Section 9.3 in Raft, Fast Paxos)
It isn’t always clear what you should put through the consensus state machine, and what should be excluded. Do you send all data through the state machine? That will crush performance of the server….which is a serious problem, because many of these distributed datastores exist to address scalability and performance problems found in monolithic systems. If distributed DatastoreX is just as slow as a monolithic DB because it has to get consensus on all writes…I might as well just use the monolithic system.
Alternatively, you could maintain consensus only on metadata, or perhaps certain types of data. Or allow the user to select consistency levels per-data-type or per-operation You can build any number of scenarios, and it just places your system on a continuum…none of which are strictly “good” or “bad”.
Simple history is sometimes the blame. Project starts as something small for one task, grows, starts dealing with these types of issues, needs to retrofit a lot of older code. Hopefully the next “generation” of distributed datastores will take these considerations into account from the start, since it is a much more visible topic now.
Now, I’m not saying there isn’t a good dose of NIH syndrome, or “our algo will do just fine”, or “good enough”. That definitely happens too :)
But there are real challenges to implementing these things. Challenges which can certainly be overcome, but it isn’t easy as “just” implementing an algorithm and calling it a day.
Which is why its a good thing people like Aphyr are documenting the shortcomings of these systems, to either prompt more rigorous development, or proper documentation of the risks/tradeoffs of a system. There’s nothing wrong with being a fast, lossy datastore…as long as your users know exactly what they are buying into. :)
These are just my personal opinions about the whole ecosystem, but I think it’s a fair coverage of the realities.