To be blunt: I have not seen any evidence that Antirez is qualified to implement a distributed system. I have seen plenty of evidence that he isn’t. He is clearly a very smart individual and his understanding of writing high-performance code is top-notch. But I just don’t think he understands distributed systems and how to approach them.
What’s such a shame is that what you have to do is exactly what you said: implement. Why keep re-creating your own versions of these things when it keeps being demonstrated over and over that they’re faulty? This tweetstorm sums it up:
Seriously, skim the VR Revisited paper, then read http://antirez.com/news/80 . These sorts of issues were solved decades ago. … and until your database authors stop making up homegrown replication and consensus protocols, you’re going to keep seeing these bugs.
Starting with a wiped data set is a byzantine failure, and Redis Sentinel is not able to recover from this problem.
No, it’s not. You don’t need Byzantine fault tolerance to prevent this failure. From Understanding Fault-Tolerant Distributed Systems, Cristian, 1993:
An amnesia-crash occurs when the server restarts in a predefined initial state that does not depend on the inputs seen before the crash.
As a thought experiment, consider a Redis-alike in which the master keeps an in-memory counter of the number of writes it’s accepted. A replica, when connecting to the master, receives this counter as part of the full sync. If the master’s counter is less than the replica’s counter, then the replica refuses to sync and/or triggers a leader election. This is not a Byzantine fault tolerant system (in that it cannot tolerate arbitrary faults), and yet it tolerates amnesia-crash faults.
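To make the thought experiment concrete, here is a minimal Python sketch of it. Everything in it is hypothetical: `Master`, `Replica`, `StaleMasterError`, `full_sync` and `amnesia_crash` are names I made up for illustration, and none of this is Redis code or the Redis replication protocol. The leader election is left out; the replica simply raises an error where it would trigger one.

```python
from dataclasses import dataclass, field


class StaleMasterError(Exception):
    """Raised when a master offers less history than the replica has seen."""


@dataclass
class Master:
    # Hypothetical Redis-alike master; all state lives in memory only.
    data: dict = field(default_factory=dict)
    write_count: int = 0  # number of writes accepted since startup

    def write(self, key, value):
        self.data[key] = value
        self.write_count += 1

    def full_sync(self):
        # Ship the write counter alongside a snapshot of the data set.
        return self.write_count, dict(self.data)

    def amnesia_crash(self):
        # Restart in the predefined initial state, forgetting all inputs.
        self.data = {}
        self.write_count = 0


@dataclass
class Replica:
    data: dict = field(default_factory=dict)
    write_count: int = 0  # counter received in the last accepted sync

    def sync_from(self, master):
        master_count, snapshot = master.full_sync()
        if master_count < self.write_count:
            # The master has seen fewer writes than we have: refuse the
            # sync. A real system would trigger a leader election here.
            raise StaleMasterError(
                f"master at {master_count} writes, replica at {self.write_count}"
            )
        self.data = snapshot
        self.write_count = master_count
```

A replica that synced after two writes keeps its data when the master comes back empty, because `sync_from` raises `StaleMasterError` instead of copying the empty snapshot. Note that this only covers the case described above: a master that restarts empty and then accepts more writes than the replica has seen would still pass the counter check.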
Redis’s replication protocol design means that, if persistence is disabled, Redis has literally no fault tolerance at all. If the master dies and restarts before the Sentinel liveness checks detect the failure, the replicas resync from the now-empty master and all data in the cluster evaporates. This is baffling.