For my money, the author does not actually understand the CAP theorem: anything that prevents access to the data is a partition of the system. Causes of partitions include at least operator errors, bugs or power issues in the replicas, one-off problems, and even many user-side issues (e.g. hardware failures).
Also, the fact that developers didn’t handle outage exceptions in their code simply means that they did not need either availability or consistency in the first place.
Nevertheless, it might be interesting as a highly available storage system.
The author is responsible for the CAP theorem.
Which is an opportunity for wry reflection.
Still, I’d argue the theorem is about the data, not about the nodes that store, produce, consume or distribute it.
It is the data system that can be consistent, available or partition tolerant, and from this perspective an operator error is no different from a network failure: in both cases the client node cannot access the data it needs (so the system is partitioned) and availability is lost.
I feel like the author was quite clear that they were describing the difference between the theorem’s formal requirements and what’s important in practice. They acknowledge that partitions can happen in this strict sense:
The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.
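To make “chooses C and forfeits A” concrete, here is a minimal sketch of a generic majority-quorum register. It is an assumption for illustration only, not Spanner’s actual design: the class `QuorumStore`, the exception `Unavailable` and the `partition`/`heal` helpers are all hypothetical. A node that cannot reach a strict majority refuses the request instead of risking a stale answer.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request rather than risk serving stale data."""


class QuorumStore:
    """Hypothetical toy majority-quorum register (not Spanner's design).

    Each node keeps its own (version, value) copy. A request succeeds only if
    the node it arrives at can reach a strict majority of nodes; otherwise it
    is refused, i.e. the store gives up A to keep C.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.replica = {n: (0, None) for n in self.nodes}
        self.component = {n: 0 for n in self.nodes}  # all nodes connected

    def partition(self, *groups):
        # Nodes in different groups can no longer exchange messages.
        for cid, group in enumerate(groups):
            for n in group:
                self.component[n] = cid

    def heal(self):
        self.component = {n: 0 for n in self.nodes}

    def _majority_reachable(self, node):
        peers = [n for n in self.nodes if self.component[n] == self.component[node]]
        if len(peers) <= len(self.nodes) // 2:
            raise Unavailable(f"{node} sees only {len(peers)} of {len(self.nodes)} nodes")
        return peers

    def write(self, node, value):
        peers = self._majority_reachable(node)
        version = max(self.replica[p][0] for p in peers) + 1
        for p in peers:
            self.replica[p] = (version, value)

    def read(self, node):
        peers = self._majority_reachable(node)
        # Any two majorities overlap, so the highest version seen is the latest write.
        return max((self.replica[p] for p in peers), key=lambda vv: vv[0])[1]


if __name__ == "__main__":
    store = QuorumStore("abcde")
    store.write("a", "v1")
    store.partition("abc", "de")
    store.write("a", "v2")  # majority side keeps serving requests
    try:
        store.read("d")  # minority side refuses: consistency over availability
    except Unavailable as e:
        print("unavailable:", e)
    store.partition("cde", "ab")
    print(store.read("d"))  # v2, learned through node c
```

The design choice is the one the quote describes: the minority side stays consistent by becoming unavailable, and quorum overlap is what lets a later majority see the newest write.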
I think the rich irony comes from the fact that Brewer and many of his disciples among distsys practitioners have used the theorem to, to put it indelicately, poop on practitioners who have minimized the impact of P. And now comes Brewer himself, setting aside robe and mitre!
I can’t disagree with that.
Yes, I read that, but what struck me as a misunderstanding of the theorem was that only network-related issues were considered “partitions”.
Instead, a partition is simply the condition in which a chunk of shared data cannot be obtained by a part of the system (even a single node) that has requested it. Whether the whole system is down because of a bug or a whole set of routers has been misconfigured, it doesn’t matter: from the theorem’s point of view, there is a partition in the system.
Well, yes, I’m probably a purist when it comes to technical jargon…
But, actually, I should have read the author’s name before posting. Lesson learned: Google can hire them all! :-)
That’s not the right definition. A partition is the loss of one or more network messages. It’s not a data-centric condition at all.
To support this (because I see a -1 incorrect vote), here’s section 2.3 from Gilbert and Lynch’s paper:
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost.”
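The quoted model can be sketched as a tiny simulated network. This is an illustrative assumption, not code from the paper: `Network`, `partition`, `send`, `inbox` and `lost` are hypothetical names. Nodes are assigned to components, and any message whose sender and receiver sit in different components is silently dropped.

```python
from collections import defaultdict


class Network:
    """Hypothetical toy network following the quoted model: while partitioned,
    every message from one component to another is lost."""

    def __init__(self, nodes):
        self.component = {n: 0 for n in nodes}  # a single component means no partition
        self.inbox = defaultdict(list)          # delivered messages per receiver
        self.lost = []                          # (src, dst, msg) tuples that were dropped

    def partition(self, *groups):
        # Each group becomes its own component of the partition.
        for cid, group in enumerate(groups):
            for n in group:
                self.component[n] = cid

    def send(self, src, dst, msg):
        if self.component[src] != self.component[dst]:
            # Cross-component message: lost, and the sender is not notified.
            self.lost.append((src, dst, msg))
            return
        self.inbox[dst].append((src, msg))


if __name__ == "__main__":
    net = Network("abc")
    net.partition("ab", "c")
    net.send("a", "b", "hello")  # same component: delivered
    net.send("a", "c", "hello")  # different components: lost
    print(dict(net.inbox), net.lost)
```

Nothing in this model mentions data: the partition is defined entirely by which messages get dropped between nodes.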