1. 1

      There are some points where I just don’t understand you (GC are not partitions in the CAP proof, and it’s not new: see cap faq point 16 http://henryr.github.io/cap-faq/). You can push it into it (for example by putting some deadline constraints in the definition of availability), but it’s not really CAP anymore.

      It’s easier to choose an algorithm which is safe in the general case of asynchronous networks,

      Yeah this one is interesting. I agree on the algo choice (there is no other choice anyway), but many errors (typically not flushing the writes to the disk) are visible only with kill9 or node failures. You have to test independently. Partition tolerance does not mean fault tolerance.

      1. 4

        Partitions in the G&L proof are any pattern of message loss. Any garbage collection period which exceeds the duration an application will wait for a message before considering it lost (e.g. tcp socket timeouts, application-level timeouts) will get you the same behavior.

        1. 1

          Yep, that’s what I call pushing it into it (application cannot wait forever => there are deadlines constraints). CAP applies in this case (i.e. you really have to choose between consistency & availability).

          GC are still a little bit simpler than a true network partition, because the process stops working for everybody. Agreed, there will be some nasty race conditions at the end. But you don’t loose a full rack when there is a GC. It’s a much nicer kind of fault, one node at a time. In a big data system, if you lose a rack you have to replicate the data again for safety. With a GC you do not need that: the node will come back to life before you start to replicate the 2Tb of data it was storing (not to mention the case w/ a rack!).

          I do agree with you on the asynchronous part: the algorithm you need to choose when the network is asynchronous will help a lot with partitions and with GC. But you need to test both.

          1. 4

            It’s a much nicer kind of fault, one node at a time

            GC is a notorious cause of distributed systems data-loss, typically where it allows two nodes to believe they’re the primary at the same time. Take a look at the Elasticsearch mailing lists sometime for all kinds of horror stories around high load, memory pressure, or GC causing inconsistency.

            1. 1

              I’m not sure if the size of the fault is necessarily relevant for this discussion, either.

              1. 1

                Agreed again, but node failures and network partitions will add a few other horror stories.

                I mean, I would expect a software vendor to say

                • We have tested dirty node crashes, no data loss

                • We have tested GC. No dataloss, no performance issue if the GC is contained within 10s.

                • We have not tested network partitions. Per our design it should be fine (we’re aiming at AP: availability during partition), but it’s still an edge case.

                Rather than: “we’re partition tolerant of course.”

                And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield). Not that easy to do (I don’t know ES).

                1. 4

                  Agreed again, but node failures and network partitions will add a few other horror stories.

                  Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition. Even if LAN fabric were totally reliable, I’m trying to remind people that partition tolerance is about message delivery, not just what we traditionally consider “the network”.

                  And for a system like ES (for example), the design for availability under network partition could be something with partial results and so on (harvest & yield).

                  Gilbert & Lynch, footnote 4: “Brewer originally only required almost all requests to receive a response. As allowing probabilistic availability does not change the result when arbitrary failures occur, for simplicity we are requiring 100% availability.”

                  1. 1

                    Absolutely agreed. The reason I mention GC in this response is because you’ve argued that LANs won’t partition.

                    I doubt I said something like this :-) But yeah, for sure the whole post is only about network partitions. I will update the post to make this clear.

                    1. 1

                      “CA exists, and is described as acceptable for systems running on LAN” “Stonebraker was considering small systems of high range servers on a LAN” “it is also fine to assume CA on a LAN”

                      1. 1

                        None of those are mine (lynch/stonebraker/brewer)

                        Both Stonebraker and Brewer consider (quoting Brewer but Stonebraker said exactly the same thing) “CA should mean that the probability of a partition is far less than that of other systemic failures”, so even if they think that CA is acceptable in some specific cases on a LAN that does not mean they think that LANs won’t partition.

              2. 1

                GC are still a little bit simpler than a true network partition… It’s a much nicer kind of fault, one node at a time

                This is usually the case. However, I’ve also seen the back pressure resulting from a GC cause other nodes to become loaded. The other nodes then started a GC. Now there was a feedback loop and the entire system ended up falling over.

                The system could have been configured better… but that’s kind of the same point about experiencing partitions on a LAN. It’s not cost-effective and you’re still going to miss something.

        1. 6

          This article is making me really nervous but I know I don’t have the distributed chops to prove it wrong.

          I’ll say this: when an author starts talking about probabilistic interpretations of the theorem and going on about cost-benefit analysis (seriously, why are we worked up about poor “administration interfaces” here?!) my BS needle starts twitching. And when they do that when an impossibility proof exists that shows element availability and atomic consistency are not both possible, it starts swinging around madly.

          The article reads like an awful lot of language lawyering around fairly well understood concepts, but I’m not sure what the motivations of the author are.

          1. 6

            Heh… Sigh. It reads like an attempt to illuminate, but a bad one. That seems worthwhile if it were shorter and clearer; I don’t think the concepts are actually all that well understood, unfortunately. At a previous job, after two months of arguing that Riak was the wrong choice for the company, I finally got through:

            Me: “What exactly is the benefit you’re hoping for from using a distributed technology? Uptime? Preventing data loss?” Them: “Yes, both of those.” Me: “Those are mutually-exclusive in our situation.” Them: “Oh… Maybe something else would be okay.”

            (And no, they aren’t inherently mutually exclusive, but the data was peculiar and merging later, after resolving a partition, wasn’t an option. I can’t go into it.)

            I definitely don’t want that to be read as an insult to the intelligence of the person involved; they were quite competent. It’s just that databases are a subject not all engineers actually know very much about, and distributed ones are a rather new technology in the scheme of things.

            It’s worth noting that not all distributed systems are databases, too, of course!

            1. 5

              That’s not what the impossibility proof says–he references that paper.

              “In 2002, Seth Gilbert and Nancy Lynch publish the CAP proof. CA exists, and is described as acceptable for systems running on LAN.”

              “If there are no partitions, it is clearly possible to provide atomic, available data. In fact, the centralized algorithm described in Section 3.2.1 meets these requirements. Systems that run on intranets and LANs are an example of these types of algorithms” [0]

              I don’t think CAP is very well understood. I think folks end up very confused about what consistent means, and what partition-tolerant means.

              I think this is pretty well researched. I’m not sure why cost-benefit analysis makes you nervous.

              1. 4

                James Hamilton of AWS says it best, I think:

                Mike also notes that network partitions are fairly rare. I could quibble a bit on this one. Network partitions should be rare but net gear continues to cause more issues than it should. Networking configuration errors, black holes, dropped packets, and brownouts, remain a popular discussion point in post mortems industry-wide.

                Gilbert & Lynch’s implicit assertion is that LANs are reliable and partition free; I can buy this in theory but does this happen in practice? When Microsoft performed a large analysis of failures in their data centers, they found frequent loss occurring that was only partially mitigated by network redundancy.

                But either way you make a fair point: CA models aren’t strictly precluded by that proof. I’m just not certain I’ve seen a network that is trustworthy enough to preclude partitions.

                1. 5

                  Network partitions are not even remotely rare, honestly. LANs are actually worse culprits than the Internet, but both do happen.

                  You already cited one of the better sources for it, but mostly I believe this because I’ve been told it by network engineers who I respect a lot.

                  1. 6

                    Even if network partitions were rare, I’ll tell you what aren’t (for most people): garbage collections. What I did not like about this post is it, over and over again, just talks about network partitions and the actual networking hardware. But weird application-specific things happen as well that appear to be unresponsive for longer than some timeout value and these are part of the ‘P’ as well.

                    In reality, I think CAP is too cute to go away but not actually adequate in talking about these things in detail. PACELC makes the trade-offs much clearer.

                    1. 4

                      LANs are actually worse culprits than the Internet

                      Funny you mention that: over the past few days I’ve been fighting an issue with our internal network that has resulted in massive packet loss internally (>50% loss in some spikes), and ~0.5% to the Internet. That’s probably why this article raised my eyebrows - it’s my personal bugbear for the week.

                      The culprit seems to have been a software update to a Palo Alto device that stopped playing nice with certain Cisco switches… plug the two of them together and mumble mumble spanning tree mumble loops mumble. The network guys start talking and my eyes glaze over. But all I know is that I’ve learned the hard way to not trust the network - and when a proof exists that the network must be reliable in order to have CA systems, well…

                      1. 1

                        Heh - my sympathies.

                  2. 3

                    I think some of the confusion comes from describing all node failures as network partitions. In reality “true” network partitions are rare enough (lasting in durations long enough to matter to humans), but nodes failing due to hardware failure, operational mistakes, non-uniform utilization across the system, and faulty software deploys are sometimes overlooked in this context.

                    i like the comment above “It’s worth noting that not all distributed systems are databases, too, of course!”, but i think this is also a matter of perspective. most useful systems contain state, isn’t twitter.com as a network service a distributed database? kind of neat to think about

                  3. 3

                    It’s not clear to me that the distinction the author makes between a CA and a CP system exists. He uses ZooKeeper as an example of a CP system, but the minority side of networking partition in ZooKeeper cannot make progress, just like his CA example. In reality, CP seems to be a matter of degree not boolean, to me. Why does a CP system that handles 0 failures have to be different than one that handles 2f-1?

                    1. 1

                      When the system availability is zero (not available at all) after a partition, you can claim both CP and CA (that’s the overlap between CP/CA).

                      There are two corner cases when the system is not available at all:

                      • the system does not even restart after the partition. You can claim CP theoretically. The proof’s definitions don’t prevent this formally. But it makes little sense in practice.

                      • the system restarts after the partition and remains consistent. Both CP and CA are ok.

                      But ZooKeeper is not concerned by these corner cases, because it is partly available during the partition.

                      1. 9

                        No, you can’t: a system which is not available during a partition does not satisfy A, and cannot be called CA. If you could claim both CA and CP you would have disproved CAP.

                        1. 2

                          CA means: I have a magical network without partition. If my network is not that magical at the end, I will be CP/AP and more likely in a very bad state, not fully available and not fully consistent.

                          1. 7

                            I’m responding to “When the system availability is zero (not available at all) after a partition, you can claim both CP and CA”. Please re-read Gilbert & Lynch’s definition of A: you cannot claim CA if you refuse to satisfy requests during a partition.

                            1. 3

                              But those magic networks do not exist, so how can a CA system exist?

                              1. 1

                                :-) It exists until there is a partition. Then the most probable exit is to restore manually the system state. 2PC with heuristic resolution being an example.

                                Or, if you build a system for machine learning: 20 nodes with GPU, 2 days of calculation per run. If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.

                                1. 2

                                  That system is still CP. You are tolerating the notion of partitions, and in the case of a partition you sacrifice A (fail to fulfill a request–a job in this case) and restart the entire system for the sake of C.

                                  1. 1

                                    It exists until there is a partition.

                                    If a system reacts to a partition by sacrificing availability - as it must, and you haven’t demonstrated differently - how can you claim it is CA?

                                    If there is a network partition during these two days you throw away the work in progress, fix the partition and start the calculation process again. I don’t see myself waiting for the implementation/testing of partition tolerance for such a system. I will put it in production even if I know that a network partition will break it apart.

                                    I feel like I’m in bizarro world.

                                    1. 1

                                      If a system reacts to a partition by sacrificing availability - as it must, and you haven’t demonstrated differently - how can you claim it is CA?

                                      If the system sacrifices consistency (it could also be consistency, or both), then there is an overlap between CA and CP. That’s what Daniel Abadi said 5 years ago: “What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”

                                      The key point is that forfeiting partitions does not mean they won’t happen. To quote Brewer (in 2012) “CA should mean that the probability of a partition is far less than that of other systemic failures”

                                      That’s why there is an overlap. I can choose CA the probability of a partition is far less than that of other systemic failures, but I could have a partition. And if I have a partition I will be either non consistent, either non available, either both, and I may also have broken some of my system invariants.

                                      I’m sure it does not help you as I’m just repeating my post, and this part is only a repetition of something that was said previously by others :-(

                                      Trying differently, maybe the issue to understand this is that you have:

                                      • CAP as a theorem: you have to choose between consistency and availability during a partition. There are 3 options here:

                                        • full consistency (the CP category)

                                        • full availability (the AP category)

                                        • not consistent but only partial availability (not one of the CAP categories, but possible in practice, typically 2PC with heuristic resolutions: all cross-partition operations will fail).

                                      • CAP as a classification tool with 3 options: AP/CP/CA. There are a description of the system. CA means you forfeited partition tolerance, i.e. it’s a major issue for the system you build.

                                      And, in case there is any doubt: most systems should not forfeit partitions. I always mention 2PC/heuristic because is a production proven exception.

                            2. 1

                              Could you rephrase your statement? I am having trouble parsing what you have said.

                              1. 1

                                the cr went away. let me edit.

                              2. 1

                                If we take your second case - as it’s the only real case worth discussing, as you note :-) - how can you claim the system is available?

                                The system is CA under a clean network until time n when the network partitions. The partition clears up after m ticks. So from [1, n) and (m, inf) the system is CA, but from [n, m] it is unavailable. Can we really say the system maintains availability? That feels odd to me.

                                Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior; in the presence of a partition it’s better to die hard than give a potentially inconsistent answer.

                                Having said all of this, my distributed systems skills are far below those of the commentators here, so please point out any obvious missteps.

                                1. 1

                                  CA is forfeiting partition tolerance (that’s how it was described by Eric Brewer in 2000). So if a partition occurs it’s out of the operating range, you can forfeit consistency and/or availability. It’s an easy way out of the partition tolerance debate ;-). But an honest one: it clearly says that the network is critical for the system.

                                  Maybe it makes more sense to discuss this in terms of PACELC - a system in your second case has PC behavior;

                                  Yep it works, Daniel Abadi solved the overlap by merging CA and CP (“What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical.”) It’s not totally true (a CA system can lose its consistency if there is a partition, like 2PC w/ heuristic resolutions), but it’s a totally valid choice. If you do the same choice as Daniel in CAP you choose CP for the system 2 above. CA says “take care of your network and read the documentation before it is too late”.

                            3. 3

                              seriously, why are we worked up about poor “administration interfaces” here

                              :-) Because I’ve seen a lot of system where the downtime/data corruptions were caused mainly by: 1) software bugs 2) human errors.

                              I also think that a lot of people take partition tolerance for granted (i.e. “this system is widely deployed in production, so it is partition tolerant as I’m sure everybody has network issues all the time, so I can deploy it safely myself w/o thinking to much about the network”). Many systems are not partition tolerant (whatever they say). That’s why Aphyr’s test crash them (dataloss, lost counters,…), even if they are deployed in production.

                              It does not mean they have no value. It’s a matter of priority. See Aphyr’s post on ES, imho they should plan partition tolerance and implement immediately crash tolerance for example, instead of trying to do both at the same time.

                              I prefer a true “secure your network” rather than a false “of course we’re partition tolerant, CAP says anything else is impossible” statement (with extra points for “we’re not consistent so we’re available”).

                              1. 3

                                CAP tells you that you can’t have both C and A when a partition happens. Most people take that to mean you must choose one or the other and have a CP or AP system. But it’s worth remembering that you do have the option of making sure that partitions never[1] happen - either by making the system non-distributed or by making the communications reliable enough. And for some use cases that might be the correct approach.

                                [1] In a probabilistic sense - you can’t ensure that a network partition never happens, but nor can you ensure that you won’t lose all the nodes of your distributed system simultaneously. Any system will have an acceptable level of risk of total failure; it’s possible to lower the probability of a network partition to the point where “any network partition is a total system failure” is an acceptable risk.

                                1. 2

                                  I think it’s important to modify your statement a bit. What you have to do is ensure that in the face of a partition you remain consistent then try your darnedest to reduce the frequency of partitions. The distinction being you have control over what happens during a partition but not control over a partition happening.

                                  1. 4

                                    you have control over what happens during a partition but not control over a partition happening.

                                    I don’t think that this sharp distinction exists. You don’t have absolute control over what happens during a partition - to take an extreme example, the CPU you’re running on might have a microcode bug that means it executes different instructions from the one you intended. And you do have control - to the extent that you have control of anything - over the things that cause network partitions; you can construct your network (or pay people to construct your network) so as to mitigate the risks. It is absolutely possible to construct a network which won’t suffer partitions (or rather, in which partitions are less likely than simultaneous hardware failures on your nodes) if you’re willing to spend enough money to do so (this is rarely a smart choice, but it could be).

                                    1. 2

                                      I do not think byzantine faults really matter for this discussion, they are a whole other issue to partitions. But I do not think your response invalidates my point at all. Partitions are something that happens to you, how your program handles them is something you do.

                              1. 2

                                This post is of low-quality. The author doesn’t seem to have really grasped the Harvest and Yield paper.

                                1. 3

                                  Could you elaborate? I don’t think you’re right. Your main contention seems to be that you can’t give up partition tolerance, but his claim is that if you don’t have any replicas, you have both consistency and availability, but not partition tolerance, which seems reasonable. It’s just not distributed.

                                  In particular, Jay Kreps mentioned this post as being high quality on twitter, and I trust him to have a pretty good distributed systems horse sense.

                                  1. 1

                                    CAP applies to systems not a single piece of data, which is what harvest and yield are about. In the case of failure not all of your data is available but the system might be. And depending on the semantics of the system, that might be just fine.

                                    When the author says there is no such thing as an AP big data store, that is simply false. I have worked on systems that are AP and big data. It worked because missing some data was ok it just meant the quality of a decision during a failure was degraded.

                                    1. 1

                                      there is no such thing as an AP big data store I have worked on systems that are AP and big data It’s an interesting point. At the data store level, saying ‘I don’t know if I have or had this data’ is considered as an error (excepted for some crazy definitions of datastore: but an eventually consistent store does not allow this). Then an application can perfectly encapsulate this. It depends on the application, but this application by itself is not itself a data store (but yes, it is a big data application).

                                      1. 1

                                        As far as I can tell, both Harvest and Yield are in terms of both data and entire systems. They phrase themselves in success rate, which can be considered an “entire system” idea, but clearly the success rate is bounded by availability of underlying data.

                                        My understanding is that strict AP would mean that even in a partition, you have access to all of the data in a system, unless all of the replicas are down. Instead, the system you’re describing sounds closer to neither strictly available nor strictly consistent.

                                        1. 1

                                          I know of nobody who believes availability is defined as that and I have not had the idea of “strict availability” come up in discussions. And it clearly is not useful definition after a few seconds thought. People care about systems as a whole.

                                          1. 2

                                            In particular, Harvest and Yield and the CAP theorem paper both define availability that way.

                                            Harvest and Yield:

                                            High availability is assumed to be provided through redundancy, e.g. data replication; data is considered highly available if a given consumer of the data can always reach some replica.

                                            And

                                            AP without C: HTTP Web caching provides clientserver partition resilience by replicating documents, but a client-server partition prevents verification of the freshness of an expired replica. In general, any distributed database problem can be solved with either expiration-based caching to get AP, or replicas and majority voting to get PC (the minority is unavailable)

                                            Cap Theorem:

                                            For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. That is, any algorithm used by the service must eventually terminate. In some ways this is a weak definition of availability: it puts no bound on how long the algorithm may run before terminating, and therefore allows unbounded computation. On the other hand, when qualified by the need for partition tolerance, this can be seen as a strong definition of availability: even when severe network failures occur, every request must terminate.

                                            And

                                            It is possible to provide high availability and partition tolerance, if atomic consistency is not required. If there are no consistency requirements, the service can trivially return v0, the initial value, in response to every request. However it is possible to provide weakened consistency in an available, partition tolerant setting. Web caches are one example of a weakly consistent network.

                                            It’s fine if you don’t think that reasoning about distributed systems in this way is useful–Brewer says much the same in Harvest and Yield, which is why presented what he considered a more useful metric than strict availability.

                                            1. 1

                                              Great quotes. The theme I see through them is that availability is a spectrum and you can pick spots on the spectrum that make sense to you, which is what I got out of harvest and yield. The CAP Theorem quote you have just says you need an answer, not what the quality of that answer has. I do not believe the article tells that story.

                                    2. 3

                                      To be fair, the harvest and yield paper’s treatment of CAP isn’t any better. For example, it says:

                                      CA without P: Databases that provide distributed transactional semantics can only do so in the absence of a network partition separating server peers.

                                      Even in context of the paper, that’s misleading. The point that the post author is making actually turns out to be a very similar one to the harvest and yield paper. It’s also expressed in a way that’s tricky to follow, but it’s a subtle topic and hard to write about well.

                                      1. 1

                                        The biggest issue I have with the article is that it conflates the availability of a whole system with that of parts of the system. You are right that it is a subtle topic, but I believe this article adds nothing positive to the discussion and even adds some misinformation.

                                      2. 2

                                        I’m doing a lot of head scratching. Maybe there’s an idea here that I’m not getting but its really not laid out well.

                                        1. 1

                                          Sorry about that. The main point of this post is to look at the definitions of availability to show that there are different definitions: A system can be highly-available but not available by the definition of CAP. And then showing how it propagates to common systems. This said, the post is difficult to read if you have not seen previously the CAP theorem proof. If you want to try again :-), you can have a look at the first post of the series. Especially, the post http://blog.thislongrun.com/2015/03/comparing-eventually-consistent-and-cp_11.html introduces some of the definitions.

                                      1. 2

                                        The first row means: “One web-server connected to one SQL database” is a CA system, not available during partitions.

                                        The author is very confusing. CA systems do not exist. I think they are trying to make some meaningful distinction but CAP is really not expressive enough for that, instead PACELC might be a better set of letters to work with.

                                        1. 3

                                          I agree about PACELC, and I think that’s close to the author’s point. There’s a lot of confusion coming from CAP’s particular version of A, which is very useful for some system uses, but is frequently misunderstood.

                                          I wrote about it here: http://brooker.co.za/blog/2014/07/16/pacelc.html

                                          1. 1

                                            Both PACELC and yield/harvest are “beyond CAP”: You want to say that you chose eventual consistency for another reason than cap-availability? That’s PACELC You want to define a system with reduced availability? That’s yield/harvest.

                                            I mention them in the conclusion of the post, but not inside the post because the post is just about CAP.