Threads for neurodrone

  1. 6

    I found the conclusion confusing:

    “Networks do lose packets. But TCP allows to hide packet losses to the application, and that’s what the CAP proof actually does as well. In any case, packet loss is irrelevant when using the CAP theorem: CAP is about partition, not node failure or packet loss. While it is always possible to make the right decision despite basing it on flawed logic, it is generally better to use a theorem only when it can be applied to the problem: dead nodes and packet losses are bad, but they do not force you to make a choice between availability and consistency.”

    What I do not understand about this conclusion is why that would be true. For example, take a consensus algorithm that requires a quorum: whether the cause is a partition, a node being down, or packet loss makes no difference to how the algorithm needs to handle it.

    The author seems to be making a distinction between modes of failure that are observably indistinguishable. I do not see the value in distinguishing these things, but I cannot tell whether there is some practical value I am missing.
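
    To make the quorum point concrete, here is a minimal sketch. It assumes a coordinator that counts acknowledgements from its peers before a deadline; the names `Fault`, `acked`, and `has_quorum` are made up for illustration and do not come from any particular consensus implementation.

```python
from enum import Enum


class Fault(Enum):
    NONE = "none"                # peer is healthy and reachable
    CRASHED = "crashed"          # peer process is down
    PARTITIONED = "partitioned"  # peer is up but unreachable
    PACKET_LOST = "packet_lost"  # peer is up, but this message was dropped


def acked(fault: Fault) -> bool:
    """What the coordinator observes from one peer before its deadline.

    Every fault produces the same observation: no acknowledgement.
    """
    return fault is Fault.NONE


def has_quorum(peer_faults: list[Fault]) -> bool:
    """Majority quorum over the coordinator plus its peers (hypothetical helper)."""
    cluster_size = len(peer_faults) + 1            # +1 for the coordinator itself
    acks = 1 + sum(acked(f) for f in peer_faults)  # the coordinator counts itself
    return acks > cluster_size // 2


# Five-node cluster: coordinator plus four peers, three of which are silent.
for cause in (Fault.CRASHED, Fault.PARTITIONED, Fault.PACKET_LOST):
    peers = [Fault.NONE, cause, cause, cause]
    print(cause.value, has_quorum(peers))
```

    In this sketch the quorum decision depends only on how many acknowledgements arrived, so the cause of a missing one never enters the calculation.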

    1. 7

      It seems the idea is that if a partition is temporary it somehow doesn’t count. Like if the net goes down and you lose packets, just pause the world until TCP retransmits. And if there’s an earthquake and the net is down for three days, just reconnect. See, no partition!

      Now I personally would be more inclined to call that “no availability”. Which is what I think is really going on. It’s actually CP and gives up A while the packet is retransmitted. CAP doesn’t really special-case a brief loss of availability that your users might not notice.

      1. 1

        I think it depends upon the kind of packet loss. If you sever the TCP connection, and aren’t able to finish consuming the byte stream, clearly that represents a loss of availability. But if a few packets drop, and TCP retransmits them within a few milliseconds, why does that mean a loss in availability? It could even be at a switch, so that neither the client nor the server realizes that a retransmit had to happen.

        If any kind of dropped packet meant a loss in CAP availability, then anything over a network would not be CAP available, because every network is lossy.

        1. 1

          Perhaps we need a term “eventually available” like eventually consistent.

          1. 1

            “But if a few packets drop, and tcp retransmits them within a few milliseconds, why does that mean a loss in availability?”

            That depends. The Gilbert & Lynch paper maintains the following notion of availability:

            “The availability requirement implies that every node receiving a request from a client must respond, even though arbitrary messages that are sent may be lost.”

            It means that even in the face of ephemeral packet loss, availability can still be maintained if the server is able to dispatch an appropriate response to its client for every request the latter has made. If a few milliseconds of delay is an acceptable criterion for a valid response, then the system should be termed available as long as the client keeps receiving such valid responses. This holds regardless of whether the responses are received over the same TCP connection or by establishing a new one.

            A partition could mean that in a single request-response round enough packets are lost on the link between client and server that either the client’s request never makes it to the server or the server’s response doesn’t make it to the client (assuming the server was actually able to send one from its local POV). If this is a network blip and subsequent client requests start receiving valid responses, then we can say that the partition healed quickly, for some (hopefully overlapping) temporal definitions of “blip” and “quickly”.
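
            A rough sketch of that last idea, from the client’s point of view. It assumes the client logs, for each request, when it was sent and whether a valid response came back; `unanswered_windows`, `healed_quickly`, and the `blip_seconds` threshold are hypothetical names chosen for illustration.

```python
from typing import Optional


def unanswered_windows(
    events: list[tuple[float, bool]],
) -> list[tuple[float, Optional[float]]]:
    """Group consecutive unanswered requests into (start, healed_at) windows.

    events: (send_time_seconds, got_response) pairs, sorted by time.
    healed_at is the send time of the first answered request after the
    window, or None if no later request was answered.
    """
    windows = []
    start = None
    for sent_at, got_response in events:
        if not got_response and start is None:
            start = sent_at                   # window opens
        elif got_response and start is not None:
            windows.append((start, sent_at))  # window closes
            start = None
    if start is not None:
        windows.append((start, None))         # still unanswered at the end
    return windows


def healed_quickly(window, blip_seconds: float) -> bool:
    """True if the window closed within the assumed 'blip' threshold."""
    start, healed_at = window
    return healed_at is not None and (healed_at - start) <= blip_seconds


# Made-up request log: two requests go unanswered, then one at the very end.
events = [(0.0, True), (1.0, False), (1.5, False), (2.0, True), (3.0, False)]
windows = unanswered_windows(events)
print(windows)
print([healed_quickly(w, blip_seconds=2.0) for w in windows])
```

            Whether a window counts as a blip is entirely a matter of the threshold you pick, which is the point about temporal definitions.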

            1. 1

              “But if a few packets drop, and tcp retransmits them within a few milliseconds, why does that mean a loss in availability?”

              It depends on what your timeout is. I think the question is not “can I construct a scenario in which case packet loss does not affect me” but rather “can I construct a scenario in which packet loss does affect me”, and the answer to that question is clearly yes.
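
              As a back-of-the-envelope sketch of such a scenario: assume timeout-based retransmission with an initial retransmission timeout of 200 ms that doubles after each consecutive loss, and a client that gives up after one second. All the numbers and both function names here are assumptions for illustration, not measurements.

```python
def retransmit_delay(losses: int, initial_rto: float = 0.2) -> float:
    """Extra delay in seconds after `losses` consecutive drops of one segment.

    Assumes the retransmission timeout doubles after each loss; ignores
    fast retransmit and any upper bound on the timeout.
    """
    return sum(initial_rto * 2 ** i for i in range(losses))


def affected(
    losses: int,
    client_timeout: float,
    base_latency: float = 0.01,
    initial_rto: float = 0.2,
) -> bool:
    """True if the response would arrive after the client has given up."""
    return base_latency + retransmit_delay(losses, initial_rto) > client_timeout


for losses in range(5):
    delay = round(retransmit_delay(losses), 3)
    print(losses, delay, affected(losses, client_timeout=1.0))
```

              Under these assumed numbers, one or two drops of the same segment stay under the one-second timeout, while three consecutive drops do not.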