1. 26

I submitted a story last week about a message queue that provides in-order exactly-once messaging with distributed message brokers. In the next few days I saw a lot of the distributed systems people asserting that exactly-once delivery is impossible. If it’s so obvious, can someone explain one or two of these reasons why it’s impossible? I’m no trying to troll, I really want to understand.

  1.  

  2. 27

    Whether exactly-once delivery is possible depends a lot on what you mean by “message”, “exactly-once”, and “delivery”.

    First, ‘atomic broadcast’ IS possible. It is practically possible (and done frequently) to build a distributed system where multiple nodes process the same messages in the same order, all exactly once. This is behind the classic “distributed state machine” approach to building fault-tolerant systems. See http://dl.acm.org/citation.cfm?id=98167 for a classic paper in the area and http://dl.acm.org/citation.cfm?doid=1041680.1041682 for more information on the general topic. In short: building a totally ordered stream of messages and delivering them in the same order to multiple nodes is not only possible, but done frequently.

    So far so good, but there are two big caveats here. One is that getting this right requires significant coordination, which comes with both availability and latency costs. The second is that it’s not really what people mean when they say “message delivery”. Most people mean that each message gets delivered once to one consumer, which does something with that message that has side effects. That becomes trickier, because we need to start talking about failure tolerance.

    Consider the system where the queue hands the message off to the consumer, and does a handshake that makes both agree that the messages has been handed off. Now, the consumer goes off and does something with that packet that has side effects: it changes the world in some way. Finally, the consumer once again runs a protocol which makes both it and the queue agree that the message has been processed. What happens when the consumer fails?

    apy’s post gives two of the possibilities for handling that case. There are others, but they aren’t any better.

    The core problem here is that exactly once delivery is fundamentally at odds with fault tolerance. Exactly-once delivery and processing fundamentally requires that the act of processing, and hence knowledge about the fact the processing happened, is kept at just one place in the system. If that one place fails, the system needs to reconstruct that fact, but has no way to do so. It then needs to decide between re-delivering the message (and possibly having it processed multiple times) or dropping the message (and possibly having it never processed).

    Ok, so it’s impossible. Where does that leave us? It should be pretty obvious to you that many real-world systems rely on exactly-once processing of tasks and messages. How can they do that if it’s impossible?

    Think about Bob, who runs a pizza shop with online ordering. When people order from Bob, their orders go into Bob’s persistent queue. Bob workers take a pizza order off the queue, bakes it, delivers it, and goes back to the queue. Occasionally one of Bob’s workers gets bored and leaves early, in which case Bob gives the order to a different worker. Sometimes, this means that multiple pizzas arrive at the customer’s house (and never less than one pizza). On arriving, the pizza delivery guy asks the home owner if they had received a pizza with that order ID before. If the home owner says yes, the pizza guy takes the duplicate pie with him. If not, he leaves the pie. Each home owner gets exactly one pie, and everybody is happy.

    Short version without pizza: exactly-once delivery is impossible. Exactly-once processing of messages is possible if the processing can be made idempotent.

    1. 16

      I think mjb is exactly right. To expand on this a bit:

      Implementations of distributed replicated state machines in the literature generally assume that operations, once received by a node, are atomically and durably logged to disk. Real disks are not so reliable, which often entails some degree of log replaying, where operations are journaled before being applied to some state machine and applied again to recover from a checkpoint in the event of failure. Moreover, running an operation on multiple replicas is assumed to be safe: if the operation does something like “Lower Gertrude one meter deeper into the volcano”, executing it on one versus three replicas could mean the difference between a successful sampling expedition and a very unhappy geologist.

      Both of these constraints lead us to a hand-wavy notion that operations must be in some sense idempotent in the real world, and on each replica, they have to transform a state deterministically. These properties are key to crash recovery, but not all functions satisfy these properties.

      What people generally mean by “exactly once delivery” of a message is something like “This function will be be invoked atomically and exactly once.” But we know this property is not, in general, satisfiable. Consider:

      def lower
        run_winch :counterclockwise, 1
      end
      

      Now imagine trying to call this function once on a single node, and knowing if we crash, whether the lowering has or has not occurred:

      log :lowered
      lower
      

      If we crash between logging and lowering, Gertrude remains a meter too high to grab her sample. What if, instead, we try

      lower
      log :lowered
      

      Now if a crash occurs between lowering and logging, the computer thinks that Gertrude still needs to go a meter deeper, even though she’s now at the correct altitude. The geologist and winch engineer wind up having a tense conversation punctuated by the odor of Gertrude’s burned boots. They decide instead to augment the lowering process with some extra information:

      def lower(height_above_lava)
        current_height = gertrude.rangefinder.height
        run_winch(...)
      end
      

      Together, they’ve modified the task itself so that it is idempotent. Notice that they had to couple the idempotency of this function to the state it manipulates–e.g. Gertrude’s sensor package–so that it is safe to call multiple times.

      tl;dr: In any message queue, or any transactional system in general, we cannot provide atomicity guarantees for arbitrary side-effecting functions. Those functions must be carefully designed to allow for repeated invocation, because repeated delivery is implicit in crash recovery where some state may be lost.

      1. 1

        I think this covers the case of the subscriber client pretty well. Does this also cover the case of the broker also (meaning that we assume that a queue is a side-effecting data structure)?

        1. 3

          It’s not clear to me that you can think about a broker without clients as a meaningful message queue.

      2. 1

        So exactly-once is possible if there isn’t an in-order requirement? (I assume that’s what you mean by requiring idempotence)

        1. 5

          No, perhaps I explained poorly.

          Talking about idempotence was trying to explain how systems typically get around the problem of exactly-once being impossible. Basically, you embrace the fact that you can’t apply every operation exactly once, so you design for at-least-once. If you make your operations idempotent, then their effects end up being applied exactly once.

          The goal of systems designed like this is delivery at-least-once (for completeness) and approximately-once (for efficiency). Idempotence then gives exactly-once effects at the cost of a little bit of efficiency. Obviously the challenge is designing operations that are idempotent (and commutative and associative if conditions require it).

        2. 1

          Exactly-once delivery and processing fundamentally requires that the act of processing, and hence knowledge about the fact the processing happened, is kept at just one place in the system

          Let me expand this, tell me if I’m wrong.

          My first instinct is to say “No it doesn’t because the processor (client) sends an ACK back to the broker”. But the problem with my statement would be that the ACK may not arrive.

          For instance, the server that I send the ACK to dies right after sending me the message so it doesn’t receive my ACK. The failover server picks up where the other server left off and resends the message, in which case I get the message delivered a second time even though I’ve already successfully processed it.

          My rebuttal is “what if the load balancer has knowledge of which servers have a replica of the session and can smartly choose one of them to route the traffic to?”. This took me a while to figure out but I eventually realized that there’s still a gap between when the server becomes unavailable and when the cluster (and LB) realize that it’s unavailable. So there’s still plenty of time where my ACK will get dropped without the failover server recognizing it. Again, the problem here is that a network failure appears as as an unresponsive server, but so does a long garbage collection cycle (or many other normal, naturally occurring tasks).

          1. 2

            In order to solve exactly-once delivery by adding another node…first you must solve exactly-once delivery. Just, inductively, if it’s impossible to solve with N nodes, it will be impossible to solve with N + 1 nodes.

            EDIT: Also note that this problem is not just queues, it’s any communication. Exactly once HTTP requests are impossible as well.

        3. 5

          Consider two delivery types: deliver and the handler acks on finish. Or deliver and auto ack.

          1) What happens if the handler does all of the work and dies right before the ack is sent to the message queue?

          2) What happens if the message doesn’t make it to the handler, or it dies while processing, in the auto ack scenario?

          1. 3

            I just got a response to me on Twitter that clears it up a little bit.

            clemensv: @kellogh @aphyr the assurances are for when routes or nodes fail. For “exactly once” you’d need full system consistency through the failure

            The exactly-once delivery guarantees are in place precisely because networking equipment fails. (Coming from the IoT world, I’m more concerned with lightweight clients being offline to conserve power, but servers and networking equipment are also concerns). In distributed systems, failed networking equipment causes network partitions. So in order to fulfill exactly-once delivery you have to choose the CP in CAP. But because we chose to lose availability, we can’t have exactly-once delivery.

            This almost makes sense to me, but I get lost on my last sentence. From what I understand of CAP, the A means that we no longer accept writes. In the case of a message queue a “write” would be accepting new messages (correct me if I’m wrong). So in a CP message queue, if a network partition is detected we should return errors to any new messages and force the client to resend them later when we no longer have a network partition.

            I’m also not convinced that exactly-once delivery requires CP. My mind is going in-and-out of accepting this, so if anyone has an example, that would be great.

            1. 2

              For further reading/watching, checkout out:

              Idempotence Is Not a Medical Condition - Pat Helland http://queue.acm.org/detail.cfm?id=2187821

              Immutability Change Everything - Pat Helland http://vimeo.com/52831373

              Byzantine Fault Tolerance http://en.wikipedia.org/wiki/Byzantine_fault_tolerance

              The Saddest Moment - James Mickens http://research.microsoft.com/en-us/people/mickens/thesaddestmoment.pdf