1. 19

  2. 7

    I recently had a long conversation about this at Clojure West. We were discussing techniques for building a scalable, easy-to-administer Service Oriented Architecture (SOA). The two camps were:

    1. Use a message broker, and build synchronous RPCs on top of it.
    2. Use HTTP directly

    The supposed benefit to using a message broker is that it has automatic retries and failover, and by using something like HornetQ, you even get XA (aka Distributed) transactions, so that you can get exactly-once message delivery. It turns out that this isn’t exactly the case.

    The benefit of HTTP is that it’s already an extremely well-supported RPC layer, with tons of platform-agnostic “middleware”, such as nginx, HAProxy, and Varnish.

    My least favorite part about message brokers is that you have to configure them, and if you mess up replication and durability, you can end up with errors that are very difficult to diagnose.

    The queuing approach breaks down at the claim that it provides a simplified model. Suppose that you are building a web service: you need every request to be matched with a response. Now, you’ve just built synchronous RPCs on top of a queuing system. At this point some developers say, “Aha! But this is now asynchronous, and thus more efficient.” Luckily, we have many asynchronous HTTP servers, like node.js and http-kit (for Clojure).

    But then, what about transactions? Isn’t it nice to have that built into the queuing layer itself? There’s a flaw in this argument, too: suppose you’re implementing an RPC that charges a credit card. This should obviously happen exactly once. What if we built it on top of a queuing system’s exactly-once delivery semantics? There are several possibilities, all flawed:

    • The message arrives, the payment is processed, but then the process crashes. The delivery acknowledgement didn’t get sent, and so the message is resent, resulting in a double-bill.
    • The message arrives, the delivery is acknowledged, but then the process crashes. The payment was never processed.
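    As a rough illustration of those two failure modes, here is a toy simulation in Python. Nothing in it is a real broker API: `run`, `Crash`, and the single-redelivery rule are assumptions made up for the sketch.

```python
class Crash(Exception):
    """Simulated process crash (hypothetical, for illustration only)."""


def run(ack_first: bool, crash_after_first_step: bool):
    """Return (times_charged, acknowledged) for one delivery plus one possible redelivery."""
    charged = 0
    acked = False
    for delivery in range(2):  # assume the broker redelivers once if unacknowledged
        if acked:
            break  # an acknowledged message is never redelivered
        steps = ["ack", "charge"] if ack_first else ["charge", "ack"]
        try:
            for i, step in enumerate(steps):
                if step == "ack":
                    acked = True
                else:
                    charged += 1
                # Crash between the two steps, on the first delivery only.
                if crash_after_first_step and delivery == 0 and i == 0:
                    raise Crash()
        except Crash:
            continue
    return charged, acked


double_bill = run(ack_first=False, crash_after_first_step=True)
never_paid = run(ack_first=True, crash_after_first_step=True)
happy_path = run(ack_first=False, crash_after_first_step=False)
```

    With these assumptions, processing before acknowledging charges twice, and acknowledging before processing charges zero times. Only the run without a crash charges exactly once.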

    The correct solution is to include exactly-once semantics at the application layer:

    • The message arrives, and the process checks a replicated, distributed database (like ZooKeeper) to see whether this message has already been processed. If it has not, the process handles the payment. If it has, the process does nothing. Finally, the message is acknowledged.

    As a result, regardless of the number of redeliveries, the customer can be billed only once.

    This last mechanism is the only way to ensure exactly-once semantics. The key insight here is that exactly-once semantics are not a technical requirement; they’re a business requirement. As a result, the only way to correctly implement them is in the business domain.
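    A minimal sketch of that application-level check, in Python. The `PaymentProcessor` class, its in-memory `processed` dict (standing in for a replicated store like ZooKeeper), and `charge_card` are all hypothetical. The sketch also ignores the window between charging and recording, which a real system would have to close, for example by handing the same message ID to the payment provider.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentProcessor:
    """Charges at most once per message ID, however often the message is redelivered."""

    # Hypothetical stand-in for a replicated store: message ID -> receipt.
    processed: dict = field(default_factory=dict)
    # Record of charges actually made, so the effect is observable.
    charges: list = field(default_factory=list)

    def charge_card(self, customer: str, cents: int) -> str:
        # Placeholder for the real payment call.
        self.charges.append((customer, cents))
        return f"receipt-{len(self.charges)}"

    def handle(self, message_id: str, customer: str, cents: int) -> str:
        # Look the message up before doing any work.
        if message_id in self.processed:
            return self.processed[message_id]  # already billed: do nothing
        receipt = self.charge_card(customer, cents)
        self.processed[message_id] = receipt
        return receipt  # the caller acknowledges the message only after this returns


processor = PaymentProcessor()
first = processor.handle("msg-1", "alice", 500)
second = processor.handle("msg-1", "alice", 500)  # simulated redelivery
```

    Both calls return the same receipt, and `processor.charges` holds a single entry, so the redelivery is harmless.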

    1. 4

      The whole article seems to hinge on a message broker as a single point of failure. I personally haven’t used a broker with that flaw in years.

      1. 2

        Even brokers like RabbitMQ are hard to configure to guarantee reliability and durability. Also, in those configurations, performance is greatly impacted.

        1. 1

          What about SQS?

          1. 1

            SQS is easier, but has a big problem: for many enterprises, it is out of the question due to being off-premises. For me, working in such an enterprise, we must always look for self-hosted solutions. Even if SQS is configured correctly, the fundamental performance bottlenecks/issues still stand.

            For an SOA, a broker introduces an SPOF that every message must route through. A better design is what amontalenti said below: a peer-to-peer connection system. I would additionally argue that an RPC mechanism like HTTP, combined with idempotent RPC actions and automatic retries, is even more robust than peer-to-peer queuing.
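            To make the "idempotent action plus automatic retries" idea concrete, here is a small Python sketch. It does not use a real HTTP client: `FlakyServer` is a hypothetical in-process stand-in for a service that sometimes loses its response, and the idempotency key scheme is an assumption, not something any particular framework prescribes.

```python
import uuid


class FlakyServer:
    """Hypothetical stand-in for an HTTP service whose responses can get lost."""

    def __init__(self, failures_before_success: int):
        self.failures_left = failures_before_success
        self.results = {}    # idempotency key -> stored response
        self.executions = 0  # how many times the action really ran

    def post(self, key: str, amount: int) -> str:
        # Run the action only the first time this key is seen.
        if key not in self.results:
            self.executions += 1
            self.results[key] = f"charged {amount}"
        # Simulate the response being lost after the work was done.
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("response lost")
        return self.results[key]


def call_with_retries(server: FlakyServer, amount: int, max_attempts: int = 5) -> str:
    key = str(uuid.uuid4())  # one key shared by every attempt of this call
    for _ in range(max_attempts):
        try:
            return server.post(key, amount)
        except ConnectionError:
            continue  # safe to retry: the server deduplicates on the key
    raise RuntimeError("gave up after retries")


server = FlakyServer(failures_before_success=2)
result = call_with_retries(server, 500)
```

            The client retries three times in total, but `server.executions` stays at 1 because every attempt carries the same key.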

      2. 4

        For me this article is a big “we didn’t know how to use message brokers, we had a problem due to misuse, the fault is the message broker”.

        For me, message brokers are no panacea, but having a CPOF is a consequence of how one designs an architecture, not of the message broker.

        I don’t know what other people’s experience is, but in the case of RabbitMQ, you can retry messages, you can use publisher confirms, and there are a lot of options for preventing unbounded buffers: https://www.rabbitmq.com/blog/2014/01/23/preventing-unbounded-buffers-with-rabbitmq/

        Also, I see criticism like “you have to configure them.” What software works out of the box, at scale, with its default configuration?

        People spend a lot of time configuring, for example, a MySQL installation, or any other DB, but when it comes to a message broker, I don’t understand why it has to “magically work”.

        NOTE: I work for RabbitMQ so of course I’m biased.

        1. 1

          We are using RabbitMQ because it’s pretty decent and has a lot of functionality you don’t want to have to write on your own, at least for an initial product. However, as we’re maturing, we are beginning to replace RabbitMQ with a more end-to-end solution. The biggest reason for this is that using a message broker hides a big layer of complexity: what if things go horribly wrong and you lose them, for whatever reason, even with clustering? Your catastrophe just got a lot worse, because you have to figure out what you need to resend.

          By pushing that responsibility to the end components, the need becomes clear.

          Getting rid of the brokers is actually not that problematic for us, because we need local queuing anyway in case we are unable to connect to the brokers. So this mostly becomes an exercise in swapping out the RabbitMQ client for our own.

          This isn’t a high priority for us (yet) because, as I said, Rabbit is actually pretty decent, but I think in the long run it doesn’t add much to a mature infrastructure and, if anything, hides nasty corner cases in one’s fault tolerance.

          1. 1

            I understand where you are coming from, but I think the original article is a case of using the wrong tool, or using a tool badly. Complaining that an MQ loses messages or retries messages due to a worker crash is basically saying that the MQ doesn’t provide message acks, or that the person doesn’t know how to use acks. Jumping from there to a conclusion like “don’t use MQs” is, I think, a bit far-fetched.

        2. 3

          He correctly identifies the problems with traditional “queue and worker” systems. Combining projects like Apache Storm and Kafka together gives you the semantics he recreated in the edges, but with the benefits of tunable parallelism, parallel deployments, computation graphs, and fault tolerance against machine or cluster failure. Moving work into the network layer (a more common approach than what he described might be to use ZeroMQ) does simplify things, and if you are willing to build your own reliability, deployment, parallelism and fault tolerance layers atop that, it can be a great choice. After all, that’s exactly what the Storm project is – a layer for reliability, deployment, parallelism and fault tolerance atop ZeroMQ!