  2. 28

    This article completely ignores the two real arguments for queues:

    • they free you from having to have 100% uptime of all services to avoid data loss
    • they mean that, if you’re willing to deal with increased latency, you only need enough capacity to handle your average load instead of your peak load
    1. 6

      No, it doesn’t. Those are both highly solvable without a gigantic, error-prone piece of middleware crap.

      If you don’t want data loss, put your events in a real, dedicated, durable data store, not the hokey nonsense durability “provided” by pretty much any queue daemon. If your service isn’t up to receive the notification, it doesn’t really need that notification right away does it? When it does come back up, it loads any older notifications stored in a real datastore that will definitely still be around because you stored them in a real datastore.
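
      A minimal sketch of that recovery pattern (hypothetical names; SQLite stands in for whatever real datastore you actually trust):

      ```python
      import sqlite3

      # Durable event log in a real datastore (SQLite here for brevity;
      # substitute whatever proven storage layer you actually run).
      db = sqlite3.connect("events.db")
      db.execute(
          "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
      )

      def catch_up(last_seen_id, handle):
          """On startup, replay any notifications missed while we were down."""
          for event_id, payload in db.execute(
              "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
              (last_seen_id,),
          ):
              handle(payload)          # process the missed notification
              last_seen_id = event_id  # the checkpoint belongs in durable storage too
          return last_seen_id
      ```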

      Any data not stored in a proven durable storage layer is subject to loss. For example, RabbitMQ: “in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor.”

      Pretty much all queue middleware has this issue. Why? Because getting correct exactly once semantics in a distributed datastore is hard. So instead of assuming any web hipster scrub can write consistent distributed software, use a distributed data store with real transactions (or whatever is acceptable to your application). And to send messages between a large number of producers and consumers—more messages than would fit through one machine—send the messages directly N to M rather than artificially adding a bottleneck by introducing middleware junk.

      FWIW, this is how Google handles pubsub almost universally.

      1. 8

        > Pretty much all queue middleware has this issue. Why? Because getting correct exactly once semantics in a distributed datastore is hard.

        It is not only hard, but impossible. http://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

        > Any data not stored in a proven durable storage layer is subject to loss. For example, RabbitMQ: “in the presence of partitions, RabbitMQ clustering will not only deliver duplicate messages, but will also drop huge volumes of acknowledged messages on the floor.”

        I have lots of love for RabbitMQ, but it isn’t a backpressure queue; it’s a fast messaging system. Any attempt to use it otherwise (e.g. as a log buffer for the ELK stack) is bound to bring tears. RabbitMQ shines when clients are quite aware of these behaviours. Any distributed system will lose data at some point.

        > If you don’t want data loss, put your events in a real, dedicated, durable data store, not the hokey nonsense durability “provided” by pretty much any queue daemon.

        Kafka is such a store, with queue semantics. It saves us from disaster regularly.

        Picking the wrong queue or using them where not needed doesn’t make queues bad.

        1. 1

          Assuming Kafka does perfectly handle all of that, I am still not convinced there is any reason to use it over your primary datastore. It sounds like extra operational overhead, complexity, and hardware spend to me.

          Unless you’re using Kafka as your primary datastore. I probably could be sold on that architectural decision in very specific circumstances.

        2. 7

          How would you implement queue semantics in a durable data store? Can you get performance that is close to RabbitMQ or Kafka? I mean, doing queues in Cassandra is pretty much an anti-pattern.

          1. 1

            > And to send messages between a large number of producers and consumers—more messages than would fit through one machine—send the messages directly N to M rather than artificially adding a bottleneck by introducing middleware junk.

            The durable datastore is only for recovering undelivered messages in bulk if you weren’t available to receive them for some reason.

            So yes, you can get performance close to RabbitMQ or Kafka: just deliberately write slow code to mimic the effect of routing your messages through an extra service. Unless RabbitMQ and Kafka have 0% overhead and the machines hosting them have infinite bandwidth, direct delivery comes out ahead.
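
            For what it’s worth, the direct-send path is only a few lines (a sketch with hypothetical names; assumes receivers expose plain HTTP endpoints and replay from the durable store on recovery):

            ```python
            import requests

            def publish(event, receiver_urls, store):
                """Send directly N to M; the durable store exists only for recovery."""
                store.append(event)  # durable write first, so the event is never lost
                for url in receiver_urls:
                    try:
                        requests.post(url, json=event, timeout=2)
                    except requests.RequestException:
                        # Receiver or network is down; it will load this event from
                        # the durable store when it comes back up, so just move on.
                        pass
            ```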

      2. 31

        The OP sets up a straw man, complete with mythical Friedman Cab Driver, and then does an abysmal job of knocking it down.

        The problem with Fig 2 is that in exchange for the simplicity, you have lost all decoupling between the components. This means that

        1) the sending service has to have intimate knowledge of all of the receiving services at time of send.

        2) all receivers have to be up, accessible, and immediately ready to accept a message.

        The problem with 1 is one of organizational and technical scaling: if all message queue participants have to be known a priori, then bringing up more receivers to handle increased load is hard, and bringing on new teams and new handlers is hard.

        The problem with 2 is purely technical. Not all networks, senders, and receivers are up all the time, but frequently the business wishes for its messages to survive common minor and temporary technical accidents. So if a receiver dies, or a network switch dies, what is the sender to do? The usual, most correct answer is to try again.

        But in order to try again, the sender must keep the message around, and possibly maintain relative message ordering. In order to maintain relative message order over a period of time, it’s necessary to buffer several messages.

        But it would be silly and wasteful and error-prone for each and every application to implement its own ordered buffer of messages, and besides, then every application would have to deal with transactionality, permanent storage, and message identifiers; so perhaps the act of buffering should be abstracted out into a small, specialized system dedicated to those functions, and we could communicate with that component using a standard protocol that hides the messy implementation details from the sending applications. I wonder what we would call such a software system.
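
        To make the punchline concrete, the per-application buffer described above might look like this minimal sketch (hypothetical names; `send` is whatever delivery function the application already has):

        ```python
        import collections
        import time

        class RetryBuffer:
            """Holds unacknowledged messages in send order and retries delivery."""

            def __init__(self, send):
                self.send = send                    # attempts delivery, raises on failure
                self.pending = collections.deque()  # preserves relative message order

            def publish(self, msg):
                self.pending.append(msg)
                self.flush()

            def flush(self):
                while self.pending:
                    try:
                        self.send(self.pending[0])  # always retry the oldest first
                    except ConnectionError:
                        time.sleep(1)               # back off; keep the message buffered
                        return                      # order is preserved for next flush
                    self.pending.popleft()          # delivered, safe to drop
        ```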

        1. 7

          A few comments on your comment:

          • Your comment leaves out that you’re putting all your eggs in one queue: if that goes down, everyone goes down. So if not losing a message really is very important to you, everyone needs to implement sender-side store-and-forward anyway, and the queue is not saving you anything in that case.
          • For your point 1, I assume you are talking about service discovery. A queue might make this simpler; however, all senders and receivers still need to agree on the name of the queue to use, so there is some service discovery going on. For a large organization, this isn’t necessarily simpler than any other form of service discovery.
          • There are also a large number of use-cases where discarding the message on failure is acceptable. Keeping my web request around for more than 10 seconds is not helping the situation.
          • Depending on the queue you are using, it can be an incredibly complicated piece of software for everyone to depend on. RabbitMQ is a beast and its HA options aren’t even that good. If you trust it, it’s fine, but if complexity concerns you, a queue is concerning.
          • There are some possible consistency issues with a queue, for example reading your own writes. If you’re used to eventually consistent (EC) systems, it probably isn’t a big problem.

          As with most things, it depends on the problem. I used RabbitMQ for quite a while for a large project a few years ago and ended up deciding it added very little value for our particular use-case. I’m very skeptical of anything that looks like RPC over an MQ, though.

          1. 4
            • you do have to worry about connecting with the queue service, granted. But you can make the queue service independently resilient if it’s a separate service, and focus engineering attention on ensuring high availability in that one service rather than in every sender. That way, a queue-service failure at the sender becomes a rare, highly exceptional case.

            • service discovery is part of the problem, but also dynamic service addition, subtraction and change. If I send a message to a queue, I don’t have to care about the numbers, or the kinds, or the numbers of kinds, of programs looking at the message on the backend. This abstraction is key when you have a number of autonomously operating teams that are working on the same data stream, which is frequent.

            • discard on failure can be fine, but ‘at least once’ is usually more like what the customer wants, in my experience.

            • queues can be pretty complex, but even little queues like Redis and NSQ can be very useful at the low end. Certainly if you’re diving into service-oriented architecture and deploying microservices, you should be at least operationally strong enough to operate any queue service, because that will turn out to be the least of your problems.

            • you do need to understand your consistency and availability guarantees – but this is not unique to queues, it’s fundamental to any distributed system, including the HTTP RPC approach the OP tried to outline.

            Sorry you had a bad experience with RabbitMQ. It’s not great if you try to use its built-in persistence, I’ll completely agree, even as an erlang-head. But any queueing system is better than any non-queueing system when you need to scale organizationally or volumetrically, especially where load behaves like real-world load.

          2. 2

            Mostly I agree with you…

            But I have noticed an anti-pattern in the world of pub-sub.

            > the sending service has to have intimate knowledge of all of the receiving services at time of send

            a) The sender knows that something might be receiving, otherwise why is it bothering?

            b) The sender has to send something of value, and value is defined by the receiver, not the sender.

            c) In some systems I have seen, pub-sub is short for “I don’t know what I’m doing and it’s all too complicated and I can’t think this through end to end.”

            Usually when I see this anti-pattern, I’m looking at that system because the end-to-end performance is utterly shite and somebody has asked me to fix it.

            When that anti-pattern exists, I work out what the big picture is, work out a proper layering of what should be ignorant of what and what needs to know about what in order to function, and where the abstract interfaces should be.

            1. 1

              If there’s not some form of general-purpose data format shared across the board for publishers and subscribers, it’s pretty much useless; that takes care of A and B. Of course you’ll have variations depending on the actual “message”, but the payload should be accompanied by sufficient metadata for services to determine what’s useful to them.
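
              For instance, a hypothetical envelope where the metadata travels alongside the payload (all field names made up for illustration):

              ```python
              # Hypothetical envelope: the metadata lets each subscriber decide whether
              # the payload is relevant without understanding every publisher's schema.
              envelope = {
                  "type": "order.created",       # what kind of event this is
                  "version": 2,                  # schema version of the payload
                  "source": "checkout-service",  # which service published it
                  "timestamp": "2016-04-02T09:26:53Z",
                  "payload": {"order_id": "abc-123", "total_cents": 4200},
              }
              ```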

              In the case of C, I can’t reason around “I’m gonna need a queue” without thinking “because I’m going to have more than a few services that need to perform tasks asynchronously depending on the messages in the queue, without necessarily having to call back to the originating service otherwise than by throwing a message back on the queue.”

              That might be a gross oversimplification, and I might be doing it SO WRONG; in either case I welcome the discussion. I think if there weren’t a use case for queues, people would stop trying to build them. Coincidentally, when all you have is a hammer, everything starts looking like a nail (or when you’re determined to use a queue no matter what, you’re probably going to fuck up badly, pardon my French.)

              1. 1

                Queues share the same problem as OO design: they’re an easy-to-screw-up, high-cost abstraction in most implementations. This is one of the reasons why Erlang and the actor model are so good: you get the queue abstraction’s benefits with a significantly lower implementation cost, built into the philosophy, the language and the runtime.

          3. 7

            This was actually a great argument, despite the (imo) off-putting title. tl;dr - Don’t use queues for communication between services. Use HTTP for that, and use queues for distributing workload within a service when it does a lot of parallel processing.

            1. 3

              I’m assuming that the overall point of this is the fairly obvious conclusion that messaging queues are not the end-all-be-all magical solution to interprocess communication, but boy oh boy is the initial presentation trashy clickbait.

              1. 2

                I don’t use this generic enterprise service bus pattern with queues. Some orgs do, but I’ve not been there.

                What I tend to do is use it as a replacement for HTTP when I don’t need a response: the queue service is HA with durable messages (SQS, if you’re curious), and the consumers are unreliably there. Messages get axed when correctly processed, or sent off to a dead letter queue after a set amount of time. Generally I strive for idempotent operations as a matter of design, just in case, though of course you can’t always do that. The key idea is that it allows a buffer for spikes, as well as for transient failures on the consumer side. There are drawbacks, of course: the added complexity is not always trivial in this design, but it’s far simpler than the article’s pubsub system. And while it centralizes a lot of things into a single point of failure, it allows the contrarian advice of “putting all your eggs into one basket, then watching that basket very hard”.
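
                A sketch of that consumer loop with boto3 (the queue URL and the `already_processed` idempotency guard are hypothetical; the dead letter routing itself is configured on the queue via its redrive policy):

                ```python
                import boto3

                sqs = boto3.client("sqs")
                QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work"  # placeholder

                def consume(already_processed, handle):
                    while True:
                        resp = sqs.receive_message(
                            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
                        )
                        for msg in resp.get("Messages", []):
                            if not already_processed(msg["MessageId"]):  # idempotency guard
                                handle(msg["Body"])
                            # Delete only after successful processing; otherwise the message
                            # reappears after the visibility timeout and, after enough
                            # failed receives, lands in the dead letter queue.
                            sqs.delete_message(
                                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
                            )
                ```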

                1. [Comment removed by author]

                  1. 3

                    Do you include HTTP/2 in this? gRPC, for example, runs over HTTP/2 with protobufs, and is growing in popularity.

                    1. 3

                      What protocol and/or servers would you recommend as an alternative for this kind of architecture (outside of the aforementioned message queues)?

                      1. 2

                        Thrift. There are other options (e.g. gRPC is a perfectly respectable choice), but Thrift is the most mature, and I haven’t seen anything else offer a compelling advantage over it.

                        (Thrift does have an HTTP transport for when you need that, but that’s an implementation detail; the important thing is that you parse using the Thrift libraries.)

                      2. 5

                        HTTP has (by far) the best caching support, the best chance of making it through any given firewall, the most development tooling, and it’s the best-understood protocol by the average engineer.

                        But ok. I won’t use HTTP because it’s the JavaScript of protocols.

                        1. [Comment removed by author]

                          1. 1

                            I have worked on enough HTTP APIs to know that HTTP isn’t usually the source of the practical problems they face.

                            It’s a great default choice. Once an API starts to outgrow HTTP, it’s time to look at other protocols and offer one or more of them alongside HTTP, but most APIs will never get to this point.

                            (To answer your specific questions: HTTP does retries exactly as well as TCP, and HTTP verbs imply whether a request can be retried safely; pagination is out of scope for HTTP, and also for Thrift, Protobuf, Cap’n Proto, etc.; search (and other RPC) is easy to model in HTTP: use a POST request and don’t worry about it.)
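
                            For example, a hypothetical search endpoint modeled as a POST, with pagination handled as an application-level cursor:

                            ```python
                            import requests

                            # Search as a plain POST: the query goes in the body, and the
                            # cursor is an application-level concept, not an HTTP one.
                            resp = requests.post(
                                "https://api.example.com/search",  # hypothetical endpoint
                                json={"query": "message queues", "cursor": None, "limit": 50},
                                timeout=5,
                            )
                            results = resp.json()
                            ```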

                      3. 2

                        Hi, do you want to guarantee that a message reaches its destination? Then some sort of queue/message-bus approach is what you need. It doesn’t have to be a pub-sub pattern. But it can’t be just raw HTTP.

                        Yes, a lot of message queue solutions are over-engineered and more complicated than they need to be for the problem at hand, but there’s a reason why they exist.

                        1. 2

                          How does a MQ help guarantee that a message reaches its destination? The MQ can go down.