I think there’s one interesting property queues introduce to systems, and the article seems to take it for granted. Without queuing, your capacity (let’s say measured in requests/s) must be greater than or equal to your peak traffic. With queuing, your capacity merely has to be greater than or equal to your average traffic. If you’re in an industry where your average traffic is well below your peak traffic, queues might be a win.
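A toy back-of-the-envelope sketch of that property (all numbers hypothetical):

```python
# Toy interval-by-interval sketch: traffic bursts to 150 req/s but
# averages 100 req/s, while capacity is only 120 req/s.
traffic = [150, 150, 50, 50] * 6   # mean = 100 req/s
capacity = 120                     # req/s
backlog = peak_backlog = 0

for arriving in traffic:
    # Work the queue absorbs this interval; it drains when traffic dips.
    backlog = max(0, backlog + arriving - capacity)
    peak_backlog = max(peak_backlog, backlog)

# Because capacity >= average, the backlog stays bounded and drains to zero;
# without the queue, every burst interval would have dropped 30 req/s.
print(backlog, peak_backlog)  # 0 60
```

The backlog peaks at 60 request-intervals mid-burst but fully drains during each lull, which is exactly the average-vs-peak trade the comment describes.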
Another important aspect is separation of concerns. If you can keep your frontend from even needing to know about all the downstream databases and API calls that need to be made, you’ve not only introduced better logical decoupling, but there’s a good chance you’ve improved the availability of your frontend. It’s relatively easy to keep workers from getting overloaded, since they pull work rather than having it pushed onto them.
However, at the end of the day, if your average traffic exceeds your capacity, you’re going to be in trouble. Queuing can buy you time (hours, or a small number of days) to increase capacity. You’d better think twice, and then think twice again, before you tell a customer: “Sorry, we’re not good enough at our job. Send your transactions elsewhere.”
The property you mention only holds as long as you are allowed extremely high latencies on the responses that get queued up. Most systems have a quality threshold before that point: either timeouts will fire first, users will give up (or get disconnected), or they will perceive your quality of service as low and unacceptable.
While the queue will keep the system up under your average load, it won’t keep its quality of service at an acceptable level. In fact, having a queue may end up ruining the quality of service for requests that arrive during the slow parts of your day, in a fashion similar to bufferbloat: because the peak-load tasks are still in the queue, they take precedence over the off-peak tasks. Those tasks end up feeling the slowdown that accumulated over the peak period, and in the end, everyone has a worse experience.
This can be solved in two ways: replace the queue with a stack (you reduce the overall per-task latency, but the oldest tasks absorb all of the accumulated delay – that’s how phone support systems work, and why people who have already waited long tend to wait even longer), or increase the capacity available at any given time.
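A small simulation of that trade-off (timings hypothetical): with FIFO the off-peak task waits behind the whole burst, while with LIFO (the stack) it is served quickly but the oldest burst task absorbs all the accumulated delay.

```python
arrivals = [0, 0, 0, 0, 0, 5]   # five burst tasks at t=0, one off-peak at t=5
SERVICE = 2                     # time units per task

def latencies(lifo):
    pending, done = [], {}
    t = i = 0
    while len(done) < len(arrivals):
        # Admit every task that has arrived by the current time.
        while i < len(arrivals) and arrivals[i] <= t:
            pending.append((arrivals[i], i))
            i += 1
        if pending:
            arr, tid = pending.pop(-1 if lifo else 0)  # LIFO pops the newest
            t += SERVICE
            done[tid] = t - arr   # completion minus arrival = latency
        else:
            t = arrivals[i]       # idle until the next arrival
    return done

fifo, stack = latencies(False), latencies(True)
# FIFO: the off-peak task (id 5) inherits the burst’s backlog.
# LIFO: task 5 is served fast, but burst task 0 waits longest of all.
```

With these numbers, the off-peak task’s latency drops from 7 to 3 under LIFO, while the oldest burst task’s latency grows from 2 to 12 – the phone-support effect in miniature.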
The question of using a queue to handle temporary overload is still a complex one when you have to care about quality of service. Maybe you won’t get to tell the customer “Sorry, we’re not good enough at our job. Send your transactions elsewhere”, and they’ll instead tell you “Sorry, your service is too slow, we’ll send our transactions elsewhere.”
I think the implications of shedding load vs. back pressure vs. high latency depend on your use case, but I agree that you basically have to pick one. In my ad-tech work, we queue conversions for processing. Shedding load would imply data corruption. Back pressure would imply slowing down the customer’s e-commerce site and breaking their checkout process. High latency merely means that we’re delayed in reporting conversions.
Sometimes high latency is the least-bad choice.
High latency can be a way of implementing back pressure, if your upstream’s load balancer is sensible. Shedding load can also be mixed with back pressure, via things like circuit breakers, or when your service does more than one thing. For example, if your service optionally returns many kinds of search results but must collate the results it does return, you can shed load by dropping some of the queries to downstreams, and implement back pressure by slowing down as much as necessary to collate the results you keep.
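As a sketch of the circuit-breaker flavor of this (the class, thresholds, and names are illustrative, not from any particular library):

```python
import time

class CircuitBreaker:
    """Minimal sketch: shed load by failing fast once a downstream
    has produced too many consecutive errors."""

    def __init__(self, threshold=3, cooldown=5.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds to stay open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Open: shed immediately without touching the downstream.
                raise RuntimeError("circuit open: load shed")
            self.opened_at = None    # half-open: let one request probe
            self.failures = 0
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0            # any success resets the count
        return result
```

The point is the mix: failures propagate back pressure upstream (callers see errors and can slow down), while the open state sheds work outright instead of queueing it.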
Fundamentally, back pressure is just signaling to your upstream that you’re having trouble, and you can do that while also throwing away work if you want.
Sometimes, yes. I also used to work in ad stuff (RTB), where if you’re analyzing incoming bid requests, you’re better off shedding load ASAP, because you have 100 ms (including network time) to make a decision before risking a penalty from the ad network.
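A minimal sketch of that deadline-driven shedding (the handler, the budget split, and the analysis estimate are all hypothetical):

```python
import time

DEADLINE = 0.100  # 100 ms total budget, per the ad-network constraint

def handle_bid(request, started, analyze, no_bid):
    """Hypothetical handler: no-bid the moment the remaining budget
    can't cover the expected analysis cost, instead of queueing."""
    EXPECTED_ANALYSIS = 0.030               # assumed worst-case analysis time
    remaining = DEADLINE - (time.monotonic() - started)
    if remaining < EXPECTED_ANALYSIS:
        return no_bid(request)              # shed ASAP: a late bid is a penalty
    return analyze(request)
```

Note that the queue never enters the picture: any work that can’t finish inside the deadline is worthless, so dropping it immediately is strictly better than delaying it.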
There’s no one-size-fits-all solution there, of course. But in the case you mention, your allowed latency is bounded by your memory size. If you let the queue reach that size, you will lose everything in it – or, if it’s a persistent queue, everything that hits your service while it’s down because it ran out of memory.
Back pressure is the name of the game when it comes to queueing. I’m interested to see how the Reactive Streams project turns out.
In the systems I work on, fixed-length blocking queues are prevalent. They work well in two ways: 1) when the queue is full, adding to the queue blocks the caller, providing back-pressure; and 2) depending on the type of work, a worker can drain N items from the queue to process in batch, which improves throughput.
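In Python terms, both properties fall out of the standard library’s bounded `queue.Queue` (the drain helper below is a sketch, not a stdlib API):

```python
import queue

# maxsize bounds the queue: a full queue makes put() block, which is the
# back-pressure; workers then drain up to max_items per batch.
q = queue.Queue(maxsize=100)

def drain_batch(q, max_items):
    """Block for the first item, then grab whatever else is ready."""
    batch = [q.get()]                      # blocks until work exists
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())   # non-blocking: take what's there
        except queue.Empty:
            break
    return batch
```

Blocking on the first `get()` keeps idle workers cheap, while the non-blocking drain naturally produces larger batches exactly when the queue is backed up – throughput improves under load, which is when you need it.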
The red arrow (and the red arrow behind it once the first one is addressed, and the one behind that…) is the only thing that should ever be addressed for optimization purposes. Nothing else matters if throughput is capped.
The red arrow should be spotted and addressed first. If, from then on, you test your system and can never get the throttle on the red arrow to trigger, there are one or more potential bottlenecks above it in the stack, and those should then be addressed in order to reach your optimal throughput.
EDIT: it seems you edited your own post to account for that while I was writing mine, disregard this.