I really like the animations in this article, and I like that this article is targeted for people who are new to the area, but it’s missing some important caveats.
RE: least connections, if a server starts responding very quickly with errors then the load balancer will “black hole” traffic because this server is suddenly very successful and beats all other servers. Common mitigations are load-balancer health checks to hosts, and some system sitting outside the load balancer and hosts monitoring for anomalous server-level behavior. You need to think about error handling and health checking when load balancing.
RE: random, I suggest checking out two-choices load balancing: https://www.eecs.harvard.edu/~michaelm/postscripts/mythesis.pdf. It is a surprising result. The downside of this approach is that the article then becomes more complex because you need to consider state from servers and the staleness of that state. I was very surprised at 1) how bad random is, 2) how optimal two-choices is in comparison to more-than-two-choices.
Thanks for the kind words!
You’re totally right, it is missing this important caveat (and a few others!). It was tricky to decide what to include and what to cut. I had made attempts at adding errors (as well as server slow-start) to the simulations and ended up cutting them because I couldn’t figure out a nice way to visualise them, and it added a lot of extra complexity to the simulations.
It’s my first go at using PixiJS, I’m hoping to get a lot better over time and when I am I might revisit this and see if I can cover all the awesome stuff people have suggested I should :)
The SRE book chapter on load balancing talks about this problem as well. Lots of good lessons condensed in there!
“2-random” is also a pretty nifty cache eviction algorithm. There are probably other domains in which this approach works well, too.
For the cache eviction case, basically: you just pick two random cache entries entries and evict whichever of that pair had the worst recency/frequency. It’s not as theoretically-optimal as methods with more-complex bookkeeping, but in the real world it’s a pretty good trade:
Couldn’t you just monitor the 500 responses and mark a server as broken ? (Obviously the same goes for timeouts.)
That would be passive health checks; in other words that the load balancer takes note of the responses it get from the backend before passing it along.
The other way is active health checks where the load balancer sends a specific request to each backend servers at a regular interval.
Both have pros and cons. You could also combine them.
The number of 500 status responses I see on the daily means not enough people understand the content of this blog post. Great blog, hope people read it and we see some change.
One nuance that’s missing here is that most load balancing uses a few different tiers.
I really like this article about an example four tier highly available app tier load balancing architecture:
DNS->IP->TCP->HTTP is pretty evergreen, even though this post is half a decade old.
At my previous job, load balancing was done in a unique way. We had a frontend  take requests, pull out the required info, and make a request to the backend. The frontend knew about all the backend servers and would do its own load balancing. The backend, as part of its reply, would indicate a “loading factor,” and when said “loading factor” got too large, the frontend would take that particular server out of rotation. The frontend would then start sending a simple health check on that particular backend, and when the backend returned a good “load factor” it would be returned to the pool of available servers. It worked remarkably well for us.
 Well, two frontends—one that talked SS7, and one that talked SIP. We were doing telephony stuff.
It’s preferred to estimate load at the load balancer (like the connections strategy), because load factor reported by the server is delayed by response latency. This delay is dangerous and you may get the bullwhip effect or a stampede, e.g. the server says it’s free, so gets more traffic, and before it has a chance to say that it’s busy now, it already has excessive traffic thrown at it.
One factor is that from the front end receiving a request to it replying was under a tight deadline , and a backend server would be penalized for timing out. Second, there were, if I recall correctly, 4 frontends (since they did very little work), and 64 backends (which implemented the business logic), and the backend selection was round robin (skipping ones deemed to loaded too handle a request). I don’t recall there ever been a bullwhip effect seen in the 12 years I was there. Also, it avoided yet another component (a separate load balancer) that could potentially fail.
And remember, every backend response would include the loading factor, and a “high load, back off” value was configurable (I forgot to mention).
 When I started, the front end had 3 seconds to respond to the request (if the backend timed out, we sent a “no data” response). Later on, that was tightened to 1.5 seconds, so there was an upper limit on the latency between the frontend and backend.
Reminds me of this oldie but goodie:
so pretty :)
This is amazing. One thing not talked about is difficulty of implementation: cost to compute the metric and how well it scales across multiple routers.
Assuming you don’t want global locks and you’ve got multiple entry points to each server (distributed routing) then some of the things like round robin end up devolving to random routing at scale.
The Heroku router on common runtime (private and performance dynos get a dedicated router) is distributed and it’s really hard to beat random as synchronization and bookkeeping costs become non-trivial.
I understand that not many container clusters need to operate on “Heroku scale” so perhaps it’s a bit of a niche tidbit. Still I wanted to mention it. Again, great article.
Thank you so much!
I very, very intentionally glossed over the cost of making load balancing decisions. It’s entirely because they were challenging to quantify and visualise individually. This played in to the reason to not show the “power of 2 choices” algorithm, because one of the things it does well at optimising is the cost of deciding where to send a request.
I’ve no doubt they could be visualised, but I couldn’t personally figure it out in a way that meshed well with my overall goal of exposing people to a set of load balancing algorithms of increasing complexity.