1. 28
  1.  

  2. 5

    I work on a library called Finagle that has spent roughly the last ten years plumbing this area, and you’re right that it’s still a pretty active area of research! I think academia used to be interested in this area ~30 or 40 years ago, but it’s no longer very fashionable.

    I think Finagle has pretty neat approaches to many of the problems you’ve laid out, so I’ll describe a few of them. In many cases we’ve come to similar conclusions as you, and in some we’ve come to very different conclusions. Feel free to reach out on the mailing list if you have more questions.

    Health Measurement: Finagle’s load balancer balances over a host abstraction, which has the concept of “status”. We keep track of whether an instance is “open”, “busy”, or “closed”. If an instance is “busy” or “closed” we assume it won’t serve any requests, and don’t route to it. At a different layer, we measure failure both from a connection point of view (if we’re unable to establish a connection, we use an exponential backoff before attempting to reconnect, and fail connections proactively in the meantime) and a request point of view (you can specify a failure accrual policy; we have “consecutive” and “proportional” failure accrual policies). We also do connection healthchecking, but its benefits are sometimes unclear. You can also be made “busy” if your remote peer issues you a GOAWAY, or if your lease expires. We’ve also investigated whether we can have a continuous “liveness” measurement instead of the binary “healthy/unhealthy”, but we haven’t done the work to switch over to that yet, although our experiments suggest it’s pretty fruitful.
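
    To make the failure accrual part concrete, here’s a rough sketch of a “consecutive failures” policy (the names and API are illustrative, not Finagle’s actual API):

    ```scala
    import java.util.concurrent.atomic.AtomicInteger

    // Marks a peer "busy" after `threshold` consecutive failures and revives it
    // after a backoff. A real implementation would also want jittered backoff.
    final class ConsecutiveFailureAccrual(threshold: Int, backoffMs: Long) {
      private val consecutiveFailures = new AtomicInteger(0)
      @volatile private var busyUntil: Long = 0L

      def recordSuccess(): Unit = consecutiveFailures.set(0)

      def recordFailure(): Unit =
        if (consecutiveFailures.incrementAndGet() >= threshold)
          busyUntil = System.currentTimeMillis() + backoffMs

      // The balancer skips peers that report "busy".
      def isBusy: Boolean = System.currentTimeMillis() < busyUntil
    }
    ```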

    Load Balancer Metrics: We’ve done quite a lot of research here, mostly focused on least-loaded and latency-based metrics. We’ve also implemented p2c. We haven’t seen huge issues with the mobbing behavior you’re describing, perhaps because we don’t see significant differences in latency, and because we use p2c, so we don’t converge as quickly as you might with a more exact strategy. In general, we’ve seen very good results with a peak exponentially weighted moving average latency approach, which we call “Peak EWMA”, and you can check it out here. We collaborated with linkerd, a company that was originally built around “Finagle as a Service”, to compare the behavior of a few different load balancers.
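
    For readers unfamiliar with p2c, the core of it is tiny. A minimal sketch (names are illustrative), assuming each node exposes some load estimate such as outstanding requests or a Peak EWMA latency:

    ```scala
    import scala.util.Random

    final case class Node(address: String, load: () => Double)

    // Pick two distinct candidates uniformly at random and route to the less
    // loaded one. Needs at least two nodes.
    def pickP2C(nodes: IndexedSeq[Node], rng: Random = new Random()): Node = {
      val i = rng.nextInt(nodes.size)
      var j = rng.nextInt(nodes.size - 1)
      if (j >= i) j += 1 // shift so the second pick is distinct from the first
      if (nodes(i).load() <= nodes(j).load()) nodes(i) else nodes(j)
    }
    ```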

    Recently we’ve been focusing on a strategy we’re calling aperture, where we talk to just a subset of remote peers instead of the whole cluster. This has significant advantages when the cluster you’re talking to is enormous, but it means that you now have two load balancing problems: one of picking which hosts to consider, and a second one of picking which host within that subset should serve a given request.
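
    A very rough sketch of the subsetting half of that (real aperture sizing and coordination is considerably more involved than this):

    ```scala
    // Pick a window of `width` hosts for this client (assumes a non-negative
    // client id), then balance within that window with p2c or similar.
    def apertureSubset[A](cluster: IndexedSeq[A], clientId: Int, width: Int): IndexedSeq[A] = {
      val start = (clientId * width) % cluster.size
      IndexedSeq.tabulate(width)(i => cluster((start + i) % cluster.size))
    }
    ```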

    We’ve considered having servers advertise a load metric to the client, but haven’t invested heavily in it, because what we have works pretty well so far.

    Load Shedding: We were able to opt every client at Twitter into load shedding by ensuring that we shed load gently, and that we only do it when servers are rejecting traffic at high volumes.

    Some prior art that might be interesting to you: at Spotify, and at Netflix. If you want to learn more about Finagle clients, take a look here.

    I thought many of your conclusions were sound, but one sticks out to me as worrisome, which is using weighted random selection. My concern is that if all of your weights are the same, it devolves to uniform random selection, which may converge to a uniform distribution on a long enough time horizon, but the per-host request count is actually a binomial distribution, and the unevenness will be obvious on a short time horizon, since you’ll always have uneven instantaneous load. I instead encourage you to use p2c, and use the weights that you would have assigned in weighted random selection as the load metric, which I believe achieves the same thing but converges much faster, so that instantaneous load won’t be wrong.
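
    In code, that suggestion is roughly this (an illustrative sketch, where weights(i) is whatever weight you would have fed to weighted random selection):

    ```scala
    import scala.util.Random

    // Pick two distinct candidates at random and route to the higher-weighted
    // (healthier / less loaded) one. Needs at least two nodes.
    def pickWeightedP2C(weights: IndexedSeq[Double], rng: Random = new Random()): Int = {
      val i = rng.nextInt(weights.size)
      var j = rng.nextInt(weights.size - 1)
      if (j >= i) j += 1 // ensure two distinct candidates
      if (weights(i) >= weights(j)) i else j
    }
    ```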

    1. 3

      Oh wow, this is great! Finagle (and the other resources you linked) somehow never came up in my searches. I’ll have to read up. :-)

      p2c is two-choice, right? It’s a pet peeve of mine when people call it the “power of two” method (nothing is being raised to a power!) and “p2c” seems like a reasonable name—I’ll have to start using that, especially if it’s more common. Interesting to hear that you haven’t seen much in the way of mobbing. I’d expect that with p2c (as you said, convergence is slower), but I’m curious if you’ve seen it with other methods. I wasn’t sure how realistic a concern it was.

      I share your concern about weighted random selection. In my test runs I wasn’t able to get good information about high-frequency variation in weights, but we’re doing a dark launch of an algorithm using weighted random selection and will be collecting metrics on the instantaneous ratio of the highest and lowest weight under actual production circumstances. The weights are multi-factorial, being the product of 4 health factors, each derived from one of latency (inverse of a decaying exponential average), success (decaying exponential average), concurrency (inverse), and connection age (ramp up from epsilon to 1 over the first minute). The success factor is raised to the power of 4 or 5 to give it more effect on the weighting. (I’m sure this can all be heavily optimized to avoid the floating point math, but that’s not our bottleneck.) In the trial runs, WRS did fantastically well, but I do have some concern about high frequency variation in the weighting. I’m OK with flapping if it’s solely due to the very coarse-grained concurrency factor, since that has near instant feedback, but I’m less sure how it will interact with the others. I won’t be able to gather high granularity data on weights, but if the max weight disparity metric isn’t too large then I don’t think it’s cause for concern, at least for our use-case…
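
      For what it’s worth, the weighting described above looks roughly like this (a sketch with made-up names; the clamping and epsilon constants are placeholders, not our actual code):

      ```scala
      import scala.math.{max, min, pow}

      final case class HostStats(
        latencyEwmaMs: Double, // decaying exponential average of latency
        successEwma: Double,   // decaying exponential average of success rate, in [0, 1]
        inFlight: Int,         // current number of concurrent requests
        ageSeconds: Double     // time since the connection was established
      )

      def weight(s: HostStats): Double = {
        val latencyFactor     = 1.0 / max(s.latencyEwmaMs, 1.0)        // inverse of the latency average
        val successFactor     = pow(s.successEwma, 4)                  // raised to a higher power for more effect
        val concurrencyFactor = 1.0 / (1.0 + s.inFlight)               // inverse of concurrency
        val ageFactor         = max(0.01, min(1.0, s.ageSeconds / 60)) // ramp from epsilon to 1 over the first minute
        latencyFactor * successFactor * concurrencyFactor * ageFactor
      }
      ```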

      The other thing I worry about with WRS is that when I derive the weight factors from the raw metrics, I “stretch” the numbers. For example, I want there to be a large difference between the weights of servers with 99% and 95% success rates, much larger than a 10% difference. Most of the requests should go to the 99% server. But if all the servers are suffering a bit (80%, 85%, 90%), I don’t necessarily want all the weight to be on the one that’s only marginally better (90%) even though that’s exactly what I want to have happen in a 99%, 99%, 95% scenario. I experimented a bit with taking the log of the failure rate and grouping hosts into “nines” buckets, but didn’t get anything satisfying. WRS is kind of a blunt instrument in this regard, and I feel like I could do a lot better with a more direct anomaly detection algorithm, perhaps one that explicitly checks for outliers.
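
      For concreteness, the “nines” idea I experimented with is roughly a log transform like this (illustrative only; as I said, I didn’t find a version of it I was happy with):

      ```scala
      import scala.math.{log10, max}

      // 90% -> 1, 99% -> 2, 99.9% -> 3; clamped to avoid log(0) at 100% success.
      def nines(successRate: Double): Double =
        -log10(max(1e-6, 1.0 - successRate))
      ```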

      I didn’t follow the bit about binomial distributions (I’ve never been great with stats), but what I find confusing is that you seem to imply that p2c with the same weights would produce a less uneven distribution than WRS. That seems like it might be correct in the short term, but I have concerns about long-term unevenness (which I mentioned in the post). Is that something I’ve gotten wrong?

      1. 2

        > Interesting to hear that you haven’t seen much in the way of mobbing. I’d expect that with p2c (as you said, convergence is slower), but I’m curious if you’ve seen it with other methods. I wasn’t sure how realistic a concern it was.

        I think it might have to do with your throughput and latency? I’ve heard concerns about mobbing from people working on realtime systems too, but distributed systems typically have latencies on the order of hundreds of microseconds to milliseconds, which might be saving us from the worst kinds of bursts. We haven’t really done the research to figure out why we aren’t affected, though. Marc Brooker from AWS ELB has a good post about how p2c does well in those circumstances.

        > I didn’t follow the bit about binomial distributions (I’ve never been great with stats), but what I find confusing is that you seem to imply that p2c with the same weights would produce a less uneven distribution than WRS. That seems like it might be correct in the short term, but I have concerns about long-term unevenness (which I mentioned in the post). Is that something I’ve gotten wrong?

        It’s probably a good idea to find someone at your company who has a background in stats to check your work; we’ve found that to be invaluable when reasoning about different load-balancing schemes. I’m not great at stats either, but I’ll try to explain =). We can model the number of requests that a given backend receives under a true random load balancer (I’m assuming all weights are the same, both to make the stats easier and because it will probably often be the case) as a binomial distribution: a distribution where an experiment (a remote peer picking a backend) happens n times, and each experiment “succeeds” (picks this specific backend) with probability p. It will probably look sort of like a normal distribution. You can imagine the PDF as a dartboard, where each host is a dart that can land anywhere under the curve; after the dart lands, you check the number of “successes”, and that’s your server’s concurrency rate. So really what you want is a super tight distribution around a single number, because that would produce a dartboard where every host is receiving the same amount of traffic.
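
        If it helps, here’s a tiny simulation of that point (illustrative only): with uniform random selection the per-backend counts follow a binomial distribution and spread out, while p2c on a load signal stays much tighter.

        ```scala
        import scala.util.Random

        object SpreadDemo {
          def main(args: Array[String]): Unit = {
            val k   = 10    // backends
            val n   = 10000 // requests
            val rng = new Random(42)

            // Uniform random: each backend's count is Binomial(n, 1/k).
            val randomCounts = Array.fill(k)(0)
            (1 to n).foreach(_ => randomCounts(rng.nextInt(k)) += 1)

            // p2c on "load" (here just the running count): pick two, use the emptier one.
            val p2cCounts = Array.fill(k)(0)
            (1 to n).foreach { _ =>
              val i = rng.nextInt(k)
              var j = rng.nextInt(k - 1); if (j >= i) j += 1
              val pick = if (p2cCounts(i) <= p2cCounts(j)) i else j
              p2cCounts(pick) += 1
            }

            println(s"uniform random: min=${randomCounts.min} max=${randomCounts.max}")
            println(s"p2c:            min=${p2cCounts.min} max=${p2cCounts.max}")
          }
        }
        ```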

        I see your concern if there’s a persistent difference in health. We’ve found that our remote peers are usually either healthy and basically never return failures, have a blip where they return failures for a bit and then recover, or are very unhealthy, decay very fast, and perish. We don’t typically run into the case where just one server has a low success rate forever; with that said, there’s a good chance you would simply want to route around that. The p2c paper does make the assumption that you’re using load as your load metric, which the p2c algorithm itself evens out by adding load. Since p2c can’t do anything about health (assuming it’s not related to load), that might be more of an issue for you if you find these persistent health differences to be common in your environment.

        > latency (inverse of a decaying exponential average) … concurrency (inverse)

        One thing you might want to consider is that latency and concurrency (or “load”) are sort of measuring the same thing if your load balancer is working: a faster server should receive more requests, but also maintain a low level of concurrency, because it can clear requests faster. However, we’ve found that measuring latency is better than measuring concurrency, because it means that new hosts don’t get slammed on start-up, and it can automatically slow down when the remote peer starts to slow down under heavy load.

      2. 2

        Oh hey, I really like the Peak EWMA calculation here: https://github.com/twitter/finagle/blob/master/finagle-core/src/main/scala/com/twitter/finagle/loadbalancer/PeakEwma.scala#L52 As I understand it, the weight is adjusted by the call frequency to make a constant-time half-life (that’s EWMA for unevenly spaced time series) but also any measurement over the current average is just taken wholesale (that’s the peak-sensitivity). A coworker had suggested something like the peak-sensitivity aspect to deal with the low-latency failure situation that rachelbythebay calls the “load-balanced capture effect”, but I guess there’s a more general applicability. Very nice.
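
        As I read it, the core update is roughly this (paraphrased and simplified from the linked code, with my own names):

        ```scala
        import scala.math.exp

        final class PeakEwmaSketch(tauNanos: Double) {
          private var cost: Double = 0.0
          private var stamp: Long = System.nanoTime()

          def observe(rttNanos: Double): Double = {
            val now = System.nanoTime()
            val td  = math.max(now - stamp, 0L).toDouble
            stamp = now
            val w = exp(-td / tauNanos)          // decay depends on elapsed time, not call count
            if (rttNanos > cost) cost = rttNanos // peak sensitivity: jump straight up to the sample
            else cost = cost * w + rttNanos * (1.0 - w)
            cost
          }
        }
        ```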

        1. 1

          > We’ve considered having servers advertise a load metric to the client, but haven’t invested heavily in it, because what we have works pretty well so far.

          > Load Shedding: We were able to opt every client at Twitter into load shedding by ensuring that we shed load gently, and that we only do it when servers are rejecting traffic at high volumes.

          You get both of these if you use a configuration where the frontend requests a token (or multiple tokens up front) and you process requests in Last-In-First-Out (LIFO) order.

          When a request comes in, the load balancer broadcasts/multicasts a request asking “hey, who can do this work for me?”. Each node delays its response by a number of microseconds that is some function of its load. The first reply to reach the load balancer gets the actual request.

          If a backend is busy, it naturally gets fewer requests. If a backend fails, it never signals its willingness to do work. You no longer need to store state in the load balancer about health (though in practice this is not really a big problem). Moving from offline to online, a node can have a forced warmup period of 15 seconds or so to reduce service flapping.

          Another nice side effect is that your load balancer no longer needs routing logic, since the willingness-to-work request can contain metadata (or the full request) and the node can decide not to respond at all if it is not configured to handle that request (think data sharding, or splitting your interactive queries from your batch jobs, etc.). Of course, while the node waits for the actual request to come through, it can pre-emptively start processing it.
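
          The node-side “bid” could be as simple as something like this (entirely illustrative, no particular framework assumed):

          ```scala
          // Returns None when the node shouldn't bid at all (can't handle the request,
          // or is already at capacity); otherwise a reply delay that grows with load.
          def replyDelayMicros(inFlight: Int, capacity: Int, canHandle: Boolean): Option[Long] =
            if (!canHandle || inFlight >= capacity) None
            else Some((inFlight.toLong * 100) / math.max(capacity, 1))
          ```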

          1. 2

            That’s really interesting, but it also entails a bunch of extra round trips and network load. We’ve experimented with giving out leases periodically, but not on each request. Multicast is also extremely expensive: under an aperture scheme it wouldn’t be as bad, but in the normal case, for small messages, it would increase your network overhead by a factor of your number of remote peers. The scheme you’re describing of delaying your response also seems like it would deliberately increase latency, which is something we’re trying to optimize for. Since network latency is often on the order of hundreds of microseconds already, you would have to make your delay also be on the order of hundreds of microseconds for it not to just be swallowed by network latency variance.

            I’m also not as optimistic as you about health–in particular, a server might be able to respond happily but still be serving unhealthy responses. There are different kinds of failures, and backends aren’t always good at measuring them.

            I think directly communicating server-load data via out-of-band requests, or piggy-backing updates on responses, is probably more appropriate. As you mentioned, it’s not that burdensome to keep track of load data.

        2. 4

          Really enjoyed this! Super relevant to some of the stuff we’re doing at work currently, thanks for taking the time to research and write this up.

          1. 2

            Glad to hear! I’m hoping to get some of the code I wrote while doing this research into production this month, and then we’ll see if the theory’s any good. :-P (Either way, I should make another post later with what I find, and a usable extract of the code.)

            If you end up trying out a traffic-based approach, I’d love to hear how it works out.

            1. 3

              The bulk of the routing we’re doing is on the database protocol we built for Neo4j (this one). I’m not directly involved in developing that anymore, but I am very much involved in load balancing with it. One of the downsides of custom protocols is that you don’t get built-in introspection in the big load balancers like you do with HTTP. Maybe it’s time to write some haproxy extensions.. :)

          2. 3

            For transparency, this is also up on my employer’s blog, since this was mostly researched and written on company time.

            1. 3

              This area is really interesting to me, since I happen to work for about the only two companies that still build this stuff from the ground up (Walmart and Google).

              One minor addition: Usually to scale out multiple load balancers you use ECMP or anycast to distribute traffic to multiple machines sitting next to each other. DNS is usually used to distribute to a region.

              Interestingly almost everyone in the industry seems to have ended up with a pretty similar stack for load balancing:

              1. Anycast DNS/CDN to get customers to a close but alive datacenter, with external healthcheckers to determine “alive”.
              2. A router doing something like resilient hashing to inline “network” load balancers (Google’s Maglev, Facebook’s Katran, AWS’s Network Load Balancer).
              3. Optional: a relatively fast L7 LB like haproxy. Really useful for org-wide metrics if you don’t have homogeneous infrastructure around HTTP handlers/frameworks/sidecars, and also commonly useful for routing to the right microservice. Apache and nginx often serve similar functions here.
              4. A bunch of VMs speaking HTTP/2 or plain HTTP.
              1. 2

                Ah, thanks for the note on the networking side—that’s still all a bit hazy to me. I’d observed ELBs (and similar) using DNS load balancing, and knew that there was more than one node behind each IP address, but wasn’t clear on how exactly they were distributing the load in that intermediate segment.

              2. 2

                Nice write up. I think this kinda thing is easy to overlook until it smacks you in the face.

                I came to a similar conclusion after running BEAM in real life. If the whole point of the runtime is fault tolerance, it actually becomes a lot more complicated to figure out system health, since you need multiple vantage points. It seems so obvious in hindsight…oops.

                1. 1

                  I would read the heck out of a blog post titled “Load Balancers Are Distributed Systems And We’d Better Start Treating Them That Way”.