1. 9

  2. 9

    For a deeper dive on exponential backoff and jitter, Amazon has a really good article. When we get into circuit breakers, we begin to dive into failure detection, which is a much bigger story–for one thing, many servers don’t know when they’re in an unhealthy state, so clients simply have to guess. Ideally, tools like exponential backoff can redistribute bursts adaptively so that the server isn’t overwhelmed, but eventually you need more heuristics.
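
    Roughly, the capped “full jitter” backoff that Amazon article recommends looks like the sketch below. This is a generic Python illustration, not Finagle code, and the names and parameters are made up for the example.

    ```python
    import random
    import time

    class TransientError(Exception):
        """Stand-in for whatever your RPC layer raises on retryable failures."""

    def backoff_with_full_jitter(attempt, base=0.05, cap=5.0):
        """Sleep time (seconds) for retry number `attempt` (0-indexed).

        Full jitter: pick uniformly from [0, min(cap, base * 2**attempt)],
        which spreads retries out instead of synchronizing clients into a burst.
        """
        return random.uniform(0, min(cap, base * (2 ** attempt)))

    def call_with_retries(operation, max_attempts=4):
        """Run `operation`, retrying transient failures with jittered backoff."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except TransientError:
                if attempt == max_attempts - 1:
                    raise
                time.sleep(backoff_with_full_jitter(attempt))
    ```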

    This article addresses one way, which is for clients to guess, and then circuit-break when things seem really bad. This is a useful technique for a few reasons–one is that retry storms can be multi-tier, where each tier retries multiple times before failing, which makes retry storms worse at each subsequent level. Circuit-breaking gets around this, because the tier that’s having the problem just gives up instead of amplifying the load. This then gets back to the problem of, “How can the client decide if the server is healthy again?”
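
    Here’s a rough sketch of that kind of client-side circuit breaker: after enough consecutive failures the client stops sending for a cool-down period, then lets a trial request through. This is a generic illustration with assumed thresholds, not Finagle’s implementation.

    ```python
    import time

    class CircuitBreaker:
        """Give up on a backend after repeated failures instead of piling on retries."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the circuit is closed (healthy)

        def allow_request(self):
            if self.opened_at is None:
                return True
            # After the cool-down, let a trial request through ("half-open").
            return time.monotonic() - self.opened_at >= self.reset_timeout

        def record_success(self):
            self.failures = 0
            self.opened_at = None

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
    ```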

    One idea we’ve been playing around with in Finagle is having the server explicitly tell the client when it’s healthy or unhealthy, through pings, leases, and nacking. This is useful when the server has some metric for deciding when it’s in an unhealthy state, so we’ve been working on building a few different “I’m unhealthy” metrics. We have some work done on checking whether we’re being throttled by Mesos, checking whether a given request has already exceeded its deadline (or whether we’re not optimistic we’ll be able to finish its work by the deadline), and checking whether we’re going to do a GC soon.
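
    To make the nacking idea concrete, here’s a minimal sketch of a server refusing work up front based on the kinds of signals mentioned above. The health-check functions are hypothetical stand-ins, not Finagle or Mesos APIs.

    ```python
    import time

    # Hypothetical health signals -- a real server would read cgroup/Mesos
    # stats and GC telemetry here; these stubs just mark where that plugs in.
    def cpu_throttled():
        return False

    def gc_expected_soon():
        return False

    class Nack(Exception):
        """Tells the client "not now" -- back off or try another replica."""

    def handle(request, process):
        """Nack up front instead of doing work we can't usefully finish."""
        if request.deadline is not None and time.monotonic() >= request.deadline:
            raise Nack("deadline already exceeded")
        if cpu_throttled() or gc_expected_soon():
            raise Nack("server is unhealthy, back off")
        return process(request)
    ```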

    This feeds into two other problems: what if only one client is misbehaving, and how can we prevent other clients from being affected by it? One solution is to give each client its own cluster, but this is generally inefficient unless you have pretty sophisticated autoscaling. Another is to provide a global per-client rate limiter, but introducing any layer of global coordination like this can be hazardous. One advantage of this kind of rate limiter, though, is that it can provide a global view of the health of a cluster–if every client perceives the cluster as unhealthy, they probably shouldn’t do as many retries!
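
    One common shape for that per-client limiter is a token bucket keyed by client ID, sketched generically below; whether the buckets live in each server or behind a shared coordination layer is exactly the trade-off above.

    ```python
    import time
    from collections import defaultdict

    class PerClientRateLimiter:
        """Token bucket per client: one misbehaving client exhausts its own
        tokens without starving everyone else."""

        def __init__(self, rate_per_sec=100.0, burst=200.0):
            self.rate = rate_per_sec
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)
            self.last_refill = defaultdict(time.monotonic)

        def allow(self, client_id):
            now = time.monotonic()
            elapsed = now - self.last_refill[client_id]
            self.last_refill[client_id] = now
            self.tokens[client_id] = min(
                self.burst, self.tokens[client_id] + elapsed * self.rate)
            if self.tokens[client_id] >= 1.0:
                self.tokens[client_id] -= 1.0
                return True
            return False
    ```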

    To sum up: retries are useful because sometimes your request hits a system when it’s in a bad state. It’s just a fact of life. We need retries, but when a cluster is overloaded, retrying can become counter-productive. There are a few reasons why this can happen, but the two main cases are unusually bursty traffic and a server that has simply gotten into a bad state (“I’ve fallen and I can’t get up”). For an unusual burst, it’s sufficient to smooth it out with exponential backoff; for “I’ve gotten into a bad state,” you need failure detection to tell when a server is unhealthy and when it becomes healthy again. Failure detection is a whole other can of worms.

    Anyway, this is just a brief overview of what we’ve been thinking about with retry storms, but it’s a pretty interesting area, and there hasn’t been a ton of open source work done in it, even though it can pretty easily take down your microservice architecture if you’re not thinking carefully about it.

    If you want to take advantage of Twitter’s work here, please take a look at Finagle! We handle a decent amount of traffic, we’ve seen some pretty bad stuff, and Finagle is optimized to be fast, high-throughput, and robust.

    If this stuff sounds exciting to you, please come work with me on it! My team builds the distributed systems toolkit for Twitter. It’s pretty fun stuff.

    1. 3

      One idea we’ve been playing around with in finagle is having the server explicitly tell the client when they’re healthy or unhealthy, through pings, leases, and nacking.

      This class of approach takes a bit of care, as it’s roughly a control system that only has an on/off switch and is thus liable to oscillate. Something with a more linear response, such as balancing based on how “full” the server is, will reduce the death-laser effect (where each set of servers that recovers and reports healthy quickly falls over again when all the traffic is sent its way).
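
      As a concrete illustration of that linear response (a generic sketch, not a description of any particular system): have each server report a “fullness” score in [0, 1] and let the client compare two random candidates, so a recovering server picks up some traffic immediately but not the whole cluster’s load at once.

      ```python
      import random

      def pick_backend(backends, fullness):
          """Power-of-two-choices using a reported fullness score in [0, 1].

          Requires at least two backends. Comparing two random candidates and
          taking the less full one gives a graded response instead of the
          on/off behavior of a plain healthy/unhealthy flag.
          """
          a, b = random.sample(backends, 2)
          return a if fullness[a] <= fullness[b] else b
      ```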

      If you have full control over your clients, you have more options, since you can trust them not to cheat.

      We have some work done on checking whether we’re being throttled by mesos

      If all of your application servers are out of CPU, you’re already in a major outage. The best action is to start graceful degradation and/or load shedding, while you add capacity or call in a human.

      Good load balancing, graceful degradation and provisioning/capacity planning can help you avoid getting to that point.
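
      A minimal sketch of what load shedding can look like (illustrative only, with made-up limits): once the backlog exceeds what the server can drain before clients give up, reject new work immediately rather than serving answers nobody is still waiting for.

      ```python
      class LoadShedder:
          """Reject new requests once the backlog exceeds a useful depth."""

          def __init__(self, max_queue_depth):
              self.max_queue_depth = max_queue_depth
              self.queue_depth = 0

          def try_admit(self):
              if self.queue_depth >= self.max_queue_depth:
                  return False  # shed now instead of timing out later
              self.queue_depth += 1
              return True

          def done(self):
              self.queue_depth -= 1
      ```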

      checking whether a given request has already exceeded its deadline (or else if we’re not optimistic we’ll be able to fulfill its work by the deadline)

      Quickly rejecting queries you know are going to fail is always good.
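
      For instance, a client-side version of that check might compare the remaining deadline budget against the fastest response it has seen and fail fast when the call can’t possibly finish in time (a sketch with made-up names, not a real API):

      ```python
      import time

      def should_attempt(deadline, min_observed_latency):
          """Skip calls that cannot finish before their deadline.

          `deadline` is an absolute time.monotonic() timestamp; if the
          remaining budget is smaller than the fastest response we've ever
          seen, the call is doomed, so reject it locally instead of wasting
          a backend's time on it.
          """
          return (deadline - time.monotonic()) > min_observed_latency
      ```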

      1. 3

        This class of approach takes a bit of care, as it’s roughly a control system that only has an on/off switch and is thus liable to have oscillations. Something with a more linear response such as balancing based on how “full” the server is will reduce the death-laser effect (This is where each set of servers that recover and report healthy, will then quickly fall over again when all the traffic is sent their way).

        Yes, this is true, and it’s true for practically any kind of coordination–these signals should only be used as hints. If every node in your cluster is saying, “Don’t talk to me,” it might make sense to still send them normal load, but shut off all retries. Load balancing matters more when it’s an individual server that’s having a problem rather than the entire cluster, since it doesn’t really matter which server you send your retry to if the entire cluster is overloaded! This is one of the nice things about nacks: they can be somewhat finer grained.
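
        One way to act on that hint, sketched generically here (the threshold is arbitrary): keep sending normal traffic, but stop retrying once the fraction of backends currently nacking crosses a limit, since a retry into a cluster-wide overload just adds load.

        ```python
        def retries_allowed(nacking_backends, total_backends, threshold=0.5):
            """Treat cluster-wide nacking as a hint to shut off retries.

            A few nacks just steer the load balancer elsewhere, but if most
            of the cluster is saying "don't talk to me," retrying anywhere is
            futile, so keep first attempts and drop the retries.
            """
            return (nacking_backends / total_backends) < threshold
        ```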

        If all of your application servers are out of CPU, you’re already in a major outage. The best action is to start graceful degradation and/or load shedding, while you add capacity or call in a human.

        You make a good point, which is that this specific feature probably won’t be relevant in the retry storm case. It’s targeted more toward helping individual servers which are in a bad state. I lumped it in because it’s part of the work we’ve done with failure detection. The whole cluster throttling case isn’t too common if you’re not doing something really silly. Some possible reasons this might happen:

        1. You’ve screwed up your capacity planning.
        2. Every server has gotten into a bad state, because your services are stateful and the state itself has gone bad.
        3. You accidentally built your service so that it could be destroyed by a poisoned request.

        What’s more likely is that individual servers are having trouble. For one thing, even stateless services typically aren’t perfectly uniform, and can temporarily get into bad states. For example, you might have some amortized work that just needs clients to back off for a little while. GC spirals are the most common example: your server spends a lot of time GC-ing, and then starts generating garbage faster than it can clear it away because it’s spending too much time GC-ing.