1. 14

  2. 7

    There’s a missing piece in this. Any retry after the first retry is often going to be useless (>90% of the time). Hence a second retry only ever makes things worse. This is something you’d measure by looking at histograms, and based on your actual failure modes of course (YMMV, but I’ve seen very large studies of this behavior at scale).

    Secondly, whenever there is 1 retry at one layer and more retries further down the stack, the retries compound: every single extra retry multiplies through the layers, so the amplification is exponential.

    So the real answer is 1 retry (top level only), token bucketed / with a circuit breaker driven by the first tries, not by the first retry.

    With well-behaved clients that know how to back off / token bucket / circuit break, and a server-side fast-fail-on-overload approach, you’re not really in a world of hurt with this. But if you don’t have all of those, you introduce things like brownouts, where a single failing server that fails fast receives a larger share of traffic because of the speed with which it responds to each failure.
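    To make the “retry once, gated by a token bucket” idea concrete, here is a minimal sketch (not from the article; RetryTokenBucket, call_with_single_retry, and the numeric parameters are illustrative, and you would tune them from your own metrics):

    ```python
    import random
    import time

    class RetryTokenBucket:
        """Token bucket that caps how much extra load retries can add.

        Successful first attempts earn a small fraction of a token; each
        retry spends a whole token, so during a large-scale failure the
        bucket drains quickly and retries stop.
        """

        def __init__(self, capacity=10.0, tokens_per_success=0.1, cost_per_retry=1.0):
            self.capacity = capacity
            self.tokens = capacity
            self.tokens_per_success = tokens_per_success
            self.cost_per_retry = cost_per_retry

        def record_success(self):
            self.tokens = min(self.capacity, self.tokens + self.tokens_per_success)

        def try_acquire_retry(self):
            if self.tokens >= self.cost_per_retry:
                self.tokens -= self.cost_per_retry
                return True
            return False

    def call_with_single_retry(call, bucket):
        """Make the call; retry at most once, and only if the bucket allows it."""
        try:
            result = call()
        except Exception:
            if not bucket.try_acquire_retry():
                raise  # fail fast: repeated failures look like a large-scale event
            time.sleep(random.uniform(0.0, 0.1))  # small jittered pause before the single retry
            result = call()
        bucket.record_success()
        return result
    ```

    This belongs only at the top-level caller; layers below it should not retry at all.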

    1. 2

      Yes, this is what I’m trying to get at, only it’s more clearly written :)

      1. 1

        Any retry after the first retry is often going to be useless (>90% of the time).

        Where does the 90% figure come from?

        1. 2

          Really good question - this was an internal metric across a very large number of systems at a place I used to work (80K+ systems, 3M+ servers). I don’t recall the exact metric, but it started with a 9. YMMV - use statistical data to work out your own numbers for this. Think of this as more of a rule of thumb than a specific number.

          Put another way, if you think about how failures in distributed systems happen, it’s usually things like overload, misconfiguration, a bad deployment, disconnection, hardware failure, etc. Given those causes, large scale failures were more common than intermittent ones. Failing twice is a good indicator of a large scale failure rather than an intermittent one. Large scale failure is where you want backoff the most, hence the retry-once advice.

          You can look at this mathematically. Say we only retry when we fail, and we have two possible failure modes: a large scale failure where 99% of calls fail, or an intermittent failure where 1% of calls fail.

          • Large scale failure (99% of your calls will fail):
            • success over two calls = 0.01+0.99*0.01 = 0.0199
            • 1.99% success / 98.01% failure
          • Intermittent failure (1% of your calls fail):
            • success over two calls = 0.99+0.01*0.99 = 0.9999
            • 99.99% success / 0.01% failure

          In this example, it’s 9801 times more likely that two consecutive failures indicate a large scale failure than an intermittent failure (0.99² = 0.9801 vs 0.01² = 0.0001). Maybe your numbers are different and you might consider going to a second retry, but capture some stats around your retry logic for first and second retries and see what they say.
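          For anyone who wants to check the arithmetic, here is a tiny script that reproduces the numbers above (the 99% / 1% failure rates are the hypothetical ones from this example, not measured values):

          ```python
          # Probability that a request eventually succeeds when we allow exactly
          # one retry, given a per-call failure probability p_fail.
          def success_with_one_retry(p_fail):
              p_success = 1.0 - p_fail
              return p_success + p_fail * p_success

          print(success_with_one_retry(0.99))  # 0.0199  -> 1.99% success (large scale failure)
          print(success_with_one_retry(0.01))  # 0.9999  -> 99.99% success (intermittent failure)

          # Likelihood ratio for observing two consecutive failures under each mode:
          print((0.99 ** 2) / (0.01 ** 2))     # ~9801
          ```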

      2. 0

        This completely misses the actual reason for backoff. There are a lot of systems with spiky usage that actually get less work done per calendar second and/or per watt as concurrency goes up. Any time you can avoid swapping, and any time you avoid a machine in a cluster becoming unresponsive, the total work done for X inputs will actually go up.
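        The usual client-side mechanism for shedding that excess concurrency is capped exponential backoff with jitter. A minimal sketch (the base/cap values are illustrative, not taken from the thread or the article):

        ```python
        import random
        import time

        def backoff_delay(attempt, base=0.1, cap=10.0):
            """Capped exponential backoff with full jitter.

            Picking the delay uniformly from [0, min(cap, base * 2**attempt)]
            spreads a spike of retrying clients out in time instead of letting
            them all hit the server again at the same instant.
            """
            return random.uniform(0.0, min(cap, base * (2 ** attempt)))

        # Usage: sleep before each retry attempt (the call itself is omitted here).
        for attempt in range(3):
            time.sleep(backoff_delay(attempt))
        ```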

        1. 5

          I don’t see how he missed this point. From the opening paragraphs of the article:

          Backoff helps in the short term. It is only valuable in the long term if it reduces total work.

          Consider a system that suffers from short spikes of overload. That could be a flash sale, top-of-hour operational work, recovery after a brief network partition, etc. During the overload, some calls are failing, primarily due to the overload itself. Backing off in this case is extremely helpful: it spreads out the spike, and reduces the amount of chatter between clients and servers.

          1. 2

            Seems I missed that.