1. 26
    1. 8

      I wish that we had a better way to refer to this than as “nines”. I agree with all of your points; I wish that folks understood that going from X nines to X + 1 nines is always going to cost the same amount of resources.

      Here’s a further trick that service reliability engineers should know. If we compose two services which have availabilities of X nines and Y nines respectively into a third service, then the new service’s availability can be estimated within a ballpark of a nine by a semiring-like rule. If we depend on both services in tandem, then the estimate is around minimum(X, Y) nines, but should be rounded down to minimum(X, Y) - 1 for rules of thumb. If we depend on either service in parallel, then the estimate is maximum(X, Y) nines.

      As a technicality, we need to put a lower threshold on belief. I use the magic number 7/8 because 3-SAT instances are randomly satisfiable 7/8 of the time; this corresponds to about 0.90309 nines, just below 1 nine. So, if we design a service which simultaneously depends on two services with availabilities of 1 nine and 2 nines respectively, then its availability is bounded below 1 nine, resulting in a service that is flaky by design.
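
      A rough sketch of this rule of thumb, assuming nines are computed as -log10(1 - availability). The function names and the `is_flaky` helper are made up for illustration and are not from any library:

```python
import math


def nines(availability: float) -> float:
    """Convert an availability in [0, 1) to a (possibly fractional) count of nines."""
    return -math.log10(1.0 - availability)


# The 7/8 belief threshold from the comment above, expressed in nines.
THRESHOLD_NINES = nines(7 / 8)


def tandem_estimate(x: float, y: float) -> float:
    """Conservative rule of thumb when both dependencies are required."""
    return min(x, y) - 1


def parallel_estimate(x: float, y: float) -> float:
    """Rule of thumb when either dependency suffices (no independence assumed)."""
    return max(x, y)


def is_flaky(estimated_nines: float) -> bool:
    """Hypothetical helper: True when the estimate falls below the threshold."""
    return estimated_nines < THRESHOLD_NINES
```

      Under these assumptions, `tandem_estimate(1, 2)` is 0 nines, which is below the threshold of roughly 0.903 nines, so `is_flaky` reports the composed service as flaky.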

      1. 4

        “If we depend on either service in parallel, then the estimate is maximum(X, Y) nines.”

        Shouldn’t that be X + Y? If you have a service that can use either A or B, both of which are working 90% of the time, and there is no correlation between A working and B working, then at least one of A or B should work 99% of the time.

        It’s possible that I misunderstand you, because I don’t understand the last paragraph at all.

        1. 2

          If you have two services with a 10% failure rate (90% uptime), the odds of both failing are .1 x .1 = 1% (99% uptime).

        2. 2

          I had a hidden assumption! Well done for finding it, and thank you. I assumed that it was quite possible for the services to form a hidden diamond dependency, in which case there would be a heavy correlation between dependencies being unavailable. When we assume that they are independent, then your arithmetic will yield a better estimate than mine and waste fewer resources.
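
          To make the difference concrete, here is a small sketch that compares the independent case with a simple diamond. It assumes the diamond can be modeled as one shared dependency that both services need; the function and parameter names are invented for this example:

```python
def either_independent(a: float, b: float) -> float:
    """Availability of 'A or B' when the two fail independently."""
    return 1 - (1 - a) * (1 - b)


def either_with_shared_dependency(shared: float, a_own: float, b_own: float) -> float:
    """Availability of 'A or B' when both also need one shared dependency.

    a_own and b_own are the availabilities of the parts that A and B do not
    share; the shared dependency is assumed to fail independently of them.
    """
    return shared * either_independent(a_own, b_own)
```

          With two independent 90% services this gives 99%. If instead all of the unavailability comes from a shared dependency that is up 90% of the time, the result stays at 90%, which matches the maximum(X, Y) estimate.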

      2. 2

        Article says:

        “Adding an extra ‘9’ might be linear in duration but is exponential in cost.”

        You say:

        “I wish that folks understood that going from X nines to X + 1 nines is always going to cost the same amount of resources.”

        I’m not sure what the article means by “linear in duration” or what you mean by “the same amount”. That said, your comment seems to conflict with my understanding: because 9s are a logarithmic scale, going from X to X + 1 9s should be expected to take an order of magnitude more resources than going from X - 1 to X 9s, and that’s an important fact about 9s. How am I misunderstanding your comment?

        1. 1

          Let me try specific examples first. Let’s start at 90% (1 nine) and go to 99% (2 nines). This has a cost, and got us a fixed amount of improvement worth 9% of our total goal. If we do that again, going from 99% to 99.9% (3 nines), then we get another fixed amount of improvement, 0.9%. My claim is that the cost of incrementing the nines is constant, which means that we get only roughly a tenth of the improvement for each additional nine. The author’s claim is that the overall cost of obtaining a fixed amount of our total goal is exponential; we get diminishing returns as we increase our overall availability. They’re two ways of looking at the same logarithmic-exponential relation.
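
          The arithmetic in these examples can be sketched as follows, assuming whole-number nines map to an availability of 1 - 10^(-nines). The helper names are hypothetical:

```python
def availability(nines: int) -> float:
    """Availability for a whole number of nines, e.g. 2 -> 0.99."""
    return 1 - 10 ** -nines


def gain(nines: int) -> float:
    """Absolute availability gained by going from `nines` to `nines + 1`."""
    return availability(nines + 1) - availability(nines)
```

          Here `gain(1)` is about 0.09 and `gain(2)` is about 0.009, so each additional nine buys roughly a tenth of the improvement of the previous one.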

          I don’t know what the author is thinking when they say “linear in duration”. I can understand their optimism, but time is only one of the costs that must be considered.

          1. 3

            Tbh, I’m still confused by the explanation. I’d make a simpler example claim - adding an extra nine does not have the same cost. Going from 90% to 99% is close to free. Going from 99.999% to 99.9999% is likely measured in millions of $. (with an exponential growth for every 9 in between) (we may agree here, I’m not sure :) )

            1. 1

              This hasn’t been my experience. I have seen services go from best-effort support (around 24% availability for USA work schedules) to a basic 90% or 95% SLA, and it takes about two years. A lot of basic development has to go into a service in order to make it reliable enough for people to start using it.

          2. 3

            I’m also rather confused by your claim. Are you saying that going from 99.9 -> 99.99 “costs” the same amount as going from 99->99.9, but you get 10x less benefit for it? I think that’s a rather confusing way to look at it, since from a service operator’s perspective, you’re looking for “how much effort do I need to expend to add a 9 to our reliability?” I also disagree that the cost for a 9 (the benefit aside) is at all linear.

            Going from 90->99 might be the difference between rsyncing binaries and running it under screen to building binaries in CI and running it under systemd. Going from 99.9->99.99 is very clearly understanding your fault domains and baking redundancy into multiple layers, geographic redundancy, canaries, good configuration change practices. 99.999 is where you need to start thinking about multiple providers (not just geographic redundancy), automated recovery, partial failure domains (i.e., fault-focused sharding), much longer canaries, isolation between regions.

            The effort (and cost) required to achieve greater reliability increases by at least an order of magnitude for each nine, and to your point, it’s also worth less.

            1. 1

              I appreciate your focus on operational concerns, but code quality also matters. Consider this anecdote about a standard Web service. The anecdote says that the service restarted around 400 times per day, for an average uptime of 216 seconds. Let’s suppose that the service takes one second to restart and fully resume handling operational load; by design, the service can’t possibly exceed around 99.5% availability on a single instance.
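
              The numbers in that anecdote can be checked with a short sketch. It assumes restarts are the only source of downtime and that each restart costs a fixed number of seconds; the function name is invented here:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_single_instance_availability(restarts_per_day: int, restart_seconds: float) -> float:
    """Upper bound on availability when restarts are the only downtime."""
    downtime = restarts_per_day * restart_seconds
    return 1 - downtime / SECONDS_PER_DAY
```

              For 400 restarts per day at one second each, the restarts are 216 seconds apart on average and the bound comes out at roughly 99.54%.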

              In some sense, the tools which you espouse are not just standard ways to do things well, but also powerful levers which can compensate for the poor underlying code in a badly-designed service. While we might be able to eventually achieve high reliability by building compositions on top of this bad service, we should really consider improving the service’s code directly too.

              1. 1

                I think we’re generally on the same page here: I’m not saying you don’t need to improve your service’s code. Quite the opposite. “Baking redundancy into multiple layers” and “understanding your fault domains” fall into this category.

                There’s also just general bugfixing and request profiling. A pretty typical way of measuring availability is by summing the requests that failed (for non-client reasons) and dividing that by the total number of requests; availability is one minus that ratio. Investigating the failed requests often leads to improvements to service behavior.
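
                A minimal sketch of that measurement, assuming the failed count already excludes client-caused errors (how those are classified is up to the service). The function name is made up for illustration:

```python
def request_availability(total_requests: int, failed_non_client: int) -> float:
    """Request-based availability: the share of requests that did not fail
    for non-client reasons."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1 - failed_non_client / total_requests
```

                For example, 10 such failures out of 1,000,000 requests works out to 99.999%.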

                That being said, there will still be unknowable problems: a cache leaks data and eventually takes down the task. You need multiple tasks to keep servicing requests while you solve the problem. A client query of death starts hitting your tasks: if you’re lucky, you have enough tasks to not notice, but perhaps they’re making requests fast enough that your entire fleet is downed. Perhaps they should have been consistently directed to a smaller pool of tasks to limit their blast radius.

                You need both a well-written service and systemic reliability. The effort is greatly increased with every 9.

    2. 5

      Some basic math for composing service reliability, because it looks needed. De Morgan’s law isn’t magic:

      Arbitrary probability X is ln(1-X)/ln(0.1) nines. 0.995 is NOT 2 and a half 9s it’s more like 2.3
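
      As a quick check of that conversion, here is a sketch that implements the formula directly. The function name is just for illustration:

```python
import math


def nines(p: float) -> float:
    """Number of nines for an availability p in [0, 1): ln(1 - p) / ln(0.1)."""
    return math.log(1 - p) / math.log(0.1)
```

      `nines(0.995)` comes out at roughly 2.30, while `nines(0.99)` is about 2.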

      Assume 2 services A and B which have independent failure modes (treat any shared infrastructure as its own service), and let P(X) be the reliability of system X, expressed as the fraction of time it is up.

      Requests need both services A and B:

      P(A and B both up) = 1 - ((1-P(A)) + (1-P(B)) - (1-P(A))*(1-P(B))) = P(A)*P(B)

      Basically: convert uptime to downtime, find the odds of either system having downtime, then flip back to uptime. This is only so awkward because we talk about uptime, but the math focuses on downtime.

      Requests can be handled by either A or B as redundant systems:

      P(A or B up) = 1 - (1-P(A))*(1-P(B))

      = P(A)+P(B)-P(A)*P(B)

      Basically these add 9s together (almost, but not actually)
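
      Both compositions can be sketched in a few lines, under the same independence assumption and with uptimes given as fractions. The function names are invented for this example:

```python
def both_needed(p_a: float, p_b: float) -> float:
    """Uptime when every request needs both A and B."""
    down_a, down_b = 1 - p_a, 1 - p_b
    # Odds that at least one of the two is down (inclusion-exclusion).
    either_down = down_a + down_b - down_a * down_b
    return 1 - either_down


def either_suffices(p_a: float, p_b: float) -> float:
    """Uptime when A and B are redundant and either can serve a request."""
    both_down = (1 - p_a) * (1 - p_b)
    return 1 - both_down
```

      For uptimes of 0.99 and 0.999, this gives about 0.98901 when both are needed and about 0.99999 when either suffices.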

      If you want your over-arching system to have X nines, shared infrastructure is going to need to be a lot more reliable than component services or it will dominate failure modes.

    3. 3

      Thanks for the post!

      “Stripe’s API has had 99.999% uptime, or three 9’s”

      I believe 99.999% is typically referred to as “five nines” in most circles. Was this a typo or a different way to frame it?

      1. 1

        Sorry, typo! Too many 9’s for me. Fixed.