1. 12

  2. 5

    The EC2 SLA is pretty clear:

    “Monthly Uptime Percentage” is calculated by subtracting from 100% the percentage of minutes during the month in which Amazon EC2 or Amazon EBS, as applicable, was in the state of “Region Unavailable.”

    “Region Unavailable” and “Region Unavailability” mean that more than one Availability Zone in which you are running an instance, within the same Region, is “Unavailable” to you.
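
    For concreteness, the quoted math works out like this - a minimal sketch in Python (the function name and the example outage are my own illustration, not anything from AWS):

    ```python
    # Sketch of the SLA's "Monthly Uptime Percentage" math (my illustration, not AWS code).
    def monthly_uptime_percentage(region_unavailable_minutes, minutes_in_month=30 * 24 * 60):
        # 100% minus the percentage of minutes spent "Region Unavailable"
        # (i.e. more than one AZ you run instances in was Unavailable).
        return 100.0 - 100.0 * region_unavailable_minutes / minutes_in_month

    # A 90-minute multi-AZ outage in a 30-day month:
    print(monthly_uptime_percentage(90))  # 99.7916...
    ```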

    AWS reports status at the region level, so saying the status board “lies” is an overstatement - it reports consistently with the SLA.

    Having said that, you mention “network partitions” and I can definitely sympathize with the difficulty of running a distributed system - maybe AWS can find a better way to highlight when a region is impaired in ways that fall short of the SLA’s definition of unavailable.

    1. 5

      AWS reports status at the region level, so saying the status board “lies” is an overstatement - it reports consistently with the SLA.

      I’m not certain that I agree. The SLA you quoted defines what “Region Unavailable” and “Region Unavailability” mean, but his complaint isn’t that the AWS status page reported a region available when it wasn’t (because you’re correct: the region was “available” per the SLA’s definition).

      Rather, his complaint is that the AWS status page describes the region as “operating normally” - a phrase your SLA quote doesn’t touch on at all - which seems dishonest given that at least part of the region demonstrably is not operating normally.

      It would be more honest, and consistent both with their SLA and with the observable set of facts in the physical universe we occupy, if the AWS status page were to describe the region as both Available AND yellow/“experiencing some performance issues”.

      Put another way: if everything better than “Region is flat-out unavailable per SLA” (presumably, red) is green-with-at-most-an-asterisk, why even have a yellow?
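
      If it helps, here’s the distinction I’m drawing, sketched in Python (the names are invented for illustration - obviously not AWS’s internals). The SLA answers the red/green question; “operating normally” is a separate, yellow question:

      ```python
      from dataclasses import dataclass

      @dataclass
      class RegionStatus:
          available_per_sla: bool  # the question the SLA quote actually answers
          degraded: bool           # the question "operating normally" glosses over

          def color(self):
              if not self.available_per_sla:
                  return "red"     # Region Unavailable per the SLA
              if self.degraded:
                  return "yellow"  # available, but not operating normally
              return "green"

      # A region with a partitioned AZ would be Available AND yellow:
      print(RegionStatus(available_per_sla=True, degraded=True).color())  # yellow
      ```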

      1. 1

        Perhaps so - and again, I’m not suggesting that the status quo is good here; I curse the “blue i” as much as anyone with an AWS workload does.

        I think the best recourse may be to split reporting out to the AZ level - so that transient network issues in us-east-1d make it go yellow.
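
        Something like this rollup, say (AZ names and health states are illustrative, not a real monitoring API):

        ```python
        def region_color(az_health):
            # az_health maps AZ name -> "ok" | "degraded" | "down"
            down = [az for az, state in az_health.items() if state == "down"]
            if len(down) > 1:
                return "red"     # more than one AZ down ~ "Region Unavailable"
            if down or any(state == "degraded" for state in az_health.values()):
                return "yellow"  # e.g. transient network issues in one AZ
            return "green"

        print(region_color({"us-east-1a": "ok", "us-east-1d": "degraded"}))  # yellow
        ```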

        But even then I’d suggest that defining a threshold is difficult in the world AWS occupies: from my experience working with several thousand nodes in an on-prem data center, I’ve learned that a certain percentage are always going to be broken in some way.

        I’m sure that AWS has a dramatically lower percentage than I did - better monitoring, maintenance, etc. - but at their scale it’s likely that several orders of magnitude more real servers are fucked at any given moment. So I’m not even sure a service health dashboard beyond “here’s the health of your environment right now” means much at all.
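
        To put rough, made-up numbers on that:

        ```python
        # Both rates and fleet sizes are invented for illustration.
        fleets = {"on-prem": (5_000, 0.01), "cloud-scale": (5_000_000, 0.001)}
        for name, (hosts, broken_rate) in fleets.items():
            print(f"{name}: ~{int(hosts * broken_rate)} hosts broken at any given time")
        # on-prem: ~50 broken; cloud-scale: ~5000 broken, despite a 10x lower rate
        ```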