1. 30
  1. 5

    This is a really difficult problem for all parties involved. IaaS/PaaS providers can only look at certain kinds of fairly general metrics. This includes things like network traffic, database availability, HTTP requests being routed to appropriate processes, etc. Once a cloud provider hosts code they didn’t write or uses services they don’t directly maintain, there is a potential point of failure.

    Typically cloud providers view availability as control-plane and administrative console availability. Are we able to provision new resources and are those resources staying up to the best of our ability? If the answer is yes then we consider our service “available”. A single customer’s database being down or web processes being unavailable typically does not constitute an outage unless it is a widespread issue or something that was a result of a misconfiguration or bad code deploy on our part.

    Full disclosure: I work at Heroku and we’ve had discussions about this exact topic numerous times. Please be respectful and don’t throw any flame my way, I’m an engineer and don’t make product decisions (only advocate for them).

    1. 3

      yeah. agree with most of this. a happy middle ground seems to be “give operators enough signals to evaluate health on an individual level themselves, while they implement robustness at the application level.”

      that’s tricky too of course, don’t want to leak details of the magic behind whatever abstraction you’re supplying to clients in the metrics you give. I don’t think there’s a cloud platform that’s doing this well at the moment – closest is probably GCP, though at times I feel overwhelmed by the sheer number of SLA violations/outages they inform me about.

      1. 1

        Signals and events are a difficult problem and something I’ve wanted to implement for years. Sending people email during these issues was nice in 2011 but now I feel like an event stream or (at least) webhooks would be better to give customers control over what to do in the event of a service disruption.

      2. 1

        The article is not about customers’ services, but provider-run services that are not accounted for: “Then you’ve had a whole bunch of fun outages caused by something going wrong with their services”

      3. 3

        We recently had an incident where occasionally the magic login email from our main application was taking between 8 and 10 minutes to send by Postmark after the application had pushed it. The link expires 5 minutes after being sent so even when the email came through it was useless.

        It didn’t matter that the app had five nines uptime; from our clients’ point of view it was down intermittently. This prompted me first to extend the expiry on the magic auth links and then to write a service that checked email delivery time, with anything over 30 seconds being considered a failure and therefore raising an alert.
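
        A minimal sketch of what such a delivery-time check could look like, in Python. It assumes the send and receipt timestamps of probe emails are already being collected somewhere (for example by polling a test mailbox); `DeliveryProbe`, `is_failure` and `failed_probes` are hypothetical names for illustration, not anything from Postmark or the service described above. Only the 30-second threshold comes from the comment.

        ```python
        from dataclasses import dataclass
        from typing import List, Optional

        # Threshold taken from the comment above: anything over 30 seconds is a failure.
        ALERT_THRESHOLD_SECONDS = 30.0


        @dataclass
        class DeliveryProbe:
            """One probe email (hypothetical record shape)."""
            message_id: str
            sent_at: float                # Unix timestamp when the app handed the mail off
            received_at: Optional[float]  # Unix timestamp of arrival, None if not seen yet


        def is_failure(probe: DeliveryProbe, now: float,
                       threshold: float = ALERT_THRESHOLD_SECONDS) -> bool:
            """True if the probe took, or has already been waiting, longer than the threshold."""
            if probe.received_at is None:
                # Not delivered yet: fail once it has been outstanding for too long.
                return now - probe.sent_at > threshold
            return probe.received_at - probe.sent_at > threshold


        def failed_probes(probes: List[DeliveryProbe], now: float,
                          threshold: float = ALERT_THRESHOLD_SECONDS) -> List[str]:
            """Return the message ids that should raise an alert, in input order."""
            return [p.message_id for p in probes if is_failure(p, now, threshold)]
        ```

        Treating a still-undelivered probe as a failure once it has been outstanding past the threshold means an alert can fire while the delay is happening, rather than only after the late email finally shows up.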

        1. 5

          Not trying to tell you how to do your job but I find that relatively unrealistic. Every time you send mails to a wide array of email providers there will always be some where delivery takes longer than a minute. Especially corporate and free mailers (and what’s left then?). So sure, if you can rule out a general minimum delivery of 30s, fine - but alerting because some email arrived only after 5 minutes sounds wrong.

          1. 5

            agreed. Email is not instant, nor should it be. It’s async on purpose. Trying to treat it like it’s instant will cause you heartache. The defaults for SMTP bounces were like 2 weeks back in the day, I bet many implementations still have them set pretty long. Plus you can never ever tell when an email actually made it to a person’s mailbox, let alone when they bother to check it.

            If you want instant/real-time delivery, use a chat protocol. Something designed for fast delivery.

            1. 2

              You’re not wrong.

              The thing is, the majority of the time, as in 99.999% of the time, emails will be in recipients’ inboxes within 30 seconds, the median actually being closer to 12 with a range of between ~6 and ~120 seconds. Our clients expect this to be the case and anything longer is perceived to be broken.

              In reality, extending the timeout on the emailed link is the solution, but having something in place that alerts us if emails don’t seem to be received within a time frame covers a part of our infrastructure that had until now gone completely unmonitored.

              1. 1

                Good to know some of these numbers, thanks. I’ve not had this problem in a while but apparently the median of hosts has gotten better. Or even more people use GMail and thus you will instantly either run or fail for 50+% of your target audience :P

          2. 3

            Every external dependency is a liability. If your cloud provider messed up and it impacted your customers, it’s still your responsibility. It’s a risk you took when deciding to rely on them.

            1. 6

              They can still provide individualized data so you can reevaluate this risk over time.

            2. 2

              The number of nines after the decimal point does matter. Over the course of a year (24 × 365.25 = 8 766 hours), an availability of 99.999 % is two orders of magnitude worse than an availability of 99.99999 % (0.08766 hours, about 5.3 minutes, vs. 0.0008766 hours, about 3.2 seconds, of downtime). Be wary of that.
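
              The arithmetic is easy to check with a short Python sketch. `downtime_hours` is a hypothetical helper, and the 8 766-hour year is the figure used above.

              ```python
              # Hours in an average year, as computed above: 24 * 365.25 = 8766.
              HOURS_PER_YEAR = 24 * 365.25


              def downtime_hours(availability_percent: float) -> float:
                  """Expected downtime per year, in hours, for a given availability percentage."""
                  return HOURS_PER_YEAR * (1 - availability_percent / 100)


              # Each pair of extra nines cuts the yearly downtime by a factor of 100.
              for pct in (99.9, 99.999, 99.99999):
                  hours = downtime_hours(pct)
                  print(f"{pct} % -> {hours:.7f} hours ({hours * 3600:.1f} seconds) per year")
              ```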

              1. 1

                Interesting that this implies monitoring systems where a metric is your customer’s subjective ability to use your system. Obviously you couldn’t write an SLA on a metric of something you’re not fully in control of, but this sounds like an awesome idea in a brave new world where, y’know, people give a shit.

                1. 0

                  I agree and disagree at the same time. I mean, yes, my provider’s uptime isn’t the same as my uptime, because how could it be? But at the same time, as a provider you can only monitor whether the services you are providing are “up and running”; you cannot check and monitor all of your customers’ services (especially when you run code written by them). You need to set clear boundaries between what is “you” and what is “provider”. You can be mad if “provider” didn’t deliver what they said, but you cannot require “provider” to manage your problem just because you are their client.

                  1. 8

                    I am not sure the argument is “provider needs to monitor my app.” The argument is that “provider needs to give me better insight into the services my app relies on so I can tell if my app is fucked, or provider isn’t providing service.”

                    These are distinct.

                    1. 5

                      In the mainframe and NUMA markets, the big machines usually had so-called RAS features that did things like monitor for reliability. Then, they could fix problems or even move apps somewhere else ahead of the big problem by detecting a pre-condition. The vendors of actual five-9’s systems like NonStop and Stratus also provide tooling like that.

                      Then, the commodity clouds show up saying they have five 9’s except you won’t get those 9’s using the service. Seems misleading. Plus, they can probably spot failures and alert customers if their forerunners from decades before could do it. They’re just not doing it. Another reason to consider one of the older, HA-for-real solutions if one is doing something needing high 9’s.