It may be irrational, from a strictly functional utility standpoint, to prefer one's own service that takes four hours to restore over a cloud service that's down for two. There's the psychological aspect: doing something feels better than waiting. But there's also a messaging aspect. At some point you have to explain the outage to your own customers, and perhaps more importantly, what you did about it. "We filed a ticket and sat around idly" doesn't sound great. Recognizing that your downstream customers will judge the outage somewhat irrationally is itself very rational. :)
There's nothing saying you can't do both. I fondly remember an issue with $RAILS_HOST where our team ended up splitting into three groups: one to try Heroku, one to try EC2, and one to hope our site db could be recovered from the cutting room floor and to handle external messaging. Team EC2 won by about fifteen minutes.
We were hit hard by the S3 outage, but we were by no means just waiting, and we certainly have messaging to take back to our customers. For us the story for our customers looks something like:
Our application went down due to an outage in S3 in the us-east region. While the outage was ongoing, we made a number of attempts to re-deploy assets to a different region or to EC2 instances. Unfortunately, a number of secondary failures in upstream dependencies (AWS console/CLI issues, ASG scaling issues, npm failures) prevented us from recovering. To mitigate future outages of this type, we are taking several steps. First, we've updated the caching proxy in front of S3 to handle more error cases and to fail over to cached data faster, particularly on timeouts from S3. Additionally, we're enabling cross-region S3 bucket replication and using DNS failover to provide redundancy behind CloudFront. Future S3 failures may result in reduced functionality in non-critical areas, but the core features of our service should remain available.
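For the "fail over to cached data faster" part, the nginx directives for serving stale cache entries are a natural fit. A rough sketch of what that proxy config might look like (the bucket name, cache sizes, and timeout values are illustrative, not from the post):

```nginx
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=s3cache:50m
                 max_size=10g inactive=7d;

server {
    listen 80;

    location /assets/ {
        # "our-assets-bucket" is a hypothetical name.
        proxy_pass https://s3.amazonaws.com/our-assets-bucket/;
        proxy_cache s3cache;
        proxy_cache_valid 200 1h;

        # Fail fast on a struggling origin instead of hanging on the defaults.
        proxy_connect_timeout 2s;
        proxy_read_timeout 5s;

        # Serve stale cached objects when S3 errors out or times out,
        # and refresh entries in the background once it recovers.
        proxy_cache_use_stale error timeout updating
                              http_500 http_502 http_503 http_504;
        proxy_cache_background_update on;
    }
}
```

The key line is `proxy_cache_use_stale timeout ...`: without it, a hung S3 request takes the whole asset path down with it even when a perfectly good copy is sitting in the cache.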
The details of what we did during the outage include more interesting stuff, like ssh-ing into the nginx EC2 instances to see if we could extract built artifacts from their caches, and trying to find a dev whose local environment could produce a production build of the app. We managed the latter, but things were starting to recover by then, so there wasn't much point in continuing to pursue it.
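Digging artifacts out of an nginx proxy cache is feasible because each cache file embeds its cache key in a `KEY: <proxy_cache_key>` header line, followed by the stored HTTP response. A sketch of the recovery step (the cache path and key pattern are assumptions; wrapped as a function for clarity):

```shell
#!/bin/sh
# Sketch: recover a cached response body from an nginx proxy_cache directory.
# Cache files start with a header embedding "KEY: <proxy_cache_key>",
# followed by the stored HTTP response (status line, headers, blank line, body).
recover_from_nginx_cache() {
    cache_dir=$1   # e.g. /var/cache/nginx (path is an assumption)
    key_pattern=$2 # regex matched against the stored cache key
    out=$3         # where to write the recovered body

    # -l lists matching files even when they contain binary header bytes.
    hit=$(grep -rl "KEY: .*${key_pattern}" "$cache_dir" | head -n 1)
    [ -n "$hit" ] || return 1

    # Drop everything through the first blank line (the stored response headers).
    sed '1,/^[[:space:]]*$/d' "$hit" > "$out"
}
```

On a real instance you'd run this over ssh; real cache files also have a binary prefix before the `KEY:` line, which `grep -l` copes with fine. It's a salvage hack, not a deploy mechanism, but during an outage a salvaged `app.js` beats no `app.js`.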
Then there are also situations where being able to assign priority to repairs is valuable in its own right. Or at least getting an estimate of the time to repair the critical path.
Unplanned outages are far worse than planned outages from a risk-management standpoint. And this one hit during US prime time. Most orgs aren't willing to incur the cost of being multi-cloud because the transition is expensive, so the ROI is very low. ::shrug:: It is what it is.
If you want to guarantee reliability and are willing to invest deeply in your infrastructure for the long term, build multi-cloud from the get-go, using open source components and an interface layer that permits multi-cloud rebalancing and switching. Not trivial, and it's mostly a long-term play.
Sounds like there’s room for an open source solution that handles abstracting over the major cloud providers.
It's not entirely psychological if you look at it in terms of the number of eggs in one basket. An enterprise's network/system going down usually just takes that one company down, and maybe causes failures in others that rely on it. Whereas Amazon's business model means one of its major service areas going down takes down a bunch of companies at once. That's a much bigger deal than an outage in a single company's network/systems.
Probably not in the larger picture, though, as most of the large companies we depend on day to day keep their stuff internal.
Part of the hype is that so many people depend on AWS. The more people talking about the problem, the more people think it's serious.