Anyone who’s worked at AWS knows everything is constantly on fire, but they do manage to keep blast radius small enough and overwork their on-calls enough that the chaos is rarely visible to customers.
How the heck is this viable for them?
It’s what AWS users pay Amazon for, right? Hardware fails, software has bugs, things will catch fire (figuratively or literally). We pay AWS so that Amazon’s workers take care of all that and we don’t have to think about it too much.
It’s just fascinating to me that such a process hasn’t been streamlined at this point, I guess.
Amazon’s whole thing is basically to shave the margins down to nothing and grease the wheels with human misery. It’s working as designed as far as I can tell.
There’s some truth there, but this statement also misses the forest for the trees.
What is streamlined?
Not have things on fire all the time?
I don’t think you can “streamline away” disk failures, RAM failures, power supply failures, datacenter cooling system failures, Internet connection failures, and all the other messy failures that occur when working with vast amounts of physical hardware. And for the most part, Amazon isn’t in control of the software running on that hardware; companies like Netflix can design intelligent systems where a whole bunch of nodes can fail at the same time, other nodes seamlessly take over for the failed ones, and workers can fix the failed nodes whenever it’s most convenient. But that requires fancy distributed software, and one of the core abstractions Amazon provides is that of a single highly reliable Linux computer with a fixed, large hard drive and a fixed IP address which never shuts down. That abstraction seriously limits what Amazon can do to engineer around downtime caused by hardware failures.
I’m not an expert in this by any means; it would be interesting to hear more specific details from someone who has done operations work for a cloud provider. But it’s not hard to imagine why what AWS does is a difficult problem to do cleanly.
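The Netflix-style pattern described above (redundant nodes where survivors take over for failed ones) can be sketched with a toy client-side failover loop. This is a hypothetical illustration, not any real Netflix or AWS API: `ReplicaSetClient` and the simulated nodes are made up, and a production version would add health checks, backoff, and so on.

```python
class ReplicaSetClient:
    """Toy sketch of client-side failover across redundant nodes.

    Hypothetical API for illustration; real systems add health
    checks, retries with backoff, and quorum logic.
    """

    def __init__(self, replicas):
        # Each replica is just a callable standing in for a network call.
        self.replicas = list(replicas)

    def call(self, request):
        # Try each replica in turn; a failed node is simply skipped,
        # so a single hardware failure never surfaces to the caller.
        errors = []
        for node in self.replicas:
            try:
                return node(request)
            except ConnectionError as exc:
                errors.append(exc)
        raise RuntimeError(f"all {len(self.replicas)} replicas failed: {errors}")


# Simulated nodes: one has "failed hardware", one is healthy.
def dead_node(request):
    raise ConnectionError("disk failure")

def healthy_node(request):
    return f"ok:{request}"

client = ReplicaSetClient([dead_node, healthy_node])
print(client.call("ping"))  # → ok:ping
```

The contrast with the “one reliable machine” abstraction is the point: here the *client* absorbs the failure, whereas a customer who was promised a single never-failing server with a fixed IP has nowhere to fail over to, so the provider has to eat that complexity instead.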
The App Service Plan baffles me. I saw it for the first time a couple of weeks ago because I want to automate some GitHub things using Azure Functions. The normal deployment model for these is a pay-for-what-you-use plan, which has a worst-case startup time of a few seconds if your function isn’t already loaded and ready to run in the Functions runner. The jump from that to the App Service Plan in pricing is quite considerable, and the jump in QoS is small (and getting a lot smaller as the cold-start latency for the normal Functions service improves). It really feels like a product that exists because one big customer demanded it, and was priced to try to convince them that they didn’t actually want it.
I recently tried using the az Python tool for the first time. It really made me appreciate PowerShell. For reference, the latest PowerShell version is 45 MiB. It’s also a lot nicer to use.