1. 28
  1.  

  2. 27

    Remember folks, The Cloud is just somebody else’s computer. :)

    1. 37

      They’re still better at managing it than me.

      1. 2

        Sure, but what if you don’t need something as complex as s3? If I just want to serve static files I can probably manage that just fine - probably better than AWS can manage something as complex as s3.

        1. 3

          Internally I’m sure S3 is super complex, but none of that is exposed to me.

          If you were using the AWS-recommended setup for a static site on s3 (that is, putting a CDN in front of it) then you likely didn’t notice the outage at all (for a few hours you couldn’t post new stuff but existing content is served out of your CDN).
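
          For the curious, that setup is only a couple of API calls - here’s a rough sketch with boto3 (the bucket name is made up, and the exact set of required CloudFront fields has shifted across API versions):

          ```python
          import time
          import boto3

          bucket = "example-static-site"  # hypothetical bucket name
          s3 = boto3.client("s3", region_name="us-east-1")
          cf = boto3.client("cloudfront")

          # 1. Turn on static website hosting for the bucket.
          s3.put_bucket_website(
              Bucket=bucket,
              WebsiteConfiguration={
                  "IndexDocument": {"Suffix": "index.html"},
                  "ErrorDocument": {"Key": "error.html"},
              },
          )

          # 2. Put CloudFront in front of the website endpoint. Edge caches keep
          #    serving whatever they already hold even if the origin is struggling.
          cf.create_distribution(
              DistributionConfig={
                  "CallerReference": str(time.time()),
                  "Comment": "static site behind a CDN",
                  "Enabled": True,
                  "Origins": {
                      "Quantity": 1,
                      "Items": [{
                          "Id": "s3-website",
                          "DomainName": f"{bucket}.s3-website-us-east-1.amazonaws.com",
                          "CustomOriginConfig": {
                              "HTTPPort": 80,
                              "HTTPSPort": 443,
                              "OriginProtocolPolicy": "http-only",
                          },
                      }],
                  },
                  "DefaultCacheBehavior": {
                      "TargetOriginId": "s3-website",
                      "ViewerProtocolPolicy": "redirect-to-https",
                      "ForwardedValues": {"QueryString": False, "Cookies": {"Forward": "none"}},
                      "TrustedSigners": {"Enabled": False, "Quantity": 0},
                      "MinTTL": 3600,  # keep objects at the edge for at least an hour
                  },
              },
          )
          ```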

          1. 1

            Had a static site set up exactly this way on us-east-1 and it went down.

            Was able to get a backup working on Firebase in about 30 minutes.

      2. [Comment removed by author]

        1. 7

          although I’m not authorized to speak for Amazon, I can confirm that Amazon has multiple servers.

          1. 7

            There were still other availability zones up for S3 - most devs are just not interested enough in HA to use them.

            1. 24

              Regions. All availability zones in east-1 were down.

              For S3, it’s difficult and expensive (you pay again for each region you duplicate into, and the automatic duplication features aren’t up to symmetric multi-region setups) to go multi-region. For most other things, it’s either impossible or seriously kills the benefits of a cloud platform. (You don’t have to deal with renting remote server space or setting up a VPN; all the other technical issues with setting up a widely-distributed system remain your problem.) And finally, when Amazon goes down, your particular outage is probably not the most severe one your customers are facing.

              So yeah, you’re not wrong, it would certainly be possible with good distributed design to have stayed up through this outage. It’s a hell of a lot more difficult and expensive than just flipping a switch, though.
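
              For reference, the “pay again and it’s one-way” part looks roughly like this with boto3 - a minimal sketch with invented bucket names and role ARN; both buckets need versioning, and nothing here gives you symmetric writes or automatic failover:

              ```python
              import boto3

              s3 = boto3.client("s3")

              # Replication needs versioning on both buckets, plus an IAM role S3 can assume.
              for b in ("example-primary", "example-replica-west"):
                  s3.put_bucket_versioning(
                      Bucket=b, VersioningConfiguration={"Status": "Enabled"}
                  )

              # One-way copy from primary to replica: every object is stored (and paid for)
              # twice, and pointing reads at the replica during an outage is still your job.
              s3.put_bucket_replication(
                  Bucket="example-primary",
                  ReplicationConfiguration={
                      "Role": "arn:aws:iam::123456789012:role/s3-replication",
                      "Rules": [{
                          "Status": "Enabled",
                          "Prefix": "",  # replicate everything
                          "Destination": {"Bucket": "arn:aws:s3:::example-replica-west"},
                      }],
                  },
              )
              ```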

              1. 6

                It’s worth noting also that you can get better uptime even within a single region by not making that one region us-east-1. As far as I can tell it’s the oldest, most crowded, and least stable one out there.
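
                Which, code-wise, is about the cheapest fix there is - a tiny sketch (bucket name made up):

                ```python
                import boto3

                # Same API, just don't put the bucket in us-east-1.
                s3 = boto3.client("s3", region_name="us-west-2")
                s3.create_bucket(
                    Bucket="example-bucket",
                    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
                )
                ```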

            2. 2

              They could just do good clustering on a reliable OS, like VMS did in the ’80s. Those clusters often had uptimes that exceeded the lifetime of current clouds.

              1. 15

                just

                1. 1

                  Compared to all the piles of tech I see in cloud-style deployments that are constantly changing, of varying quality, and with varying docs?

                  Yes, “just” install this OS on a few boxes, follow the manual on setting up clustering, set up the networking side, and you’re good to go. There are even consultants who can do it for you, available for reasonable fees. Old-timers call such tech a “known quantity,” where most surprises have been ironed out.

                  1. 3

                    OpenVMS won’t even have x86-64 support until 2019. Not sure it’s a super great candidate for a modern business operating system unless you’re in it out of necessity. (http://vmssoftware.com/pdfs/VSI_Roadmap_20161205.pdf)

                    1. 2

                      Remember, I said “like VMS did in the ’80s.” I’m not saying you have to use that specific product; I meant it more open-ended. You’d have to be OK with Itanium servers for VMS itself. There are other clustering solutions out there that are similar. There’s potential for FOSS to clone some of them. And so on.

                      1. 6

                        So I was around in the 80s. The clustering options were super proprietary, fragile, and rarely survived an upgrade. Although we had high reliability systems, we had exactly zero distributed, globally accessible, low latency, highly available, geographically disparate systems. And even fewer storage systems of that type. In fact, I don’t remember anything that could survive the loss of 5U of a rack, much less a whole rack, much less a cage, much less a datacenter. In fact, I could tell you stories about various hilarious attempts to make SCSI reliable that would make your hair stand on end.

                        The reason why you don’t see a proliferation of that type of system today, and in fact reach to find any example, is that it turned out to be an evolutionary dead end. The pattern of using many commodity components and tolerating failure turned out to be far more successful than using a smaller number of highly engineered components that armor themselves against failure. It turns out that in the presence of rapidly mutating state, and arbitrary threats, there are no tradeoff-free solutions.

                        1. 4

                          In other words, nobody read the RAID paper. ;) (ok, so that was 1988, too late to save the 80s)

                          1. 2

                            Appreciate the perspective. Many others told me something different: their stuff was quite resilient, with VMS admins saying so the most. As far as the 5U point, one bank lost a whole site of VMS servers in the WTC, with failover happening and a claimed loss of zero transactions. Who knows what the story is for remote filesystems.

                  2. 10

                    Is there a modern OS that can provide the strongly reliable clustering you describe at a relatively comparable cost to S3? Honest question.

                    1. 1

                      I haven’t surveyed them in a while. I doubt it will be that cheap, though.

                    2. 2

                      %CLOUD-F-NTCMFL, cloud network connection failure

                      1. 1

                        That will not help you if it is a network problem.

                        1. 1

                          That’s why standard practice in highly-available clusters is redundant networking links over different providers. Many used leased lines, too.

                    3. 6

                      If you’re doing it right, it’s several people’s computers.

                      1. 1

                        Possession is an obsolete concept. If you can ssh into it, it might as well be yours.

                        1. 3

                          I’m just going to assume you don’t run multi-user systems.

                          1. 1

                            Let’s be honest - is anyone running them outside of shell services and webhosting?

                            1. 2

                              Uh…yes? Think universities, engineering companies (real ones, not YCStartupprgonnabegoneintwomonths.ly), research labs…

                      2. 10

                        I do enjoy the “increased error rates” terminology. It’s the IT equivalent of political doublespeak. It’s not an outage, we’re just experiencing increased error rates of around 100%! The AWS status page might as well just be “ALL IS FINE MOVE ALONG” in 72 pt Comic Sans MS.

                        1. 4

                          This seems to keep happening every few years. If you take reliability seriously, you need to have your infrastructure spread across multiple cloud providers.
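
                          Even something as crude as client-side failover between mirrors on two providers covers a lot of the “us-east-1 is having a day” cases - a toy sketch, URLs invented:

                          ```python
                          import requests

                          # Hypothetical copies of the same asset on two different providers.
                          MIRRORS = [
                              "https://assets.example.com/logo.png",              # e.g. S3 + CloudFront
                              "https://example-backup.firebaseapp.com/logo.png",  # second provider
                          ]

                          def fetch(urls, timeout=3):
                              """Return the body from the first mirror that answers; raise if all fail."""
                              last_error = None
                              for url in urls:
                                  try:
                                      resp = requests.get(url, timeout=timeout)
                                      resp.raise_for_status()
                                      return resp.content
                                  except requests.RequestException as exc:
                                      last_error = exc  # move on to the next provider
                              raise last_error

                          data = fetch(MIRRORS)
                          ```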