1. 101
  1.  

  2. 47

    “there is no such thing as The Cloud, it’s just somebody else’s computer”

    1. 31

      Now it’s the next new thing: The Smoke.

      1. 12

        What I originally thought “serverless” meant

      2. 2

        That’s a clever phrase, but like most clever phrases, it’s only true when you load it down with extra conditions, and then it seems weak. For example, does the code have access to local features such as a file system, or to system-level features such as a monotonic system clock?

        If the code runs on a platform that doesn’t even permit access to any local file system or reveal whether your threads run in the same city, IMO that’s meaningfully different from “just someone else’s computer”.

        1. 17

          From the other direction, though: can the “someone else” look at all your stuff when it’s loaded and running? Then it’s their computer, not yours.

          It’s a mirror to the DRM problem: how do you hand the ciphertext, the decryptor, and the key over to another computer and then make sure that computer only does the things you want?

          1. 9

            If the code runs on a platform that doesn’t even permit access to any local file system or reveal whether your threads run in the same city, IMO that’s meaningfully different from “just someone else’s computer”.

            But you can quite easily build such a platform using computers that you own, stored in places that you own. Local resource visibility is a system property that doesn’t have much to do with ownership.

            I propose a simpler discriminator, namely the power switch and rack yanking factor. If you can, at least in theory, walk up to all computer nodes, turn them on or off as you see fit, yank them out of the rack and put them back in, then they’re your computers. If you can’t, they’re someone else’s.

            1. 2

              I suppose what it’s intended to highlight is that in the end “The Cloud” isn’t some magic ethereal thing, but that it’s real computers running your code, and that all the things that can go wrong with your computers can also go wrong with those computers, like going up in flames.

              Some of the marketing materials and “hype” surrounding The Cloud give a different impression, suggesting that it “always works” and is “never down”, but that’s not really how it works.

              Other than that, of course it’s all very useful. But it’s a valuable thing to keep in mind. For example, off-site backups in a different data centre are probably still a good idea.

              1. 1

                Isn’t OVH a budget provider of hosting, rather than what’s usually meant by “Cloud” nowadays (AWS, Azure)?

                I’ve not looked into AWS ToS, but I can’t imagine them saying “if we lose a big datacenter, you’re up shit creek unless you have backups”. The whole point of paying for cloud services is to not have to deal with that stuff.

                1. 5

                  Yes. OVH is mostly a physical and virtual computer landlord. They rent out physical and virtual computers; they don’t rent out services that are supposed to be always up. You’re basically renting somebody else’s computer, and it’s the customer’s responsibility to make their service reliable.

                  However, they did start to offer “OVH Cloud Compute” instances, which, from my understanding, are replicated. The prices for these are higher than their traditional VPS offering, though.

                  I wouldn’t be surprised if many of their customers just buy VPSs because they don’t understand the difference between the “Cloud Compute” and VPS offerings, and/or are optimizing for price.

                  1. 3

                    For my own curiosity, I looked up the AWS ToS and found this:

                    13.3 Force Majeure. We and our affiliates will not be liable for any delay or failure to perform any obligation under this Agreement where the delay or failure results from any cause beyond our reasonable control, including acts of God, labor disputes or other industrial disturbances, electrical or power outages, utilities or other telecommunications failures, earthquake, storms or other elements of nature, blockages, embargoes, riots, acts or orders of government, acts of terrorism, or war.

                    It doesn’t explicitly call out “fire”, but it does give them an out. Whether the building burning to the ground would be considered “within our reasonable control” would be up to the lawyers to argue about, I think.

                    1. 1

                      I’m not sure about OVH; I never used them. But didn’t AWS also lose a whole bunch of data a few times? e.g. in 2019 and 2011 from what I can find in a quick search. And then there was the big GitLab data loss.

                      1. 3

                        And? If you follow AWS guidelines you would not lose data. The point is to make all the layers reliable and fault tolerant, not to build a bulletproof DC (there is no such thing).

                      2. 1

                        If you are just running your own services on EC2, then backups are entirely up to you. There is no live migration to an unaffected data center, or anything like that.

                        Around 2013, I got a free EC2 instance in the AWS Ireland region. A week later, the entire datacenter went offline temporarily because lightning struck a power station, or something like that. After a week I gave up trying to get access to it again. No reply from support in that time, but I guess they were busy with the paying customers (understandably).

                        Unless you are a big enough customer that you can call them up and convince them to spend time on your servers instead of their other customers’ (I wouldn’t base anything on that), expect that any single region/DC can disappear at any time, and plan for that.

                        1. 4

                          We used to have a copy-paste answer for internal Amazon customers who ran a service on a single server and tried to open a high-severity ticket for a node outage.

                          “A single server cannot cause an outage that has a business impact. Your ticket priority will be lowered and if the data you had on that single server is lost the ticket will be closed”

                          1. 1

                            They didn’t even bother replying with that. In fact, they didn’t reply with anything at all, ever.

                            And no, it didn’t have any business impact, and the only thing lost was the expectation that support was reachable. :-)

                            1. 1

                              What are “internal Amazon customers”?

                    2. 1

                      The Cloud is “just somebody else’s computers, where the data is supposed to be distributed over several data centers in several countries”.

                      1. 1

                        The first rule of the cloud is that you have to use multiple AZs, or, in on-prem terms, datacenters. It is no surprise that OVH has datacenter problems: I think datacenters are one of the hardest problems in IT, and smaller companies have a hard time getting them right. AWS, Azure and GCP have far more resources to invest in figuring out how to build a reliable datacenter, and in building even more reliable services on top of them.
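
                        A minimal sketch of that rule, assuming AWS and boto3 (the region, AMI ID and instance type below are placeholders, not anything from this thread): enumerate the region’s availability zones and put one instance in each, so losing a single datacenter doesn’t take out every copy of the service.

                        ```python
                        import boto3

                        # Region and instance details are hypothetical; adapt to your own account.
                        ec2 = boto3.client("ec2", region_name="eu-west-3")

                        # Enumerate the availability zones the region currently offers.
                        zones = [
                            z["ZoneName"]
                            for z in ec2.describe_availability_zones(
                                Filters=[{"Name": "state", "Values": ["available"]}]
                            )["AvailabilityZones"]
                        ]

                        # Place one instance per zone instead of stacking everything in one AZ.
                        for zone in zones:
                            ec2.run_instances(
                                ImageId="ami-xxxxxxxx",  # placeholder AMI
                                InstanceType="t3.micro",
                                MinCount=1,
                                MaxCount=1,
                                Placement={"AvailabilityZone": zone},
                            )
                        ```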

                      2. 44

                        In France, a lot of companies rely on OVH to host simple websites or complex Cloud infrastructures. Unfortunately, some are not tech-savvy enough to have backups and Disaster Recovery Plans. I sympathize with those companies that just lost their website tonight.

                        1. 1

                          We probably have enough spare hours between us around here to put together some sort of sane repository of software and docs on how to do it simply and quickly, for the surge of people who’ll wake up from the general panic now.

                        2. 16

                          Just in case you want a visual on that, here is a German article with a picture of the burnt down SBG2 datacentre. Doesn’t look good.

                          1. 2

                            Some people have enough self-mockery to make fun of it:

                            https://pr0gramm.com/

                            1. 3

                              yeah, but these guys have offsite backups… so apparently the meme site has more technical know-how, and backups

                            1. 8

                              We did a Disaster Recovery test last night, successfully, though we always learn something new each time. We had a real disaster recovery scenario a couple of years ago, when one of the data centers was almost flooded. It was scary as hell.

                              People say that a backup which has not been tested to be restorable is not a backup; likewise, a disaster recovery plan which has not been tested is not a disaster recovery plan.
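
                              A minimal sketch of what “tested to be restorable” can look like in practice, assuming PostgreSQL and a nightly SQL dump; the paths, scratch database name and the users table are hypothetical, not anything from this thread. It restores the latest dump into a throwaway database and runs a sanity query against it.

                              ```python
                              import subprocess

                              DUMP = "/backups/latest.sql"   # hypothetical path to the newest dump
                              SCRATCH_DB = "restore_test"    # throwaway database used only for this test

                              # Recreate the scratch database and load the dump into it.
                              subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
                              subprocess.run(["createdb", SCRATCH_DB], check=True)
                              subprocess.run(["psql", "-q", "-d", SCRATCH_DB, "-f", DUMP], check=True)

                              # Sanity check: the restored data should contain a plausible number of rows.
                              out = subprocess.run(
                                  ["psql", "-t", "-A", "-d", SCRATCH_DB,
                                   "-c", "SELECT count(*) FROM users"],   # hypothetical table
                                  check=True, capture_output=True, text=True,
                              )
                              assert int(out.stdout.strip()) > 0, "restore produced an empty users table"
                              print("restore test passed")
                              ```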

                              1. 4

                                I wonder what happened to their fire suppression system. My old university used halon gas suppressors to prevent water/sprinkler system damage in the event of a fire. I wonder if they had something in place and it didn’t go off, or if it wasn’t enough to control the fire.

                                1. 7

                                  From this forum post[1] about OVH data centers, with extensive use of Google Translate (five years of French in school? I can barely remember a lick :))

                                  • Their fire suppression system was a traditional sprinkler system, not a water mist system suitable for electronics rooms or an inert gas system
                                  • Their sprinklers had to be manually activated by datacenter operators, rather than triggering automatically in response to a fire
                                  • There was a significant amount of wood used in the construction, and it didn’t look like much thought was given to firewalls (insert data center joke here)

                                  If all those are true, I’d suspect that the fire was either already out of control when the sprinkler system was activated, or that the data center operators didn’t activate it in time out of fear of collateral damage.

                                  1. 1

                                    Dans le cas de Roubaix 4, le Datacenter est fait avec beaucoup de bois (“In the case of Roubaix 4, the datacenter is built with a lot of wood”)

                                    That sentence is about RBX4 (another DC, more than 250 km away in Roubaix); the fire was in SBG2 (in Strasbourg).

                                    1. 3

                                      Correct. However, Le Monde did say that the datacenter had wooden floors:

                                      « Le feu s’est rapidement propagé dans le bâtiment. On a mis en place un important dispositif hydraulique, à l’aide d’un bateau-pompe de grande puissance [qui a prélevé l’eau du Rhin], pour éviter la propagation aux bâtiments attenants », a déclaré à l’Agence France-Presse Damien Harroué, commandant des opérations de secours. « Les planchers sont en bois, et le matériel informatique, bien chauffé ; ça va brûler. Ce sont des matières plastiques, ça génère des fumées importantes et des flammes », a-t-il ajouté, pour expliquer l’important dégagement de fumée et la rapidité de propagation de l’incendie.

                                      My loose translation, as a native French speaker:

                                      “The fire spread rapidly through the building. We set up a major water-pumping operation, using a high-powered fireboat [which drew water from the Rhine], to prevent the fire from spreading to the adjoining buildings,” Damien Harroué, commander of the rescue operations, told Agence France-Presse. “The floors are made of wood, and the computing hardware runs hot; that is going to burn. It’s plastic material, which generates a lot of smoke and flames,” he added, to explain the heavy smoke and the speed at which the fire spread.

                                2. 4

                                  This took out the puzzle server on Lichess

                                  1. 3

                                    This affected a customer of mine. At least he had a server for backups; I’m setting up everything to work on that machine. It’s a bit underpowered, but it’ll have to do for the time being.

                                    1. 1

                                      How long does it take to get everything up & running again from backups?

                                      I moved the project we worked on from Tilaa to Leaseweb, and just transferring the data took 6 hours (at least I could prepare for it).

                                      1. 4

                                        Well, it is the backup server we’re abusing for hosting for now, so the data transfer itself is not a big issue. Restoring databases took ages because the disks are rather slow, but I’ve restored about 8 websites now, and I started around 8 o’clock, so that took 4 hours, including some getting to grips with restoring. I had set up the server but forgotten a lot of details. And I also discovered my customer had added a new main directory which wasn’t being backed up (see the sketch below)… Really, you want to practice this stuff so you’re not caught flat-footed when shit hits the fan, and to make sure you have all you need.

                                        I expect the rest of the sites to be less work, but there’s a lot of manual stuff involved.
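
                                        By the way, that “new directory nobody told the backup job about” problem can be caught with a tiny check; this is a hypothetical sketch (the web root and the include list are made up), comparing what actually exists on disk with what the backup script covers:

                                        ```python
                                        from pathlib import Path

                                        WEB_ROOT = Path("/var/www")                      # hypothetical web root
                                        BACKED_UP = {"site1", "site2", "shared-assets"}  # what the backup job includes

                                        live = {p.name for p in WEB_ROOT.iterdir() if p.is_dir()}
                                        missing = live - BACKED_UP
                                        if missing:
                                            print("WARNING: not covered by the backup job:", ", ".join(sorted(missing)))
                                        ```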

                                        1. 1

                                          Why did you relocate? Price?

                                          1. 2

                                            Price was indeed one aspect. Upgrading to a better virtual machine at Tilaa was about the same price as a dedicated machine at Leaseweb. The specs for the dedicated machine were waaay better, so more for the same price.

                                            Another aspect was that we agreed to upgrade at a certain time window, and Tilaa missed that window and I couldn’t reach them whatsoever. A second window was painful to negotiate with the customer, and with a dedicated machine I didn’t have to rely on a third party to press a button.

                                      2. [Comment removed by author]

                                        1. 24

                                          Can you edit your link to not include all of the tracking garbage? https://twitter.com/playrust/status/1369558033559412737

                                          1. 5

                                            Whoops, you’re right. I could not edit it anymore so I deleted it.

                                        2. 1

                                          I wonder if/how this impacts the Tor network.