1. 71
    1. 38

      If you replace S3 with IBM OS/VS2 MVS, AMD with DEC, and adjust the prices for inflation back to the late 1970s or so, this reads a bit like the early experience of the first people who productively replaced their mainframes with department VAXen & co.

      The “popular history” version of how the mainframes got displaced by minis is mostly focused on the part where the minis eventually got so fast they just smoked IBM and what was left of the seven dwarves out of every machine room. But that version is missing some nuance, likely because the computer industry’s progress at the time was, indeed, extraordinary. That bit of nuance is how mainframe vendors (and the time-sharing rental industry, to some degree) helped dig some of their own grave through mechanisms such as:

      • Increasingly diverse licensing and feature options, which started with good intentions (increased flexibility) but devolved into so many disjoint offerings that anything cheap didn’t do everything you needed, and anything that came close was so expensive DEC’s salespeople could use it for target practice.
      • Increased costs in tooling and management + higher connectivity costs which mainframe vendors really didn’t want to swallow, because they were used to the kind of profit margins they’d enjoyed in the sixties.
      • Increasingly uncertain pricing, at the intersection between these two factors; minicomputer installations started to be attractive even past their initial niche in part because you could at least get a reasonably straightforward answer to questions like “how much will it cost us to process 10,000 records/week?”
      • Increasingly shabby software. Our idea of how the department VAX came to be is currently dominated by stories of OpenVMS uptimes under a steady load of number crunching for nuclear missiles, BSD hacking, and tools like BRL-CAD. But in many places, the department VAX displaced a mainframe at the mundane task of filing forms and counting beans (usually with custom software, but that was already changing, too). That happened, in part, because the developer experience they delivered was increasingly poor and, eventually, (what was left of) CDC, Honeywell, Burroughs & co. became increasingly unable to deliver systems that were competitive even for storing text without making you want to tear your eyes out.
      • Increasing inability to support software specialization. Poorer development experience meant that, in time, customers expected faster turnaround for custom-tailored solutions, and they could get it, either in-house or via COTS software, at a rate that mainframe vendors weren’t able to keep up with past a point.

      There are two things that are good about cloud offerings and I think that, despite cloud vendors’ claim to the contrary, they are quite disjoint:

      • Easily scalable hardware resources – this is a given and probably not something that we can get without tl;dr sitting on a chunk of someone else’s datacentre, at least not for the foreseeable future
      • Convenient resource management and deployment – which is something that I hear is becoming more viable to do locally these days, without depending on large cloud vendor schedules, and also with full control over billing, eh, Azure?

      Except where the former is a strict requirement, I have a feeling that smaller players can eat some of Big Cloud’s lunch, not necessarily by delivering some technological coup de grace, but just by not getting high on their own supply.

      1. 9

        The big problems with the easily scalable hardware resources are that:

        1. very few people actually build systems that can throttle down to zero more than once
        2. the premium you pay for cloud resources is almost always more than the cost of running all the hardware all the time

        Both of these problems could be solved, and I’m not entirely clear on why they aren’t. On the cost thing: cloud providers are competing with each other on other things. Maybe it’s tacit collusion, maybe they just aren’t capable of reducing their cost to serve?

        On the throttling down to zero and back thing, this is empirically hard, and I wonder if the reason we don’t collectively get better at it is that in the current environment it wouldn’t really save us any money. But I also wonder why it’s hard in the first place. Some problems (like storage) obviously don’t scale up and down very neatly, but most people separate those out. I’d really like to seed a discussion with some theory of why we’re so bad at this, but…
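
        To seed that discussion with arithmetic rather than theory: the cost side of scale-to-zero can at least be sketched as a toy break-even model (all numbers below are illustrative assumptions, not anyone’s real pricing):

```python
# Toy break-even model: below what utilization does pay-per-use cloud
# beat always-on owned hardware? All figures are assumptions.

def monthly_cost_cloud(hourly_rate, hours_used):
    """Cloud: pay only for the hours you actually run."""
    return hourly_rate * hours_used

def monthly_cost_owned(amortized_capex, fixed_opex):
    """Owned hardware costs the same whether busy or idle."""
    return amortized_capex + fixed_opex

HOURS_PER_MONTH = 730
cloud_rate = 0.40        # $/hour for a comparable instance (assumed)
owned_monthly = monthly_cost_owned(100.0, 20.0)  # amortized box + power/space

break_even_hours = owned_monthly / cloud_rate
print(f"Cloud wins below {break_even_hours:.0f} h/month "
      f"({break_even_hours / HOURS_PER_MONTH:.0%} utilization)")
```

        If your system can’t actually throttle below that utilization, the cloud premium never pays for itself.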

        1. 4

          There are a lot of factors that come into play in all this and I honestly think some of them are technical. I don’t have the papers at hand (I was interested in this subject on account of $work a while back but that’s no longer quite the case) but there was a lot of interesting black magic related to latency and congestion management in large datacentres being published a while back, and some of it involved throwing a bunch of stuff we learn in Computer Networking 101 courses out the window. It’s a completely alien world that has to solve problems which are completely outside the realm of networking or storage at normal scales.

          But there are also organisational factors that inevitably come into play when you’re tl;dr building infrastructure that you rent to others under tremendous time and cost pressure. The number of services that something like Azure covers is staggering: they give you everything from SIEMs to logging and from batch CPU processing to GPU compute. The sheer organisational effort required to develop and coordinate all that is enormous, and it’s being amortized over an increasingly static customer base. This became particularly acute in the last few years, as the adoption boom slowed down and cost control became an integral part of providing the highly sought-after growth.

    2. 18

      I’ve been doing some back of the napkin math on my company’s cloud transition and containers are incredibly expensive. (An order of magnitude more expensive than virtual servers!)

      The joke that “Kubernetes was the Greek god of spending money on cloud services” is pretty accurate.

      On the other hand, increasing our headcount is more expensive than containers. We actually save money this way. And we’re unlikely to grow our headcount and business enough that switching to less expensive infrastructure would be cheaper in the long run.

      1. 12

        … I’m confused, how is “adopt containers in a ‘cloud’” an alternative to “hire staff” ?

        1. 7

          Depending on scale, you need to have skills and hours for:

          cloud containers:

          • containerization
          • orchestration
          • cloud networking (high level)
          • cloud security
          • access management

          physical hardware in a datacenter:

          • hardware build/buy, deploy, monitoring, maintenance
          • network setup, deploy, management, monitoring (low-level)
          • security
          • access management

          If you think that one of these requires skills and hours you don’t currently have, and you do for the other, then you need to hire people.

          1. 9

            Ah yes, the old “it’s the cloud or break ground on your own datacenter, there’s no in between” trope.

            1. 20

              That’s uncharitable. Everything I attributed to “physical hardware in a datacenter” applies equally to renting rackspace from an existing colo provider… which is what my employer does.

              You can also lease servers from many datacenters, pay them for deployment, and pay them for networking.

            2. 11

              It took me a while to figure out what parent is getting at but I think it’s a matter of walking a few miles in young people’s shoes. All this is happening in 2023, not 2003. Lots of people who are now in e.g. their late twenties started their careers at a time when deploying containers to the cloud was already the norm. They didn’t migrate from all that stuff in the second list to all that stuff in the first list, they learned all that stuff in the first list as the norm and maybe learned a little about the stuff in the second part in school. And lots of people who are past their twenties haven’t done all that stuff in the second list in like ten years. Hell, I could write pf and iptables rulesets without looking at the man pages once – now I’m dead without Google and I woke up to find nftables is a thing, like, years after it was merged.

              It’s not a dying art (infrastructure companies need staff, too!) but it’s a set of skills that software companies haven’t really dealt with in a while.

              1. 2

                I’m actually more skilled in running servers than containers. My company is transitioning to the cloud and I’m getting the crash course on The New Way. Docker and Dockerfiles are currently the bane of my existence.

                But I can’t ignore that containers allow a level of automation that’s difficult to achieve with virtual or physical ones. Monitoring is built in. Monit or systemd configs aren’t needed anymore. They’ve been replaced by various AWS services.

                And frankly, we can push the creation of Docker images down the stack to experienced developers and keep operations headcount lower.

                It’s more efficient to hire a developer like me who works part time on devops than to hire a developer and a devops person.

                1. 1

                  I’m 100% not an infra guy so I’m probably way off but my (possibly incorrect) expectation is that a company that’s running cloud-hosted services deployed in containers & co. at the moment would also deploy them in containers in a non-cloud infrastructure, too. I mean, regardless of whether that’s a good idea or not in technical terms (which I suspect it is but I have no idea) it’s probably the only viable one, since hardly anything can be built and run in another environment today. IMHO you’d need people doing devops either way. Tooling may be “just” a means to an end but it’s inescapable and we’re stuck with the ones we have no matter what we run them on.

                  That’s probably one reason why gains like the ones the author of the article wrote about are currently accessible only to companies running large enough and diverse enough arrays of services, who probably need, if not super-specialised, at least dedicated staff to manage their cloud infrastructure. In that case, you’re ultimately shifting staff from one infrastructure team to another, so barring some initial investments (e.g. maybe you need to hire/contract a network infra expert, and do a lot of one-off work like buy and ship cabinets and the like), it’s mostly a matter of infrastructure operation costs.

                  Smaller shops, or at least shops with less diverse requirements and/or lighter infrastructure requirements that can be (mostly?) added to the developers’ plates aren’t quite in the same position. In their case, owning infrastructure (again) probably translates into having a full-sized, competent IT department again to keep the wheels spinning on the hardware that developers deploy their containers on. So they’d be hiring staff again and… yeah.

            3. 1

              I mean, there are other options where you rent VMs or even physical servers, but those require additional skills as well that you have to hire for. If you’re alluding to a PaaS then you won’t need additional headcount, but you may well be spending more for your resources than you would in the cloud.

              1. 3

                I’m coming at this with quite a bit of grey in my beard, but it makes me profoundly uncomfortable to think that the folks who are responsible for all of the cloud bits that “dsr” outlines would be uncomfortable handling the physical pieces. I get that it’s a thing, but having started from the other side (low-level), the idea that people are orchestrating huge networks without having ever configured subnets on e.g. an L3 switch… that freaks me out.

                1. 4

                  Fun, isn’t it? I don’t (usually) feel like I’ve been at this that long, but a lot of fundamentals that I’d have expected as table stakes have been entirely abstracted away or simplified so much that people starting today just aren’t going to need to know them. (Or, if they do, are going to need a big crash course…)

                  OTOH I spend a lot of my time realizing that there’s yet another new thing I need to learn to stay current…

                2. 3

                  I feel attacked xD

                  More seriously, I love programming, but years of family and friends asking me to help with their network issues over the phone or text have completely killed my will to do this kind of configuration.

                  The exception being terraform; I was pleasantly surprised by how satisfying it was to be able to declare what you want and inspect the plan before executing it. But that’s still pretty high-level I guess…

            4. 1

              I think even when colocating, you still need some extra level of expertise. There are definitely plenty of people who can get by with cloud hosting but would be overwhelmed by the issues that come with managing the hardware.

              I think that if you have people in a team with that skillset, though, then it’s a different calculus. But it’s hard to overstate how little you have to think about the hardware with cloud setups. I mean you gotta decide on some specs but barely. And at least in theory it lets you ignore a level of the stack somewhat.

              Most companies are filled with people who are merely alright at their jobs, and so when introducing a new set of problems you’re looking at pulling in new people and signing up for a new class of potential problems.

        2. 2

          You need slightly fewer people if you don’t have servers (virtual or otherwise) to monitor and maintain.

          As annoying as I’m finding The Cloud, containers natively support automation in a way servers do not. Linux automation isn’t integrated, it’s bolted on after the fact.

          It’s easy to mistake something you’re familiar with as being simpler than something you aren’t.

    3. 8

      And we’ll have much faster hardware,

      That’s one of my biggest problems with the cloud providers. The hardware you get is so limited; you get so much more CPU and RAM on a real machine. My ideal is still renting hardware: you don’t have to spend nights in the datacenter hanging boxes, but you do get powerful machines.

      1. 15

        Hetzner provides hardware rental at reasonable pricing.

      2. 6

        Whenever I look at the pricing for cloud stuff, it’s the storage that I find really difficult to swallow. I guess the reason is that it’s more difficult to move and impossible to over-provision but it’s still incredibly expensive.

        1. 10

          You can buy a (reasonably good) 1 TB SSD for $80 right now, which is around $0.08/GB (and that’s not even the best $/GB). A place like DO will sell you storage at $0.10/GB per month, and there are providers which charge even more. Now, I know this is managed block storage, and you’re paying for the ability to scale your disk without copying everything over, but that’s a huge premium.
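
          Amortizing those numbers makes the premium concrete (the 5-year disk lifespan is my assumption):

```python
# Managed block storage vs. a bought SSD, using the figures quoted above:
# 1 TB SSD at $80 one-time, block storage at $0.10/GB per month.

ssd_price = 80.0                     # $ one-time for 1 TB
block_rate = 0.10                    # $/GB per month
tb_in_gb = 1000

block_monthly = block_rate * tb_in_gb        # $100 per TB per month
ssd_lifespan_months = 5 * 12                 # assume the disk lasts 5 years
ssd_monthly = ssd_price / ssd_lifespan_months

print(f"Managed block storage: ${block_monthly:.0f}/TB/month")
print(f"Amortized SSD:         ${ssd_monthly:.2f}/TB/month")
print(f"Premium: {block_monthly / ssd_monthly:.0f}x")
```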

          1. 6

            Yeah, at 100 times more expensive per month, even when factoring in redundancy (let’s say x3) and multi-location (let’s say x5), it’s still 6 times as expensive, per month. The main cost I cannot easily compute is the host for the SSDs, since they need to be attached to the same machine, but at such prices you could either pay someone to come and plug in the SSD, or pay for a huge machine out of the storage charges alone.

            It seems they had S3 (not the same performance as local SSD) at something like $1M for 8 PB, which means each terabyte would cost $125. That’s more reasonable, at only about 10 times the cost of a raw disk per year. I’m not sure low-volume pricing for S3 is the same, however.
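
            For what it’s worth, the arithmetic checks out roughly like this (the $1M/8 PB deal figure is as quoted above, billing period unclear):

```python
# Sanity check of the figures above: the ~100x monthly premium divided by
# replication (x3) and multi-location (x5), plus the quoted S3-scale deal.

raw_premium = 100                  # block storage vs raw disk, per month
replication = 3
locations = 5
adjusted_premium = raw_premium / (replication * locations)
print(f"Premium after redundancy: {adjusted_premium:.1f}x")

deal_price = 1_000_000             # $ for the quoted S3-scale deal
deal_capacity_tb = 8 * 1000        # 8 PB in TB
print(f"Per terabyte: ${deal_price / deal_capacity_tb:.0f}")
```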

          2. 2

            Power to run things costs too. My (extremely generous to DO) calculations suggest each terabyte SSD could cost them $5 / mo to run (of the $100 / mo they bill it out at).
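
            A rough sketch of that electricity estimate (the wattage and $/kWh below are my assumptions):

```python
# Electricity cost per SSD slot. Assumes ~10 W per drive including its
# share of the host, at $0.15/kWh; both numbers are rough assumptions.

watts = 10
kwh_per_month = watts * 730 / 1000     # 7.3 kWh
cost_per_month = kwh_per_month * 0.15
print(f"~${cost_per_month:.2f}/month in electricity per drive")
```

            …which suggests that $5/month figure is indeed generous.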

          3. [Comment removed by author]

        2. 7

          For me it’s egress costs. It’s insane. Quoting myself from the bird site:

          “Reminder that for an on-demand t4g.nano instance on AWS, the cost of 3TB/month of egress traffic to the internet is 100x the cost of the compute”

          ~$2.5 for compute, $250 for egress.
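
          Reconstructing that comparison with typical published rates (the $0.0042/hour and $0.09/GB figures are assumptions from memory; check current AWS pricing before relying on them):

```python
# Compute vs. egress for a tiny always-on instance pushing 3 TB/month.
# Rates are assumed: t4g.nano on-demand ~$0.0042/hour, egress ~$0.09/GB.

HOURS_PER_MONTH = 730
compute = 0.0042 * HOURS_PER_MONTH      # ~$3/month
egress = 3 * 1000 * 0.09                # $270/month at a flat $0.09/GB

print(f"Compute: ${compute:.2f}/month, egress: ${egress:.0f}/month")
print(f"Egress is ~{egress / compute:.0f}x the compute cost")
```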

          1. 5

            AWS egress is so steep it’s driven architectural changes at work.

          2. 2

            You’re not even kidding. That’s incredible. Absolutely bonkers. A datacenter near me offers a 10gbps unmetered upgrade for $200/mo, which I know is on the cheap end, but still.

          3. 1

            And 1Mbps billed at the 95th percentile should be around 1€/month now (at most). For 3TB you’d need 10Mbps, which means around 10€ per month. Bandwidth in Australia seems much more expensive (20x), maybe at 200€ per month. But I’m sure AWS already bills Australian bandwidth at Australia prices.

            I’m seeing other figures which put costs of 100Mbps at 50 GBP and 1Gbps at 300 GBP. I doubt Amazon still counts in Mbps.
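
            For anyone who hasn’t met it, 95th-percentile billing works roughly like this (a sketch, not any specific provider’s exact method):

```python
# 95th-percentile bandwidth billing, sketched: sample utilization every
# 5 minutes, discard the top 5% of samples, bill the highest remaining one.

def billable_mbps(samples_mbps):
    """Return the 95th-percentile sample (top 5% of bursts are free)."""
    ordered = sorted(samples_mbps)
    last_kept = int(len(ordered) * 0.95) - 1
    return ordered[last_kept]

# A month of 5-minute samples: mostly ~8 Mbps, with occasional big bursts.
samples = [8] * 8500 + [900] * 140      # 8640 samples, roughly 30 days
print(billable_mbps(samples))           # bursts fall in the top 5%: prints 8
```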

      3. 1

        The big cloud providers offer instances up to around 64 cores with over 1 TiB of RAM (and up to 4 high end GPUs). You can buy faster machines, but few people do (or need to).

    4. 7

      I get that they want to promote this as a thing they’re doing that’s novel, but I don’t think people should take that bait.

      Dedicated hardware and non-cloud providers trade flexibility for costs. It’s as simple as that. It makes sense in some situations to do this! A lot of companies don’t have that consistent capacity planning, and definitely don’t have that consistency on a ~5+ year time horizon.

      For them cloud still makes sense, even if on the margin they’re spending more for that flexibility.

      1. 9

        I think the major point is that the flexibility angle could be overplayed.

        If you have a SaaS with relatively static sizing, you can afford to triple your capacity and sit at mostly idle all the time and still save massive amounts of money.

        That strikes me as wrong, the cloud is likely mis-priced.

        If people really have lost hardware management skills in their entirety (as in, not even being able to connect to an IPMI on fully managed hardware), then we truly are serfs to our landlords now and they can charge whatever they want in perpetuity; it’s the ultimate drug dealer’s deal (something free to get you hooked).

        Whether you genuinely need more staff to run bare metal remains to be seen, but I am interested in the result. Running on a cloud with a wide scope also has a headcount price associated (you almost certainly have staff whose job it will be to write terraform, audit billing, maybe even spec your quotas and so on).
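
        The overprovisioning arithmetic is simple enough to sketch (the markup and hardware costs below are assumptions, not anyone’s real bill):

```python
# "Triple your capacity and still save": owned hardware at an assumed 5x
# discount to an equivalent cloud instance, overprovisioned 3x for headroom.

cloud_monthly = 500.0      # $ per equivalent cloud server (assumed)
owned_monthly = 100.0      # $ amortized hardware + colo per server (assumed)
overprovision = 3          # buy 3x capacity and sit mostly idle

owned_total = owned_monthly * overprovision
print(f"Owned, 3x overprovisioned: ${owned_total:.0f}/month "
      f"vs cloud: ${cloud_monthly:.0f}/month")
```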

        1. 2

          That strikes me as wrong, the cloud is likely mis-priced.

          absolutely this… long term things will even out, i think at this point AWS is just so far above everyone else that the profit margins are just way higher, making things like the original post feasible for more workloads than are even intuitively cloud-native

    5. 6

      I’m curious about what storage stack they are using to replace S3. Highly-available setups like Ceph and Gluster are picky in their care and feeding, which I guess could be fast-tracked with a few consulting sessions.

      1. 6

        Ceph and Gluster are picky…

        +1. Of all the services to eject, S3/GCS are the hardest because storage is tricky to get right. Ceph and Gluster work fine, until they don’t, and you usually need some in-house expertise.

        Apart from those, I’ve used Minio in my homelab, and while it’s been pretty nice I’ve also had some real rough upgrades. It’s a homelab, so I don’t feel too stressed about it, but if I were running my business off it I’d be a little more worried.

        None of this is impossible, but it’s easy to forget that the cloud basically solved storage (at a premium).

      2. 5

        Paid onprem storage solutions can be pretty solid. Our last NetApp bill was about $400 per terabyte. “S3 Standard” is $276 per terabyte per year, so we’re saving money if the disks last just 2 years.
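
        The break-even on those figures (one-time $400/TB vs $276/TB/year):

```python
# Break-even for the NetApp-vs-S3 figures quoted above.

onprem_per_tb = 400.0        # $ one-time, on-prem
s3_per_tb_year = 276.0       # $ per TB per year, "S3 Standard"

break_even_years = onprem_per_tb / s3_per_tb_year
print(f"Break-even after about {break_even_years:.1f} years")
```

        …so “just 2 years” is, if anything, conservative.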

      3. 3

        minio is magical

      4. 2

        what storage stack they are using to replace S3

        They aren’t replacing S3 just yet

    6. 5

      I look forward to finding out what they’ll be running on the bare metal this time. A conventional distro like Ubuntu or Debian? a VM hypervisor like vSphere or Proxmox VE? An immutable container-focused OS like Flatcar? And will they be doing OS installs on boot drives, or doing some kind of PXE boot setup? They’re now going a level below the abstraction of ephemeral, easily replaced VMs that they currently have with something like EC2, so these choices matter now.

      I also wonder what they’ll be using for storage, e.g. ZFS, LVM, or something else. In typical cloud deployments, the durable storage is all managed by the cloud provider (through things like RDS, S3, and maybe EFS), and the ephemeral VMs can just have smallish ext4 root filesystems.

      1. 7

        And don’t forget problems like “my VM/container/whathaveyou crashes every couple of days on this new hardware we’ve got, nobody has a clue why, and there is nobody else to pass the buck to”.

        We do something like this (custom Debian-based PXE-booted in-memory OS that runs VMs for CI) and while we would never have been able to afford to run this on the cloud and we are able to run it on hardware that’s not available on the cloud (like M1 Mac Studio for aarch64 builds), it’s not all roses. For example, the Mac Studio has a USB controller that requires firmware loaded during boot and there is some race that causes it to get wedged once in a while after reboot and which requires cold boot to recover.

        1. 6

          And don’t forget problems like “my VM/container/whathaveyou crashes every couple of days on this new hardware we’ve got, nobody has a clue why, and there is nobody else to pass the buck to”.

          Eh, my experience with running Linux on commodity x86-64 server hardware, particularly rented from a provider like OVH, is that it just keeps running, and if there’s a reliability problem, it’s something I did wrong at a higher layer of the stack.

    7. 7

      So apparently they’re not paying their staff? And all that hardware will just all keep working for all those years?

      1. 18

        It says in the post: “Without changing the size of our ops team.”

        1. 6

          right, which reeks of magical thinking

          1. 1

            I mean who knows, maybe they have 2 sysadmins

        2. 4

          And their amortisation goal of 5 years is quite conservative, I’d say.

      2. 3

        They are moving from cloud to renting space at Deft. I guess if something goes wrong, they pay a $100/hour support cost at worst.

    8. 4

      FWIW, I always thought what the cloud bought you was flexibility. As a startup, being able to reinvent your infra was a competitive advantage. I don’t see it making as much sense for a larger organization, unless you’re paying to break established bureaucratic nightmares where it’s impossible for teams to get resources. Again, flexibility.

      Use of cloud infra is also a good way to invest in a HA strategy. Getting HA in your own DC(s) is tricky. Spinning up critical services across multiple providers is tricky, but easier than that.

    9. 3

      A non-cloud deployment seems a lot more reasonable than you’d think given how little you hear the option talked about. (As context, the app I work on started on-prem and moved to AWS years back, and other parts of the larger company we’re part of still have space at a datacenter.)

      You need to overprovision for flexibility/reliability. You need some staff time for the actual maintenance of the servers; it may help to keep your physical deployment simple, a different thing from its size. You also need to adjust to the absence of some nice convenient cloud tools. You can still come out fine even given all those things.

      Echoing another comment, you can get some pretty incredible hardware outside the cloud–Amazon’s high-I/O local-SSD hardware lagged what you could get from a Supermicro reseller for as long as I was tracking it.

      We’re now pretty committed to AWS. We integrate S3, SQS, ALBs, Athena, etc. and some features/processes count on the easy replaceability of instances. Flexibility is also useful in less tangible ways. I’d also note this blog post shares the common weakness of talking up expected upsides of a change before dealing with the downsides.

      Still, I don’t at all think the non-cloud approach is unreasonable. In a way, it’d be neat to hear about some successes outside the major clouds, and I wouldn’t mind more people taking another look at other hosting options, both because it could make some cool things possible and it could nudge the cloud providers to be more competitive.

    10. 3

      DHH is a bold bloviator, and he won’t stop talking about this while it’s working out. If he stops talking about this, you’ll know it didn’t quite work out.

    11. 3

      This post will make sense in three years’ time, when they have been through the experience. Right now it reads more like a cross between wishful thinking and an I-told-you-so rant.

      I’m impressed by the fact that DHH is open about his decision making process and shares high level numbers.

    12. 3

      Good luck, I’m all for selfhosting (or less SaaS) - but it can be the RIIR death, simply because you might lack the competence or resources.

      1. 8

        The problem of requiring competence and resources doesn’t just go away with cloud setups. But it really depends on the details. Wrong assumptions and forgotten things are common even in bigger companies with reasonably sized SRE teams. Sometimes this is caused by abstractions of cloud services, where you have to sift through bits of information that aren’t necessarily part of the official documentation.

        Both cloud services and your own setups work fine as long as nothing unexpected happens. But it’s not like the cloud magically does better once things break. Sure, you outsource parts of that to the cloud provider, but that doesn’t mean the problem goes away, that your corporation has priority, that it can be quickly fixed, that it’s magically bug-free, and so on. And while cloud providers might have more people (though they still want to optimize costs), they also have, and need, a many times more complex infrastructure, and so on.

        I think clouds are often mystified as magically solving all problems. At the same time, many cloud service benefits can be replicated relatively easily. Of course it depends on what exactly you do, but I think some things considered benefits of the cloud are easy to gain if you design your applications within the limitations of clouds and containers (like how you manage state).

        And then you can compare the costs of hiring someone to do it or outsourcing it to a cloud provider.

        There is a reason why SREs are well-paid.

        And if you think about Terraform it’s pretty much a format to create specifications for the infrastructure you require.

        1. 2

          doesn’t just go away with cloud setups

          That’s for sure. But you will find many more resources on “how to host X at cloud Y”, and also support from the company, than for doing this on bare metal. And if it fails, it’s doubly hard, because everyone and their pet will tell you that not going the cloud way was the actual mistake.

          Cloud setups can fail for sure, and companies may underestimate that you actually also need competent people here - but at least ordering non-faulty hardware (testing) and sizing down is very easy for them.

          1. 2

            I’ve had hardware failures on the cloud. Quickly resolved though after contacting support. But I’d expect that from any hosting provider/data center. Hardware failures can be strange through the abstraction, so especially when we are not talking about compute instances it’s hard to tell what the issue is.

            As for “how to host X at cloud Y”: I don’t know. Might be, but I don’t really read those, and I’d strongly suggest people not rely on some random tutorial off the internet for these things. Even with the best intentions there is a big chance of errors. Or miscommunications, like something being an example non-production setup. Or simply not knowing better. I’ve had a point in my career where I took over from a developer without any real knowledge of how systems work, who thought it would be enough to follow some Ubuntu tutorials. That wasn’t much fun to clean up after.

            Just in case someone in that position reads this: Tutorials like these are usually introductory and while they might be interesting if you have never done any infrastructure work I’d suggest avoiding hosting stuff until you are more comfortable and really understand what you are doing. If you start out, start out with one thing, until you know how it fails and how to deal with such failure. Go for simple, minimal solutions, not for that big black box that does a million things and you not knowing what they really are.

            Once you do understand what you are actually doing, “hosting X on Y” guides are probably not the most interesting anyway. I’m not completely sure what you are talking about here, though; I am thinking “run currently trendy framework on Heroku/AWS/…”.

            Regarding blaming stuff on “not going the cloud way”: I haven’t really had that happen yet. I professionally have done and do both. Both with success. In the end it’s about people knowing what they do and choosing the right tools. Cloud work earns more money though, both for the hosting companies and for sysadmins/DevOps engineers/SREs. And of course pointing at some “look, their service is down” status screen with other notable companies being affected is probably psychologically easier. Even though in most situations you’re still to blame for not having an alternative, some plan for this event.

            All of that really depends on context though. For example I think Heroku, Fly.io and so on are really great if you want to just run your server side rendered SPA or prototype setup somewhere. It’s a way better option, especially if you have zero knowledge on servers. You can even build a business on top of it, but it might end up being very expensive or limited as you grow.

            At the same time I have had clients that were running some EC2 instance somewhere, and nobody really knew what they were doing. I’d end up with a consulting contract only to see that someone many years ago followed some guide and the whole OS had been EOL’d since that person left. What they get out of the cloud is an overpriced version of some cheap vserver, and what they would want is just renting some classical web space, but they were talked into “hosting in the cloud” because it’s “better” and “modern”, more reliable, which one needs for business. Of course that’s just an anecdote, but there are more complicated versions of that story of bigger companies doing the same thing at a larger, “more professional” scale.

            I’m not trying to convince anyone to move off the cloud though. What I am not a fan of is when people act like it’s magically more reliable, bug-free, or better. It’s a big industry and I think a lot of misunderstandings probably found their way into people’s minds through “good” marketing, and people repeating bold general statements without a real-life technical basis. The classical example is “it has to be bad, because it’s what people did in the past”, usually supported by calling everything else “legacy”.

            There are many more people thinking they benefit from some form of cloud hosting than actually do. Of course that’s also not unique to cloud offerings. And I certainly don’t blame anyone for it. It’s just not the general truth it is often portrayed as.

            Of course one goal here is vendor lock-in and dependencies in projects. I am happy though that a lot of this seems to be starting to fade, and I hope it will keep getting better over time. It would be nice if the majority of code/binary size weren’t libraries and SDKs for each and every cloud provider out there, because they all do things slightly differently. Some sort of standard would be nice. I know people are directly and indirectly working on that, and I’m a fan of how many tools and services (MinIO, SeaweedFS, Backblaze, etc.) now have S3 compatibility without being dependent on AWS. It would be nice if that found its way into more things than just object stores.
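
            To illustrate why that S3 compatibility helps: the S3 API is plain HTTP, and with path-style addressing the URL for an object differs between providers only in the endpoint. A rough sketch (the endpoints below are illustrative; a real client like boto3 additionally handles request signing and retries):

            ```python
            # Build a path-style S3 object URL; only the endpoint varies per provider.
            def object_url(endpoint: str, bucket: str, key: str) -> str:
                return f"{endpoint.rstrip('/')}/{bucket}/{key}"

            # Illustrative endpoints; any S3-compatible store works the same way.
            providers = {
                "aws": "https://s3.amazonaws.com",
                "minio": "http://localhost:9000",  # self-hosted MinIO default port
                "backblaze": "https://s3.us-west-004.backblazeb2.com",
            }

            for name, endpoint in providers.items():
                print(name, object_url(endpoint, "my-bucket", "backups/db.sql.gz"))
            ```

            Swapping providers, or moving to a self-hosted store, then becomes a one-line config change rather than a new SDK.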

            There’s a number of positive side effects here. One is that building a self-hosted replacement is an option. Competition can grow, which brings innovation and probably better pricing. Another is that one can build setups that aren’t fully dependent on a single company. If you run something critical, keep in mind that cloud providers do have outages of some services and likely will continue to as time passes. And of course anything from pricing changes to deprecations can hit companies critically in some situations. So it’s a good idea to have not just your regular disaster recovery plan, but also the option to quickly move to either another provider or a self-hosted solution, if necessary.

        2. 2

          but that doesn’t mean the problem goes away

          Catastrophic hardware failure kinda goes, though. Even with the simplest cloud setup imaginable, a couple of VMs with a load balancer. If your image automatically starts the app, whole servers can catch fire and you won’t notice, one vm goes down, another is started.
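
          That “one goes down, another is started” behaviour is essentially a supervisor loop. A minimal sketch in Python (the restart limit and backoff are made up for illustration; real autoscaling groups do this at the VM level, driven by health checks):

          ```python
          import subprocess
          import time

          def supervise(cmd, max_restarts=3, backoff=0.1):
              """Re-run cmd whenever it exits abnormally, like an autoscaling
              group replacing a failed instance. Returns the restart count on
              a clean exit; gives up after max_restarts consecutive failures."""
              restarts = 0
              while True:
                  result = subprocess.run(cmd)
                  if result.returncode == 0:
                      return restarts          # clean exit: stop supervising
                  restarts += 1
                  if restarts > max_restarts:
                      raise RuntimeError("giving up after repeated failures")
                  time.sleep(backoff)          # brief pause before "replacing"
          ```

          Usage would be something like `supervise([sys.executable, "app.py"])`; the cloud version just swaps the process for a whole VM image.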

          With a physical setup, a catastrophic disk failure could bring you down for hours if you don’t have a very robust setup. And even if you do, you still have to go there and fix it.

          Not saying this is a determining factor either way; it’s just one that I haven’t seen mentioned much in this whole thread.

          1. 3

            You are comparing an HA setup with a non-HA setup, not a cloud setup with a non-cloud setup.

            If you have a single server and no backups, your problem on AWS is just as big as it is everywhere else.

            The underlying hardware of cloud providers fails, and when it does and you had data on there you can’t lose, you’ll still have a problem.

            An OVH Cloud data center caught fire some time ago, and people who assumed their data was magically secure were surprised it wasn’t.

            You can have a RAID setup, then a single disk won’t hurt you; you can have multiple instances and a load balancer, then it’s the setup that you described; you can also have cross-network or cross-datacenter redundancy, but that’s not the default for a single instance on the major cloud platforms.

            Instances failing in large deployments also happens on AWS. You can create a new one, even automatically, sure. But data-wise, it will be gone.

            Physical servers can be kept around, or obtained at a hoster and deployed within minutes. But it really depends on what exactly you want to outsource. There’s a difference between running your own data center and renting dedicated servers. You can also have non-owned dedicated servers, or run a fleet of them and use one of the many solutions for hosting your own cloud setup with failover, etc. taken care of.

            It really depends on what your goals are.

            As you say, this might not be the determining factor. And one should not forget that cloud providers also have other kinds of outages.


            I’m not against cloud usage, but I think there are widespread misunderstandings of what you actually pay money for, how things work out, and what issues can or can’t arise. The term is probably also just very overloaded.

            1. 2

              You are comparing an HA setup with a non-HA setup, not a cloud setup with a non-cloud setup.

              I was trying not to, but I could have been clearer.

              Yes, you can get HA on prem, but look at how many extra layers you have to manage now. If you’re spinning up EC2 VMs with an autoscaling group and persisting to S3 and RDS, you will need a full availability-zone outage to get downtime (from AWS. You can still fuck it up yourself, of course), and even then you might not lose data if you’re backing up to S3.

              With onprem, you’ll need to:

              • set up RAID so a single disk failure doesn’t kill a server
              • set up multiple servers so a single server failure doesn’t kill the app
              • If, or rather, when, hardware fails, you need to replace it yourself
              • set up your own monitoring, and then make that HA, if you want to even know what failed.

              You don’t need to do any of this in the cloud. Maybe it’s not worth the extra cost, but on prem is more work to get things that are table stakes in the cloud.

              Update: now that I wrote it, I might be biased because I know cloud more than on prem. It’s reasonable that if you have on prem experience, all this sounds more trivial than it does to me. But as other people mentioned, this kinda skill is getting harder to even acquire, so, there’s that.

              1. 2

                (sorry for the long response, wanted to make it short, but then went into trying to explain where I am coming from and elaborating why it depends on a concrete situation)

                set up RAID so a single disk failure doesn’t kill a server

                Which is usually the default, in the sense that it’s hard to come by a server that doesn’t have either hardware or software RAID (or at least a disk setup that makes it easy to add).

                set up multiple servers so a single server failure doesn’t kill the app

                Where the setup part is trivial, just like the AWS one. One is usually a Terraform config, the other an nginx config.

                If, or rather, when, hardware fails, you need to replace it yourself

                That’s kind of true, though it depends on the specifics. If you rent a dedicated machine, you both save on costs compared to cloud products and have it replaced for you. Failing hardware is something you should be prepared for, but it is luckily also on the rarer side in reality (just as AWS requests failing, timing out, or causing issues are luckily rare). For example, when renting servers at a bigger hosting company, if you see in Grafana that one of your servers has a failing disk, you write a support ticket and an hour later it’s solved. The same is true for packet loss, other hardware topics, etc. You’re right that you have to have it monitored. But once it is set up, the maintenance tends to be on the smaller side. And since you need application-related monitoring as well, I don’t think of it as a huge additional effort. It’s mostly adding node_exporter and additional alerts.

                set up your own monitoring, and then make that HA, if you want to even know what failed.

                This is kind of true for the cloud instance setup you describe as well. You’d typically add health checks, just as you’d do in your local setup. So it again boils down to different config files.
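
                To make concrete how small that difference is: the health check a load balancer or monitoring system runs is just an HTTP probe, identical whether the target is an EC2 instance or a box in a rack. A self-contained sketch using only Python’s standard library (the `/healthz` path is just a common convention, nothing provider-specific):

                ```python
                import threading
                import urllib.error
                import urllib.request
                from http.server import BaseHTTPRequestHandler, HTTPServer

                class Health(BaseHTTPRequestHandler):
                    """Minimal health endpoint; a real app would check its DB etc. here."""
                    def do_GET(self):
                        ok = self.path == "/healthz"
                        self.send_response(200 if ok else 404)
                        self.end_headers()
                        self.wfile.write(b"ok" if ok else b"not found")

                    def log_message(self, *args):
                        pass  # keep request logging quiet

                def check(url, timeout=2.0):
                    """One probe, as a load balancer or Prometheus would run it."""
                    try:
                        with urllib.request.urlopen(url, timeout=timeout) as resp:
                            return resp.status == 200
                    except (urllib.error.URLError, OSError):
                        return False

                if __name__ == "__main__":
                    server = HTTPServer(("127.0.0.1", 0), Health)  # port 0: any free port
                    threading.Thread(target=server.serve_forever, daemon=True).start()
                    port = server.server_address[1]
                    print(check(f"http://127.0.0.1:{port}/healthz"))
                    server.shutdown()
                ```

                Point nginx, HAProxy, or an ELB target group at that endpoint and the failover semantics are the same either way.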

                Or are you talking about making the monitoring itself HA? Again, it depends on the setup, but you will know quickly in most scenarios. Finding out what failed, which sometimes isn’t completely obvious, is what I’d consider the main job of an SRE. So you certainly have that topic in cloud setups as well.

                Update: now that I wrote it, I might be biased because I know cloud more than on prem. It’s reasonable that if you have on prem experience, all this sounds more trivial than it does to me. But as other people mentioned, this kinda skill is getting harder to even acquire, so, there’s that.

                Makes sense. Of course it’s skill-based, and you should not set up anything for production you have no skill in. That’s how all the “customer information ends up on the public web” situations emerge. It might be worth acquiring these skills though, because it can make sense to mix, and it often helps with debugging issues in the cloud as well, or when contacting a cloud provider’s support.

                I think the example you gave with backing up to S3 is a great example of mixing. Even if you do on-prem setups, it totally makes sense to back up to S3/B2/GCS or anything else that can at least be configured to keep more than one copy, and that is actually able to counteract and not just detect bitrot (which is all most RAID setups do).
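
                The detect-vs-counteract distinction can be made concrete with checksums: a stored hash lets you notice a flipped bit, but repairing it requires an independent second copy, which is exactly what an off-site backup provides. A minimal sketch with made-up file contents:

                ```python
                import hashlib

                def sha256(data: bytes) -> str:
                    return hashlib.sha256(data).hexdigest()

                # A "primary" and an independent "backup" copy, plus a stored checksum.
                original = b"important business records"
                checksum = sha256(original)
                primary = bytearray(original)
                backup = bytes(original)

                primary[3] ^= 0x01  # simulate bitrot: one flipped bit in the primary

                # Detection is cheap: the checksum no longer matches ...
                assert sha256(bytes(primary)) != checksum

                # ... but repair needs redundancy: restore from the intact copy.
                if sha256(backup) == checksum:
                    primary = bytearray(backup)

                assert sha256(bytes(primary)) == checksum  # recovered
                ```

                Filesystems like ZFS do essentially this with per-block checksums plus redundant copies; a plain RAID mirror can tell you the disks disagree but not which one is right.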

                (from AWS. You can still fuck it up yourself, of course)

                Even though you just wrote that in parentheses, I think that is an excellent point. If you took somewhat reasonable precautions (e.g. having backups and some form of disaster recovery plan), you will likely not end up in a situation where some infrastructure-related problem brings you down. The same can be true with wrong assumptions (the bitrot topic). If the setup is getting more complex, an issue with changes there or in the application tends to be more likely. Not looking closely enough, or not having a peer review, your new app version or your Terraform change could cause a disaster; in both cases you need experience to avoid it or reduce the impact.

                But that’s what I really want to get at: the blind assumption that these problems are solved in the cloud and will be super hard in a non-cloud setup. People built reliable services, even in their free time, before the cloud came along. And that was a time when we didn’t have many easy-to-set-up services, from monitoring to backup, when many applications still mixed state into everything, when it was hard to get an overview, and when oddly enough very few people managed configuration properly. I think people associate a lot of these things with cloud setups, when most of these parts are easy to use, highly reliable, and quick to set up. Your Grafana + Prometheus + exporters setup is done quickly; load balancing web applications and creating images (and/or using configuration management systems) is just as possible; if you keep state in a database and use MinIO’s or SeaweedFS’s S3 compatibility, or even outsource that part and use S3 or B2 directly, you can recover quickly; and if you either have a hoster where you rent servers, have backup servers, or have a way to quickly get new ones up and running, that also works. Or for some, a private cloud on owned hardware might be what they want.

                So where I want to go with that is: it’s not the only option, and what makes sense really depends on what you run and what you need. For example, if you have a product with high constant load, where you process something and sell the output, you might run a lot cheaper on dedicated hardware, maybe uploading the result to some kind of object storage (your own, S3, or others), and for example running your customer interface potentially even at some web hoster. That might hugely increase your margin compared to just dumping everything into the cloud, where you may even end up dealing with regular upgrades that would be unnecessary if you didn’t use a service whose API can essentially change at any time.

                But that’s just one example. I think the topic becomes more interesting when you have an existing company with multiple services and products. In the simple web service case, if you’re just starting out, like I wrote, I usually recommend something like Heroku/Fly.io or maybe Google Cloud Run or other “serverless” solutions. But if you start to pretend that it actually is serverless, think you don’t need a disaster recovery plan, or assume it is cost-effective, you’ll likely run into trouble later on.

              2. 1

                Just to name something different:

                With onprem, you’ll need to

                Throw something like PVE on it, use its Ceph integration, and buy at least 3 machines. Then you always have 3 machines with a full copy of the data, and you have automatic HA where hosts can die and everything moves on. It’s a little bit more complicated if you want 0s downtime, but you can get there pretty easily too. If you buy such hosts you also get at least 5 years of “replacement disk next business day” with them, and a burn-in test upfront. So it’s not like you’re completely on your own.

    13. 2

      Cloud prices are atrocious if you are too small to be able to negotiate effectively. For one of my side projects (that doesn’t need HA or much bandwidth), I actually just built a server in my basement, because it will pay for itself in under 2 months compared to cloud. But at work I am dealing with companies that spend hundreds of millions or even billions on compute, and they are able to get some great deals from the big cloud vendors. They also tend to be too big to be able to run their own infrastructure very well (individual departments end up with their own data centers). For those customers, it’s a good deal.

    14. -5

      do we really need to suffer the presence of this ridiculous individual here?

      nothing against ciprian, but DHH and his bloviating can ruin a perfectly nice day.

      1. 11

        Why let someone’s existence bother you so much? Ignore it, move on, and have a happy life. Guess what, none of this matters and it isn’t worth stressing over.

      2. 1

        The original author can sometimes have “peculiar” opinions; however, I think this article is important because it gives some real numbers (in terms of $) for the cloud vs. self-hosted hardware debate.

    15. 1

      One point I don’t see mentioned enough is that keeping good, motivated system/devops engineers seems pretty difficult. I’m not sure how true this is industry-wide, but from my own personal anecdotes there’s always a struggle of maintaining your own infra, and then the one guy who ran everything leaves and everything stops for a while.

      Using cloud providers gives one huge advantage: public knowledge and a large hiring pool. This is one of those cases where UX often wins despite being more expensive.

      If you manage to get it done though, you definitely reap the rewards in savings and, imo, in having a much more fun and open infrastructure to experiment with! In one of my previous jobs we had an amazing system admin who’d brief us on all the cool stuff he’d learned, which was very entertaining and educational. In my last workplace we ran exclusively on Google Cloud, and the only time people talked about devops/infra was when things didn’t work. I’m rambling here, but what I’m getting at is that working with infra can be interesting and exciting.

    16. 1

      just shooting my mouth off, but yea hardware is looking so attractive compared to a $70k/mo cloud bill. the thing that depresses me is that it would mean going back to old school ops practices. all we got out of this cloud era is garbage tools, kubernetes & docker. nobody in their right mind would run kubernetes on their own HW (which is probably by design), it’s terribly designed. fortunately tools like Rust & even Python’s poetry have really started fixing code isolation; I would 100% feel safe running Rust apps on a server without docker (90% safe for Python XD). but man the orchestration…. what happens if I need to upgrade the kernel or add a new HDD? etc etc etc

      1. 3

        Hmm, I feel like there’s too much to respond to here. I’ll choose the second/third sentence.

        the thing that depresses me … all we got out of this cloud era …

        I guess if you are asserting you don’t like the current era of tools, you would rewind history back to where you liked it. So if we roughly went (pardon the reduction in parens):

        1. Mainframes (dumb terminals or teletype)
        2. Distributed PCs (as in desktops, desktop as servers)
        3. Bare metal servers (first “servers” based off commodity parts)
        4. Virtual machines (saturating your servers with vmware etc)
        5. Containers (defining your servers for shipping)
        6. Abstracted services / mesh / k8s (your servers are YAML now)

        Then you can rewind time to an approach you like. But then you are asserting that we made a mistake somewhere. I see some people going all the way back to dumb terminals even now, in a way: there’s cloud gaming (screen painting) and rumors of Windows 12 being cloud-only. It’s not all or nothing, but the pendulum of control vs. flexibility (or economies of scale) is, I think, fairly pure. You want to centralize for control, but then your costs are very high (surprise! that’s what #1 -> #2 was).

        So if you rewind time to #3 and use ansible, you probably aren’t going to saturate your servers. This wasn’t entirely the point of era #4, but there was some sales pitch at the time: “Don’t ssh in to manage your servers! Use chef/puppet/ansible! No pets!”. So if you rewind past cattle, you have pets, and you’re saying pets are ok for you. And mixed in here are many other things I can’t fit into this layering, like: where does cloud fit in? Cloud sort of forces your hand toward software definitions; the vendor probably has an API or some kind of definition tool. It’s interesting that this didn’t happen in the same way around #3, because back then you weren’t entering a vendor’s domain to rent their servers, where you are a guest that must conform because they have many customers and you are just one.

        This mainframe vs. app-mesh debate is very current in the HPC world. I think many have settled on hybrid. This isn’t surprising to me; I personally err toward hybrid, always. I guess I’m a hybrid absolutist (I guess I’m an absolutist?). Get your infrastructure to the point where you can make an infrastructure purchase decision in your own colo or in the cloud. Have the connectivity/tools/skills/culture/money ready to do either at any time. Mix and match. Of course, there are always caveats/trade-offs.

        Idk how to unpack the bits about poetry, kernel, HDD without writing a ton more.

      2. 2

        There are several lightweight distributions of Kubernetes, like minikube, KinD, and k3s, which can be run at home on commodity hardware. I used to run k3s at home, and probably will again soon.