1. 80
  1. 52

    Over the past few years of my career, I was responsible for over $20M/year in physical infra spend. Colocation, network backbone, etc. And then 2 companies that were 100% cloud with over $20M/year in spend.

    When I was doing the physical infra, my team was managing roughly 75 racks of servers in 4 US datacenters, 2 on each coast, and an N+2 network backbone connecting them together. That roughly $20M/year counts both OpEx and CapEx, but not engineering costs. I haven’t done this in about 3 years, but for 6+ years in a row, I’d model out the physical infra costs vs AWS prices at 3-year reserved pricing. Our infra always came out about 40% cheaper than buying from AWS, for as apples-to-apples a comparison as I could get. Now I would model this with a Savings Plan, and probably bake in some of what I know about the discounts you can get when you’re willing to sign a multi-year commit.
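
    A minimal sketch of roughly what that modeling looks like; every number below is a made-up placeholder, not a real figure from this comparison. The shape of it: amortize the hardware CapEx over its useful life, add monthly OpEx, and compare against an equivalent AWS fleet at reserved/committed pricing.

    ```python
    # Minimal sketch of an owned-infra vs AWS cost model. All numbers are
    # placeholders; plug in your own CapEx, OpEx, and reserved pricing.

    def owned_monthly_cost(capex, opex_per_month, useful_life_months=48):
        """Amortized monthly cost of owned hardware plus operating costs."""
        return capex / useful_life_months + opex_per_month

    def aws_monthly_cost(equivalent_instances, reserved_monthly_rate):
        """Monthly cost of an equivalent fleet at multi-year reserved pricing."""
        return equivalent_instances * reserved_monthly_rate

    # Placeholder inputs for, say, one rack of servers:
    owned = owned_monthly_cost(capex=250_000, opex_per_month=4_000)
    aws = aws_monthly_cost(equivalent_instances=40, reserved_monthly_rate=350)

    print(f"owned: ${owned:,.0f}/mo  aws: ${aws:,.0f}/mo  "
          f"owned is {100 * (1 - owned / aws):.0f}% cheaper")
    ```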

    That said, cost is not the only factor. Now bear in mind, my perspective is not 1 server, or 1 instance. It’s single-digit thousands. But here are a few tradeoffs to consider:

    1. Do you have the staff / skillset to manage physical datacenters and a network? In my experience you don’t need a huge team to be successful at this. I think I could do the above $20M/year, 75 rack scale, with 4-8 of the right people. Maybe even less. But you do have to be able to hire and retain those people. We also ended up having 1-2 people who did nothing but vendor management and logistics.

    2. Is your workload predictable? This is a key consideration. If you have a steady or highly predictable workload, owning your own equipment is almost always more cost-effective, even when considering that 4-8 person team you need to operate it at the scale I’ve done it at. But if you need new servers in a hurry, well, you basically can’t get them. It takes 6-8 weeks to get a rack built and then you have to have it shipped, installed, bolted down, etc. All this takes scheduling and logistics. So you have to do substantial planning. That said, these days I also regularly run into issues where the big 3 cloud providers don’t have the gear either, and we have to work directly with them for capacity planning. So this problem doesn’t go away completely; once your scale is substantial enough it gets worse again, even with cloud.

    If your workload is NOT predictable, or you have crazy fast growth, deploying mostly or all cloud can make huge sense. Your tradeoff is you pay more, but you get a lot of agility for the privilege.

    3. Network costs are absolutely egregious on the cloud. Especially AWS. I’m not talking about a 2x or 10x markup. By my last estimate, AWS marks up egress to roughly 200-300x their underlying cost! This is based on my estimates of what it would take to buy the network transit and routers/switches you’d need to egress a handful of Gbps (rough math sketched further down this comment). I’m sure this is an intentional lock-in strategy on their part. That said, I have heard rumors of quite deep discounts on the network if you spend enough $$$. We’re talking three-digit-million-dollar, multi-year commits to get the really good discounts.

    4. My final point, and a major downside of cloud deployments combined with a Service Ownership / DevOps model: you can see your cloud costs grow to insane levels due to simple waste. Many engineering teams just don’t think about the costs. The cloud makes lots of things seem “free” from a friction standpoint, so it’s very, very easy to have a ton of resources running, racking up the bill. And then it’s a lot of work to claw that back. You either need a set of gatekeepers, which I don’t love because that ends up looking like an Ops team, or you have to build a team to provide cost visibility and attribution.

    On the physical infra side, people are forced to plan, forced to come ask for servers. And when the next set of racks isn’t arriving for 6 weeks, they have to get creative and find ways to squeeze more performance out of their existing applications. This can lead to more efficient use of infra. In the cloud world, you just turn up more instances and move on. The bill doesn’t come until next month.
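
    A rough sketch of the math behind that 200-300x egress estimate. The wholesale transit prices below are assumptions (roughly what large commits go for, per Mbps per month), and ~$0.09/GB is AWS’s first-tier list price for internet egress; router/switch costs are left out here, so these markup figures come out a bit higher than a full estimate would give.

    ```python
    # Back-of-the-envelope: effective $/GB of committed IP transit vs AWS's
    # ~$0.09/GB first-tier internet egress. Transit prices are assumptions.

    SECONDS_PER_MONTH = 30 * 24 * 3600
    AWS_EGRESS_PER_GB = 0.09

    def transit_cost_per_gb(dollars_per_mbps_month, utilization=1.0):
        """Effective $/GB of a 1 Mbps commit at a given utilization."""
        gb_per_month = (1 / 8 / 1000) * SECONDS_PER_MONTH * utilization
        return dollars_per_mbps_month / gb_per_month

    for price in (0.10, 0.25, 0.50):  # assumed $/Mbps/month at large commits
        per_gb = transit_cost_per_gb(price)
        print(f"transit @ ${price:.2f}/Mbps/mo -> ${per_gb:.5f}/GB, "
              f"markup ~{AWS_EGRESS_PER_GB / per_gb:.0f}x")
    ```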

    Lots of other thoughts in this area, but this got long already.

    As an aside, for my personal projects, I mostly do OVH dedicated servers. Cheap and they work well. Though their management console leaves much to be desired.

    1. 16

      Having recently introduced a “please explain to me how a | is used in a bash shell” question in my interviews, I am surprised by how many people with claimed “DevOps” knowledge can’t answer that elementary question given examples and time to think it out (granted, on a ~60 sample size).

      Oh, this is a gem! It will go right next to “why stack, the memory area, has the same name as stack, the data structure” into the pile of most effective interview questions.

      1. 12

        Do these questions even work? Seriously. I remember interviewing someone who didn’t have the best grasp of Linux, shell, etc., but he knew the tools that were needed for the DevOps role and he got the job done; knowing things like what a shell pipeline is doesn’t factor in for me.

        In terms of the article itself, like I said above, people know AWS and know how to be productive with the services and frameworks for AWS; that alone is a benefit that’s hard to quantify. Sure, I could save money bringing all the servers back internally or using cheaper datacenters, but I worked at a company that worked that way. You end up doing a lot of busy work chucking bad drives, making tickets to the infrastructure group and waiting for the UNIX Admin group to add more storage to your server. With AWS I can reasonably assume I can spin up as many c5.12xlarge machines as I want, whenever I want, with whatever extras I want. It costs roughly 1/8 of a million a year. I see that 1/8 of a million as money that cuts out a lot of busy work I don’t care about doing, and that makes it simpler to find people to do the remaining work I don’t care about doing. The author says money wasted; I see it as money spent so I don’t have to care, and not caring is something I like. Hell, it isn’t even my money.
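
        As a rough sanity check on that 1/8-of-a-million figure: the hourly rate below is an assumption on my part (roughly the us-east-1 on-demand price for a c5.12xlarge), so check current pricing before relying on it.

        ```python
        # Back-of-the-envelope for the "1/8 of a million a year" figure.
        # The hourly rate is an assumption (roughly us-east-1 on-demand for
        # c5.12xlarge); check current pricing before relying on it.

        HOURS_PER_YEAR = 24 * 365
        on_demand_hourly = 2.04            # assumed $/hr for one c5.12xlarge
        budget = 125_000                   # "1/8 of a million a year"

        per_instance_year = on_demand_hourly * HOURS_PER_YEAR
        print(f"one c5.12xlarge running 24/7: ~${per_instance_year:,.0f}/yr")
        print(f"${budget:,}/yr covers ~{budget / per_instance_year:.1f} of them full-time")
        ```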

        1. 4

          I remember interviewing someone who didn’t have the best grasp of Linux, shell, etc., but he knew the tools that were needed for the DevOps role and he got the job done

          I have to admit, I’ve never interviewed devops, only engineers. And in my experience, it’s more important for an engineer to dig into fundamental processes that he’s working with, and not just to know ready-made recipes to “get the job done”.

          1. 7

            I agree completely with this statement, and I think this is exactly what the article mentions as one of the lock-in steps. That the person can “get the job done” because “they know the tools” is exactly the issue - the person picked up the vendor-specific tools and is efficient with them. But in my experience, when shit hits the fan, the blind copy-pasting of shell commands starts, because the person doesn’t understand the pipe properly.

            Now, I don’t mean by that that the commenter above you is wrong. You may still be saving money in the long run. I’m just saying that it also definitely increases that vendor lock-in.

          2. 3

            I feel like saving your company, of whatever scale, $15,000 a year per big server is worthwhile, as long as it doesn’t end up changing your working hours. I know that where I work, if I found a way to introduce massive savings, I would be rewarded for it. Shame SIP infrastructure is so streamlined already…

            1. 2

              It is optimized for precision, not recall. This question may have some positive correlation with good DevOps. It may just have a positive correlation with years of experience, and hence with good DevOps. Hard to quantify.

            2. 2

              Too bad the author didn’t specify how many is “many”. I would expect some of the interviewees not answering because of interview stress, misunderstanding the question etc.

              1. 25

                This is not an answer in vogue, but I don’t want ops people who get too stressed to be able to explain shell pipelines.

                1. 12

                  In my experience, a lot of people that get stressed during interviews don’t have any stress problems when on the job.

                  1. 6

                    Indeed. I once interviewed an engineer who was completely falling apart with stress. I was their first interview, and I could tell within minutes they had no chance whatsoever of answering my question. So I pivoted the interview to discuss how they were feeling, and why they were having trouble getting into the problem. We ended up abandoning my technical question entirely and chatting for the rest of the interview.

                    Later, in hiring review, the other interviewers said the candidate nailed every question. Strong hire ratings across the board. Had I pressed on with my own question instead of spending my hour helping them de-stress and get comfortable, we likely never would have hired one of the best I’ve ever worked with.

                  2. 7

                    I quite disagree with this, perhaps because I’m the type of person that gets very stressed out by interviews. What you’re saying makes sense if we assume that all stressors are uniform for all people, but that doesn’t really match reality at all.

                    For me, social situations (and interviews count as social situations) are incredibly, sometimes cripplingly stressful. At worst, I’ve had panic attacks during interviews. However, throughout my entire ops career I’ve worked oncall shifts, and had incidents with millions of dollars on the line, and those are not anywhere near the same. I can handle myself very well during incidents because it’s entirely a different type of stressor.

                    1. 4

                      Same in my company. All engineering is on-call for a financial system, and it’s very hard to hire someone who gets stressed out during the interview when this person would have to respond to incidents with billions in transit.

                      1. 4

                        Yep. I have a concern that in our push to improve interviewing we are overcorrecting.

                    2. 5

                      I’m helping my company interview some people in that area. We have a small automated list of questions (around 10 to 12) that we send to candidates that apply, so nobody loses time with things that we’ve agreed interviewees should know.

                      Less than 10% manage to answer questions like “Which command can NOT show the content of a file?” (given a list of grep/cat/emacs/less/ls/vim).

                      When candidates pass this test, we interview them, and less than 5% can answer questions like the author mentions, at least in a

                      1. 3

                        Kinda unrelated to the article, it was just an anecdote to say “there’s a load of people that can’t really use a classic server and need more modern IaaS to operate”.

                        For the sake of defending my practices though, I did give people 5 minutes to think of the formulation and gave examples via text of how one would use it (e.g. ps aux | grep some_name). I think the amount of people that couldn’t answer was ~2/5. As in, I don’t think I did it in an assholish “I want you to fail” way.

                        It’s basically just a way to figure out if people are comfortable~ish working with a terminal, or at least that they used one more than a few times in their lives.

                        1. 5

                          On the other hand, I can operate a “classic server”, but struggle with k8s and, to some degree, even with AWS. Although I’m sure I can learn, I simply never bothered to do so as I never had a reason or interest. I suppose it’s the same with many who were raised on AWS: they simply never had a reason to learn about pipes.

                          1. 1

                            I didn’t imply malpractice, rather statistical error. That said, anywhere close to 2/5 in conditions you described… It’s way higher than what I would expect. I didn’t hire any DevOps recently tho, so maybe I’m just unaware how bad things got.

                          2. 1

                            This is always true for interviews, but this is a measurement error that would be present for any possible interview question.

                            1. 1

                              Yeah that was my point.

                        2. 8

                          The largest railway operator in Europe, Deutsche Bahn, recently announced they finally moved completely into the cloud. This article confirms my own back-of-the-envelope calculations that they just increased their operational cost by a good amount. Cloud is really cool if you are getting started and you do not have any capital. In many other scenarios you pay a good premium for what might become a vendor lock-in over time.

                          1. 11

                            There are two things that make the cloud cheaper than alternatives:

                            • You’re outsourcing things like physical security, buying replacements, installing software updates, and so on, and so can share those costs with a load of other companies and pay a lot less than you’d pay full-time admin staff to do them.
                            • You’re paying for what you need. You’re probably going to pay more for base load than you could otherwise, but if your peak demand times are ten times higher than your baseline then your average cost is going to be a lot lower than if you had to provision enough infrastructure to cover your peak load all of the time (this is where AWS came from: Amazon had a load of spare infrastructure from buying enough servers to cover peak buying time and wanted to make money from it the rest of the time). Rough math sketched just below.
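
                            A toy illustration of the peak-vs-baseline point above, with made-up numbers; the per-unit-hour rates are assumptions chosen only to show how a 10x peak flips the comparison.

                            ```python
                            # Toy illustration of the peak-vs-baseline point. All numbers are
                            # made up: owning means provisioning for peak and paying for it
                            # 24/7; pay-per-use means a premium rate, but only for hours used.

                            HOURS_PER_MONTH = 730
                            baseline_units = 10              # capacity needed most of the time
                            peak_units = 100                 # 10x peak, a couple of hours a day
                            peak_hours_per_month = 2 * 30

                            owned_rate = 0.06                # assumed $/unit-hour when you own the gear
                            cloud_rate = 0.10                # assumed $/unit-hour cloud premium

                            # Own: build out peak capacity and pay for it around the clock.
                            owned = peak_units * HOURS_PER_MONTH * owned_rate

                            # Cloud: baseline around the clock, plus the burst only at peak hours.
                            cloud = (baseline_units * HOURS_PER_MONTH
                                     + (peak_units - baseline_units) * peak_hours_per_month) * cloud_rate

                            print(f"owned (provisioned for peak): ${owned:,.0f}/mo")
                            print(f"cloud (pay for what you use): ${cloud:,.0f}/mo")
                            ```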

                            That said, this article isn’t discussing the merits of the cloud, it’s comparing two cloud providers. I don’t know OVH, but my experience with other smaller providers is that they typically have a single datacenter and various single points of failure that can cause long periods of downtime. As a result, they’re cheaper. For my personal use, that’s absolutely fine: if I have 10 hours of downtime, it costs me absolutely nothing and I’d much rather pay less and occasionally grumble. For corporate use, 10 hours of downtime may cost more than I’m paying for a year of the server.

                            1. 3

                              OVH is one of the bigger players datacenter-wise. I’m personally running stuff on a “root” VM from netcup, with a “minimum availability” of 99.9% for one simple personal host. So yeah, of course Amazon can provide much more - but others can also be pretty well equipped or have a really good guarantee. Why buy from them if you really don’t need it at all?! I don’t think this requirement is really necessary for most of the companies moving to the “cloud”. They probably have more outages due to their own misconfigurations and errors in their own software.

                          2. 11

                            So, SourceHut is not hosted in anyone’s cloud. I own all of the hardware outright and colocate most of it in a local datacenter.

                            I just built a new server for git.sr.ht, and boy is she a beaut. It cost me about $5.5K as a one-time upfront cost, and now I just pay for power, bandwidth, and space, which runs about $650/mo for all of my servers (10+).

                            Ran back of the napkin numbers with AWS’s price estimator for a server of equivalent specs, and without even considering bandwidth usage it’d cost me almost TEN GRAND PER MONTH to host JUST that server alone on AWS.
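
                            A rough sanity check on those numbers, with one assumption of mine: a 36-month amortization window for the hardware. Charging the whole $650/mo colo bill to this single box (it actually covers 10+ machines) keeps the comparison conservative.

                            ```python
                            # Sanity check of the comparison above. The 36-month amortization is
                            # an assumption; the $650/mo actually covers 10+ servers, so charging
                            # all of it to this one box is deliberately conservative.

                            server_capex = 5_500            # one-time build cost for the new server
                            colo_monthly = 650              # power, bandwidth, space (ALL servers)
                            amortization_months = 36        # assumed hardware lifetime

                            owned_monthly = server_capex / amortization_months + colo_monthly
                            aws_monthly = 10_000            # quoted estimate for an equivalent instance

                            print(f"owned: ~${owned_monthly:,.0f}/mo vs AWS: ~${aws_monthly:,.0f}/mo "
                                  f"(~{aws_monthly / owned_monthly:.0f}x)")
                            ```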

                            AWS is how techbro startups pile up and BURN their investor money.

                            https://cmpwn.com/@sir/103496073614106505

                            1. 4

                              Granted, I think comparing on-premise vs IaaS is a bit unfair, since getting good latency around the globe and a large enough bandwidth becomes difficult.

                              That being said, a few years back I did have an on-premise infrastructure, it was with the first generation Ryzen CPUs and (then kinda new) NVMes + InfiniBand… and by golly it ran beautifully. I think that the monthly AWS cost for equal performance would have been at least 0.5x what it took us to build it (and back then AWS had no NVMes either).

                              But comparing against on-premise wouldn’t be exactly fair.

                              1. 1

                                But comparing against on-premise wouldn’t be exactly fair.

                                (Assuming you count “local datacenter” as on-prem.) Maybe, but just up to a point. You mentioned in the article that for most people Amazon is cheaper or free, and then 100k a year is affordable. When you get to the scale of $100k a year to AWS, which is approximately Drew’s number here, you’re orders of magnitude cheaper doing it yourself. Which I think was kind of the point of your article.

                                For such small companies, global edge nodes don’t make much of a difference (except maybe a CDN), and as they grow, they could still add edges in the appropriate locations. Hell, Netflix brings their boxes to the individual ISPs and hooks them up directly. But to get started with it, and even operate a neat operation, you would be cool with 1 box in 1 DC, like SourceHut is.

                                And let’s leave aside both the reason why Drew doesn’t want to go to the big players, and the fact that neither he nor a lot of companies in this range - the ones needing a small rack equivalent to 10k a month in AWS services - really, really, really have tens of millions of investor money to burn.

                                I am aware that it’s all a numbers game and that such companies don’t care - they’re in the race to get those millions, and dealing with hardware slows them down, so in that case the comparison is likely not fair.

                              2. 4

                                Around my area it seems that most of the smaller colos have closed shop, or been bought out by bigger players (like Flexential). Presumably many smaller customers have moved to “the cloud”, so they closed or sold out.

                                There are several rather large colos, but they are all catering to very large clients (Nike, Akamai, Microsoft, etc.), and have the “call us for a quote” stuff and only want to talk to big players.

                                I remember back in the day colo seemed far more common and attainable for smaller footprints (eg. a 6u or 12u half-cabinet).

                              3. 6

                                I disagree that the OVH to AWS comparison is apples to apples. You’re comparing a non-redundant SSD drive to EBS, for starters. You’ll need to price out a SAN to be fair.

                                There are other price factors at play here that aren’t immediately apparent. I work for a modestly-sized managed hosting platform. We originally launched on Digital Ocean, but at the time the higher-end servers were too expensive, so we expanded onto Linode for those. That has widely been regarded as a bad move, as the difference in reliability of the two platforms makes Linode more expensive for us once you factor in all the support costs.

                                1. 3

                                  I disagree that the OVH to AWS comparison is apples to apples. You’re comparing a non-redundant SSD drive to EBS, for starters. You’ll need to price out a SAN to be fair.

                                  The 150% more expensive for less compute and RAM quote, though, doesn’t include the EBS cost, for that exact reason. I wanted to be as charitable as possible. That’s why I chose not to include EBS cost, to have the AWS machine be ~2/3 as good, and to compare reserved for 1 year vs per month. AWS had leeway compared to OVH in every single metric and the numbers still came out with AWS 150% higher. If I hadn’t given that leeway the numbers could arguably be 300-500% higher.

                                  We originally launched on Digital Ocean, but at the time the higher-end servers were too expensive, so we expanded onto Linode for those. That has widely been regarded as a bad move, as the difference in reliability of the two platforms makes Linode more expensive for us once you factor in all the support costs.

                                  Availability is indeed an issue, but at the end of the day neither of the two providers has an amazing SLA for those machines.

                                2. 5

                                  I just want to point out that the OP is comparing the OVH server with on-demand EC2 instance prices. IMHO a better comparison would be a Reserved Instance, lowering that price to somewhere around $14k/year (see ec2info).

                                  1. 5

                                    I was comparing with reserved for 1 year (I believe), when I quoted the 26k/year number. That’s why I said that OVH basically had the edge for being monthly instead of yearly.

                                    1. 1

                                      Thanks, and apologies. I did not see where you got the info, and went for the classic 3-year upfront in Virginia. Ignore me.

                                    2. 3

                                      The Paris region is a little more expensive. For the r4.16xlarge, it looks like if you pay in full up-front it’s $26k if reserved for 1 year, $17k if reserved for 3 years (about +4k on the US Virginia prices).

                                      The prices used in the article are the paid up-front one year prices ($25,771) and I don’t know where the “paid hourly” ($37,282) price comes from.
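
                                      One guess at where that figure could come from - assuming r4.16xlarge listed at about $4.256/hour on-demand in us-east-1 at the time (my assumption, not something the article states) - is simply the hourly rate annualized:

                                      ```python
                                      # Guess: the "paid hourly" figure looks like an annualized on-demand
                                      # rate. $4.256/hr is an assumed us-east-1 on-demand price for
                                      # r4.16xlarge at the time, not a figure from the article.

                                      hourly = 4.256
                                      hours_per_year = 24 * 365
                                      print(f"${hourly}/hr * {hours_per_year} h = ${hourly * hours_per_year:,.2f}/yr")
                                      # -> $37,282.56/yr, close to the $37,282 quoted
                                      ```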

                                    3. 4

                                      This is kind of a weird article for me (no offense to the author). Of course AWS is expensive, everyone knows that. But you get access to a wide variety of managed services, including storage, databases, networks, and thousands of third-party products - all of which you can use without OKing a new vendor through your company’s billing department, which might take months. If you start comparing the costs of managing all those to the cost of staff time, it’s a bargain. (Perhaps the author has never worked at a typical slow-moving corporation.)

                                      The real question in my mind is whether the big cloud providers will all get to feature parity and start aggressively competing with each other on price. That’s what happened with plain virtual server hosting years ago, and happened to physical servers many years before that. I think it will happen eventually, but whether it happens in 2025 or 2045 is anyone’s guess.

                                      1. 4

                                        Employees end up deciding most of what a company is using internally, including infrastructure providers.

                                        I want to add something very powerful and important to the employee lock-in story - looking cool and professional (The “looking” part being key here). You have two driving forces at play:

                                        • Unmotivated employees just wanting to “get the job done” while minimizing risk. To paraphrase the old IBM saying “Nobody got fired for buying AWS.”
                                        • Motivated, but inexperienced, employees that want AWS technologies on their CV and are after that Cloud Solutions Architect job title. Hats off to the Amazon marketing department for coming up with the AWS sysadmin/developer certifications.
                                        1. 1

                                          Hats off to the Amazon marketing department for coming up with the AWS sysadmin/developer certifications.

                                          This isn’t exactly new, I believe.

                                          As in, IBM, Oracle and Red Hat had similar certifications, correct?

                                          1. 5

                                            Not new at all. It is a great marketing trick, and it works brilliantly every time.

                                            Oh, how proud was I when I became a MICROSOFT CERTIFIED SOLUTIONS DEVELOPER back in the early days of .NET 🤡

                                            I would never consider any other hosting or development platform back then.