  1. 21

    I wanted to leave a hatted comment in the mod log, and there happens to be a usefully ironic story at #1 on the homepage I can use. We just had a couple hours of downtime because the disk on our one (medium VM) server filled with logs. It looks like logrotate broke a few months ago. Sorry for the outage. As a reminder, the chat room is also where we post status updates during outages.

    1. 14

      Not big enough

      1. 1

        Womp womp.

      2. 18

        Most companies are not using cloud as a replacement for colo. RDS, SQS, S3, managed elasticsearch, etc are really really valuable and difficult to replicate on your own. Of course the cloud vendors want to lock you in to these services and then overcharge you for the basics, just like some grocery stores lure you in with cheap specialty foods and then overcharge for bread and milk. It doesn’t mean it’s a bad deal though.

        1. 18

          RDS and S3 are standouts in part because the lock-in is operational, not architectural.

          You can develop against vanilla PostgreSQL, deploy on RDS, then change your mind – or at least threaten AWS with a change at contract renewal time – and switch to Fly.io’s managed Postgres. (Or any of the other excellent hosted offerings.) Or go “on-prem”, “edge”, etc. (I.e., run your own servers.)

          S3 was a moat but the API + domain model are now available from your choice of vendors, including MinIO if you want to roll your own.

          I’m far more suspicious of applications that make heavy use of SQS, DynamoDB, etc. without having a really strong proof they need that scale and all the distsys pain it brings. You can get a long way on Celery (or your choice of “worker” tools) running batch jobs from your monolith against a “queue” table in Postgres. IME most projects/companies fail long before they outgrow the “COSP” threshold.
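
          A minimal sketch of that “queue table” idea, assuming a hypothetical jobs table you create yourself (psycopg2; table, columns, and connection string are all placeholders):

          ```python
          import psycopg2

          conn = psycopg2.connect("dbname=app")  # placeholder connection string

          def claim_next_job():
              # FOR UPDATE SKIP LOCKED (Postgres 9.5+) lets many workers poll
              # the same table without blocking on each other's in-flight rows.
              with conn, conn.cursor() as cur:
                  cur.execute("""
                      UPDATE jobs
                         SET state = 'running'
                       WHERE id = (SELECT id FROM jobs
                                    WHERE state = 'pending'
                                    ORDER BY created_at
                                    FOR UPDATE SKIP LOCKED
                                    LIMIT 1)
                      RETURNING id, payload
                  """)
                  return cur.fetchone()  # None when the queue is empty
          ```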

          For cost management, disaster recovery and business continuity, and the ability to work + test your systems offline, I think minimal cloud provider API surface in your application is a Good Thing. That + “don’t create a ton of microservices” (also good advice in most cases) usually implies: monolith + one big backend database + very select service extractions for e.g. PII that shouldn’t sit in the main DB.

          1. 3

            I think you nailed it here:

            the lock-in is operational, not architectural.

            1. 1

              You can develop against vanilla PostgreSQL, deploy on RDS, then change your mind – or at least threaten AWS with a change at contract renewal time – and switch to Fly.io’s managed Postgres.

              How does this work with security, though? Fly.io’s managed Postgres is going to be open to the internet, presumably, whereas in AWS I can control (and log) network access as I see fit.

              1. 3

                fly.io postgres is very much not open to the internet unless you need that for some reason.

                1. 2

                  I think Fly has a pretty good story here, actually: https://fly.io/docs/reference/private-networking/

                  But really, any managed DB vendor is going to have better network controls than “just use pg_hba.conf”. Most even offer AWS VPC bridging.

                  1. 1

                    Thanks for the link. I was maybe thinking of Supabase when I wrote the comment. Like, if the business is providing managed databases but no compute, then doesn’t the database basically have to be open to the internet so the backend servers can reach it? E.g. talking to Supabase from Vercel or Netlify? Or can something clever be done with e.g. WireGuard to secure it all?

                    1. 1

                      There are a few approaches that services like this take. Sometimes they provide access over a VPN (e.g. through WireGuard; this is what Fly.io managed Postgres does if you connect from a non-Fly.io service, and how you connect to private RDS databases from outside AWS), and sometimes they do just have a database listening on an Internet IP/port (maybe secured by some IP whitelisting, usually secured by TLS, and definitely secured by username/password authentication; this is what DigitalOcean managed databases, Supabase direct connections, and public RDS databases do).
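
                      For the direct-connection case, a minimal client-side sketch of “secured by TLS + username/password” (psycopg2; hostname and credentials are placeholders):

                      ```python
                      import psycopg2

                      conn = psycopg2.connect(
                          host="db.example.com",          # placeholder managed-DB hostname
                          port=5432,
                          dbname="app",
                          user="app_user",
                          password="...",                 # from the provider's dashboard
                          sslmode="verify-full",          # require TLS, verify the server cert
                          sslrootcert="provider-ca.pem",  # CA bundle the provider publishes
                      )
                      ```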

              2. 3

                I guess it goes without saying that if you

                • need 99.99+% uptime and want to sue/blame somebody big otherwise
                • need a distributed database for a ton of access that “Just works”
                • want a “familiar” stack where you can just slap some specific product of the three letter company as a requirement in the job description

                … then go to the big cloud providers and pay your premium (be aware of the network and database per-operation fees), you already made up your mind.

                But I’d bet those are maybe 1% of the customers.

                1. 9

                  need 99.99+% uptime and want to sue/blame somebody big otherwise

                  I haven’t checked in a while, but I’ve never seen a cloud service actually meet this 99.99+% uptime. I don’t think any of them are very transparent about their historical outages anymore as they realized they weren’t having good uptime performance.

                  I checked a few years ago for $WORK, when some boss type wanted to move to the cloud: I compared all the data I could gather from the various cloud providers, and we handily beat them in uptime and total cost across time. I think I went back 5-ish years at the time, though I can’t seem to find that spreadsheet at the moment.

                  I agree there are valid reasons to move, but I would never blindly recommend switching dedicated stable compute to the cloud. Bursty compute, however, is a perfect fit for the cloud, and easy to recommend.

                  1. 1

                    I’m always worried about comparisons of uptime between someone’s single company and the big clouds. AWS will have both more issues and more varied ones, but they’ll often be limited in scope. It’s hard to compare it to a smaller local setup without a list of specific risks and expected time to recovery. At an extreme, the box under my desk at home has had 100% uptime over the last few years, but I wouldn’t make decisions based on that.

                    1. 3

                      I agree a single company’s uptime comparison vs. cloud providers isn’t very useful to outsiders, but it can be useful in that single company’s decision making. That’s why we did the comparison.

                  2. 14

                    need 99.99+% uptime and want to sue/blame somebody big otherwise

                    More importantly, don’t want to pay for in-house expertise to manage the systems when it is not part of their core competency. For smaller companies, they often need 10% of a very qualified sysadmin. They can either hire a full-time one for 10x the price of what they actually need, or outsource to a cloud provider and, even if the markup is 100%, be paying 80% less.

                    need a distributed database for a ton of access that “Just works”

                    The ‘Just works’ bit is far more important here than the ‘distributed’ or ‘ton of accesses’ part, because it translates to not having to pay an administrator.

                    want a “familiar” stack where you can just slap some specific product of the three letter company as a requirement in the job description

                    Again, this is a cost-saving thing. It’s much easier to hire in-house talent or to outsource a particular project to a (small or large) company if the infrastructure that they’re building on is generic and not something weird and bespoke that the developers would need to learn about.

                    In a huge number of cases, the cost of the infrastructure (cloud or on-prem) is tiny in comparison to the cost of the people to manage it. Using the cloud lets the provider amortise the cost of this over millions of customers and pass on a big chunk of that saving to you.

                    Buying a big server has a few drawbacks. If any hardware component fails, then you need to RMA that part, which means you need either an expensive support contract or you need someone on staff who is competent to identify the faulty component and send it back. If a cloud server fails, then your VM is restarted on another machine. If you are using PaaS offerings then someone else is responsible for building a platform that handles hardware failure and you don’t even notice.

                    If you want separate test and production versions, then you need at least two of those big servers, whereas with even IaaS offerings it’s trivial to spin up a clone of the production server for a test deployment on a different vnet. If you’re using PaaS then it’s even easier, and in both cases the number of test instances can easily scale with the number of developers.
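
                    As a rough sketch of that “clone production for a test deployment” flow (boto3, AWS terms; all IDs are placeholders):

                    ```python
                    import boto3

                    ec2 = boto3.client("ec2")

                    # Snapshot the production instance into an AMI...
                    image = ec2.create_image(InstanceId="i-0123456789abcdef0",
                                             Name="prod-clone-for-testing")
                    ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

                    # ...then launch the clone onto a separate test subnet.
                    ec2.run_instances(ImageId=image["ImageId"], InstanceType="m5.xlarge",
                                      MinCount=1, MaxCount=1,
                                      SubnetId="subnet-0abc1234")  # placeholder test subnet
                    ```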

                    TL;DR: If you think the cost of the hardware is important then either you’re thinking about massive deployments or you completely misunderstand the economics of this kind of thing.

                    1. 13

                      In my experience, the companies I have worked for tend to at least double their spend when moving from dedicated to cloud, for little added benefit and almost exactly the same maintenance burden. In one case, a company I worked for went from £3,200/year for a managed 24-core/112GB RAM dedicated box, with a 1-hour SLA on having a tech at the datacenter make changes/do maintenance/etc., to ~£1,400/month for far less resource. Except now they had to handle the server changes/maintenance in house on top of managing the cloud infra, which actually required hiring someone new.

                      For my own company we rent two dedicated boxes (16-core/64GB RAM each) at a total cost of £108/mo, which provides more than enough capacity, and in the past six years has had five-nines uptime while costing a fraction of what it would have to go with cloud.

                      1. 1

                        now had to handle the server changes/maintenance in house

                        I’m not sure I understand. What server maintenance are you doing for cloud-based servers that’s comparable to the dedicated one?

                        with 1 hour SLA on having a tech at the datacenter make changes/do maintenance/etc

                        That’s 1h SLA to having someone look at the issue, not for a working replacement, correct?

                      2. 11

                        A couple of nits, directly:

                        More importantly, don’t want to pay for in-house expertise to manage the systems when it is not part of their core competency.

                        I would argue that managing systems is a core part of developer competency, and I’m tired of people acting like it’s not–especially when those people seem to be frequently employed by companies whose business models depend on the meme of systems administration being some black art that can only be successfully trusted to the morlocks lurking in big data centers.

                        Using the cloud lets the provider amortise the cost of this over millions of customers and pass on a big chunk of that saving to you.

                        This is manifestly not what’s happening, though, as we’re seeing. The savings are being passed on to the shareholders–and if they aren’t, we should all be shorting MSFT and AMZN!

                        If you want a separate test and production version, then you need at least two of those big servers

                        Or, you know, you host both things on the same box under different VMs, or under different vhosts. This has been a problem with a well-known solution since the late 90s (though sadly not reliably applied).

                        you completely misunderstand the economics of this kind of thing.

                        Well…

                        • We’ve seen figures in this very thread of at least a 2x price increase using cloud providers.
                        • The option typically doesn’t exist to not have a sysadmin–we just hire “devops” people now, who are okay sysadmins who also tend to spend most of their time functioning as embedded salespeople for the vendor of their preferred stack. We’re out a six-figure salary regardless.
                        • If your team opts not to have a sysadmin (!), a bare metal or rented dedi is a lot easier to understand and maintain since it basically looks like a developer machine–just beefier and with actual paying customers on it.

                        I submit that perhaps we aren’t the only ones who misunderstand the economics. :)

                        ~

                        To be clear, there are some things like S3 that I just cannot be arsed to host. Hosted Postgres is nice when you don’t want to bother setting up metrics and automatic backups–but then again, I’m pretty sure that if somebody wrote a good script for provisioning that or a runbook then the problem would go away. It’s also totally fine to keep a beefy machine for most things and then spin off certain loads/aspects to cloud hosting if that’s your kink.

                        Remember, there was a time when the most sensible thing was to send your punchcards and batch jobs down to the IBM service bureau, because it was more economical. These things go in cycles.

                        1. 8

                          Addendum, reading back over this:

                          The more I think about this, the more I suspect the bigger issue is that running your own infra requires some continuity of ownership and knowledge–and that is difficult in an industry where average tenure is something like less than two years at startups.

                          1. 2

                            Most of my career so far has been, essentially, cleaning up somebody else’s historical mistakes by paving over them with my soon-to-be historical mistakes. An endemic part of the problem is always that very specific and arcane parts of the system are forgotten, or stop being understood, as the flow of brains does its thing. I used to be in camp “rewrite”, a decade ago. I’m now firmly in the camp “nooooooooo, fix it, please don’t do this to me, please please please fix it”

                            1. 2

                              I’m honestly dumbstruck by how obvious this is once it’s pointed out explicitly.

                              Even when I started out 15+ years back, I had the distinct impression that traditional “ops” roles tended to have far higher average tenures than developer roles.

                            2. 3

                              I would argue that managing systems is a core part of developer competency

                              I am not talking about developers, I am talking about companies. Most big cloud customers are not software companies, they are companies that have some in-house infrastructure that is a cost centre for their business: it is a necessary cost for them to make money, but it is not the thing that they make money from. They may employ some developers, but managing infrastructure and writing code are different (though somewhat overlapping) skill sets. Importantly, developers are not always the best administrators and, even when they are, time that they spend managing infrastructure is time that they are not spending adding features or fixing bugs in their code.

                              For a lot of these companies, they outsource the development as well, so the folks that wrote the code are contractors who are there for a few months and are then gone. An FTE sysadmin is a much higher cost.

                              This is manifestly not what’s happening, though, as we’re seeing. The savings are being passed on to the shareholders–and if they aren’t, we should all be shorting MSFT and AMZN!

                              That doesn’t follow. If it costs 100 times as much to manage 1000 machines as it does to manage one, then a company that passes on half of the saving to their customers will still be raking in cash. The amount that it costs to maintain a datacenter of a few tens of thousands of machines, with a homogeneous set of services running in large deployments across them, is vastly less than the cost of each customer maintaining their own share of that infrastructure.

                              We’ve seen figures in this very thread of at least a 2x price increase using cloud providers.

                              The numbers I’ve seen there are comparing hardware cost to hardware cost, which ignores the bit that’s actually expensive. They’re also talking about IaaS, which does not get most of the savings. And they’re talking about companies with steady-state loads, which is where IaaS does the worst. Renting a 64-core server is probably more expensive than buying one (a cloud vendor will pay less for it by buying in bulk, but that’s not a huge difference, and they want to make a profit). The benefit that you should get from IaaS is that you can move between a 2-core server and a 64-core server with a single click (or script) so that you can scale up for bursts. If you are a shop with a trickle of sales across the year and 100 times as many on Cyber Monday, for example, then you might need a 64-core system for 2 days a year and be happy with a 2-core machine the rest of the time. Comparing buying and renting a 64-core machine for the entire year is missing the point.
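
                              To make the burst argument concrete (illustrative numbers only, not real prices):

                              ```python
                              # Hypothetical hourly rates; the point is the shape, not the figures.
                              hourly_2core, hourly_64core = 0.10, 3.20
                              burst_hours = 2 * 24  # two peak days a year
                              steady = hourly_2core * (8760 - burst_hours) + hourly_64core * burst_hours
                              always_big = hourly_64core * 8760
                              print(f"burst-sized: ${steady:,.0f}/yr vs 64-core all year: ${always_big:,.0f}/yr")
                              # -> burst-sized: $1,025/yr vs 64-core all year: $28,032/yr
                              ```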

                            3. 2

                              Not just small companies. Larger companies often have terrible tech ops. Moving to ops as a service can be a way to fix that, though there is the danger that your existing ops people and processes will infect what you do in the cloud and either prevent you from getting the advantages or even make it worse than what you had.

                            4. 5

                              Interesting, it didn’t occur to me that only 1% of customers would want good uptime they’re not responsible for, a reliable database, and an easy-to-match watchword for hiring.

                              1. 4

                                I’ve got a 99.99% SLA on some tiny box at some irrelevant hoster in Germany, with a downtime of 1 hour in 10 years, when the whole box died (it was back up within an hour on another system). So you could say I’ve got my 99.99% without any failover.

                                If that’s possible for a normal company with only some KVM + guaranteed CPU, RAM and bandwidth, you may not need the big(tm) cloud for that same hardware.
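
                                Back-of-the-envelope on that claim (one hour of downtime in ten years):

                                ```python
                                hours_in_ten_years = 24 * 365.25 * 10  # ~87,660 hours
                                availability = 1 - 1 / hours_in_ten_years
                                print(f"{availability:.4%}")           # 99.9989% -- beyond four nines
                                ```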

                                1. 4

                                  I have seen far more (and longer) outages caused by messing up with cloud systems than by hardware failure.

                                  Some examples I have personally seen:

                                  • Autoscaling policies based on CPU load / memory causing outages when load patterns shift
                                  • Brief but frequent “elevated error rates” caused by insufficient wait periods on scale-in events
                                  • Network speed degradation in AWS causing application outages
                                  • Cron-triggered script to terminate/delete un-tagged resources (to ensure people were tagging things for cost control purposes) ran during an outage of the AWS tagging service. All resources were reported as un-tagged and 30% of instances were terminated before it killed the instance it was running on.
                              2. 2

                                need 99.99+% uptime and want to sue/blame somebody big otherwise

                                Many companies, and even just clubs and such, had that kind of uptime long before cloud providers were even a thing, and if you look at the guarantees from cloud providers you will generally not find more than what most companies provide. While cloud providers have more staff, they also have far more complexity than smaller companies, which brings its own kinds of outages. Every now and then you hit limitations of managed services, or need to upgrade because they decided to change something, which can be less plannable than in your own company. And good luck if you hit some bug caused by the particulars of how you use the service and have to go through layers of support lines, unless you are really big – big enough to easily do most stuff in-house.

                              3. 2

                                I set up Elasticsearch on physical machines ten years ago and it was fairly trivial; I think early on that was one of their main selling points. We even helped a very big bank set it up on their infrastructure. When we came over to discuss any remaining topics, they were done and had built their own orchestration around it. Fun fact: they basically built their own Nomad/Kubernetes, and I think it was largely shell script (not completely sure though!). I don’t know how it is these days.

                                S3 is pretty easy to replace, and low maintenance, with things like MinIO and SeaweedFS.
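
                                For what it’s worth, a self-hosted MinIO speaks the same API as S3, so the standard client just needs an endpoint override (a sketch using MinIO’s default dev port and credentials):

                                ```python
                                import boto3

                                s3 = boto3.client(
                                    "s3",
                                    endpoint_url="http://localhost:9000",  # MinIO's default API port
                                    aws_access_key_id="minioadmin",        # MinIO's default dev credentials
                                    aws_secret_access_key="minioadmin",
                                )
                                s3.create_bucket(Bucket="backups")
                                s3.put_object(Bucket="backups", Key="hello.txt", Body=b"same API as S3")
                                ```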

                                And also, if you ever run any serious setup where you (think you) need the cloud, you will certainly end up troubleshooting issues in the managed services, but only after scraping together enough evidence that it’s their issue. Even more fun when you have to go through their partners first. So you need people who are both experts in the field and experts with the particular cloud provider. In other words, in any situation where you think you might actually need cloud providers, you certainly need people who could easily set things up on their own. And that is why you can make a ton of money in DevOps jobs, if you like doing that. There’s always need.

                                But even if you happen to never run into any of these problems: you usually need experts for the technologies you use long before your standard server setup comes close to limiting you, and it’s usually not clear-cut how much they need to know, so they will certainly know how to run these technologies. Again, that’s if you don’t run into any issues with your cloud provider’s setup, and at some point that will happen, even with Amazon and Google. After all, they also run physical hardware, have tons of management infrastructure that can itself have bugs, and hit situations that their monitoring doesn’t detect.

                                The biggest thing is that you can blame them, but then you need to be able to prove it, which can be really hard at times, especially if you don’t know their setup.

                                I think there are a lot of “right-sounding” things said about cloud computing that typically aren’t inherently wrong, but at best apply to practical reality only to a certain degree, and cloud providers would be stupid not to make statements based on them; people wanting DevOps jobs, doing consulting, or selling books do the same. I think it’s rarely intentional, though. It’s just easy to make a generic, true-ish statement to justify what we do. But that goes into psychology.

                                1. 2

                                  That’s the thing. There are a small number of companies whose domain/problem space is such that they can 100% avoid lock-in by treating cloud instances strictly as VMs and running all their own services, but as your needs grow that can be SUPER hard to maintain without a sizable investment in engineering, which not every company is willing to make.

                                  Maybe they should? But they aren’t.

                                2. 9

                                  Each CPU core is substantially more powerful than a single core from 10 years ago.

                                  What also changed substantially in 10 years is software. I would expect Rust to be several times faster than the Java of old and, unlike C++, to be a somewhat reasonable option for actually implementing HTTP backends.

                                  I think at some point we might get a trend of a fast Rust program on a single beefy machine, as a much cheaper to develop alternative to cloud architectures. Rust’s hard, and writing performant code is hard, but not as hard as distributed systems.

                                  If we do get such single-box applications, we might see some innovation in the database world: it would be a good opportunity to implement a new embedded database with first-class support for streaming updates to a backup server, and maybe to try a “better is better” implementation of the relational model.

                                  1. 5

                                    I suspect that there are still significantly more web applications running on Java than on Rust. While Rust has improved the landscape in some ways, it seems unlikely to me that it has made much of a dent in what is actually running in the world.

                                    1. 6

                                      I can’t imagine anybody wanting to write application code in Rust unless they have extremely niche performance needs.

                                      1. 4

                                        Like, infinitely more :) I wouldn’t expect the average application to move towards a single-box architecture. But I wouldn’t be surprised if we see a non-trivial number of deliberately non-distributed applications in the future.

                                        1. 6

                                          There are tons of Java backends happily running on a single Tomcat instance. That’s been in many ways the reference architecture for “three-tier-ish” Java projects since the early 2000s, and I suspect the number (if not total scale) of deployments absolutely swamps the horizontally-scalable “SOA” designs in the wild.

                                          1. 4

                                            Yeah, if we wanna discuss what’s taking Java share, better look at Go. And even there, it’s still very little.

                                          2. 4

                                            This isn’t a controversial statement. There are orders of magnitude more Java applications than Rust applications out there. If anyone honestly thinks otherwise, I’d like to see any justification for that.

                                          3. 4

                                            I would expect Rust to be times faster than Java of old, and, unlike C++, to be a somewhat reasonable option to actually implement HTTP backends.

                                            No, there is a real reason Java became dominant in server land. Java, even Java of old, gets close to native performance - the problem Java has is interactive apps, as interactive apps are more pause sensitive, and do not run for long.

                                            The codegen itself is on par with C/C++; the only penalties it suffers over them are bounds checks and GC, and for the kind of software that runs server-side those penalties are negligible.

                                            1. 4

                                              I don’t think codegen is the thing to focus on in Java/C++ comparisons. The much bigger difference is in data layout in address space: Java has way more indirection, and needs much more RAM (and L1).

                                              I think both of these are true:

                                              • Java has plenty of performance for web
                                              • On any given box, Rust will handle significantly more connections
                                              1. 1

                                                I am not sure that’s true; I definitely don’t think it would be “significantly more” in a practical workload. I could create inefficient Java memory layouts and use objects in a way that prevents the JVM from doing any devirtualisation, but I’m likely not going to.

                                            2. 2

                                              You don’t have to go all the way to Rust. Writing web apps in Rust is not simple. You can choose other frameworks that are very performant and compiled, but simpler to use – for example Crystal + Amber/Lucky.

                                              1. 9

                                                or even Go…

                                            3. 3

                                              One area where I’ve experienced the annoying sides of “on-demand compute” is CI. Nothing is cached effectively in our pipelines, because when a job is created it’s just sent to some random cloud machine. Then the job has to download a Docker image and install all the dependencies from scratch. Yes, reproducible, deterministic builds and all that, but… honestly, my computer in the office, which already has the base image, all the cached layers, and the NPM/Go dependencies, is ~600% faster than cloud CI. Which is why I’m very happy self-hosted job runners exist for GitHub/GitLab.

                                              1. 1

                                                Run two, have failover. Sadly things like CARP aren’t possible for a lot of colos/server hosts.

                                                1. 1

                                                  On CDNs:

                                                  I wonder what it would take to create something like a simple-to-set-up CDN solution. Not to match the big CDNs, but to handle the most common use cases of a CDN: keeping (relatively) static data – in the simplest case just the classics, like JS, images, etc. – closer to the user. The software is certainly out there, but I think a more integrated and standard way of setting things up could help provide a fallback, or simply something for smaller or personal projects.
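
                                                  As a toy sketch of the idea – a tiny caching proxy you could drop on a VPS near your users (stdlib only; the origin URL is a placeholder, and there’s no cache expiry):

                                                  ```python
                                                  from http.server import BaseHTTPRequestHandler, HTTPServer
                                                  from urllib.request import urlopen

                                                  ORIGIN = "https://example.com"  # placeholder origin server
                                                  cache = {}                      # path -> (content type, body)

                                                  class EdgeCache(BaseHTTPRequestHandler):
                                                      def do_GET(self):
                                                          if self.path not in cache:
                                                              with urlopen(ORIGIN + self.path) as r:
                                                                  ctype = r.headers.get("Content-Type", "text/plain")
                                                                  cache[self.path] = (ctype, r.read())
                                                          ctype, body = cache[self.path]
                                                          self.send_response(200)
                                                          self.send_header("Content-Type", ctype)
                                                          self.send_header("Content-Length", str(len(body)))
                                                          self.end_headers()
                                                          self.wfile.write(body)

                                                  HTTPServer(("", 8080), EdgeCache).serve_forever()
                                                  ```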

                                                  1. 1

                                                    I like the pattern of structuring a single big service on a single machine as a bunch of local subservices. That way you can have teams working on and maintaining different subservices as standalone git/GitHub projects. Typically we’ll keep everything in a single repo until it makes sense to move a subservice to its own git repo. You can spin them up with a script, plug them into systemd, or even run some container orchestration locally, if you want that extra pain. They all talk to each other via an API – any medium works: GraphQL, REST, gRPC, etc. – and the communication happens over the local network. No web app, or whatever, communicates directly with the subservices; it’s just request/response with the service. An added bonus: if you want to spin out one of the subservices to a different machine (a microservice), you can do that, as in the sketch below.
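
                                                    A toy sketch of the shape (stdlib only; the “billing” subservice, port, and path are made up). The subservice binds to loopback so nothing off the box can reach it, and the main service is the only caller:

                                                    ```python
                                                    import json
                                                    import threading
                                                    from http.server import BaseHTTPRequestHandler, HTTPServer
                                                    from urllib.request import urlopen

                                                    class BillingSubservice(BaseHTTPRequestHandler):
                                                        def do_GET(self):
                                                            body = json.dumps({"invoice": 42}).encode()
                                                            self.send_response(200)
                                                            self.send_header("Content-Type", "application/json")
                                                            self.send_header("Content-Length", str(len(body)))
                                                            self.end_headers()
                                                            self.wfile.write(body)

                                                    # Loopback-only: invisible outside the machine.
                                                    srv = HTTPServer(("127.0.0.1", 8081), BillingSubservice)
                                                    threading.Thread(target=srv.serve_forever, daemon=True).start()

                                                    # The main service calls it like any remote API -- which is exactly
                                                    # what makes spinning it out to another machine later a small change.
                                                    with urlopen("http://127.0.0.1:8081/invoices/latest") as resp:
                                                        print(json.load(resp))
                                                    ```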

                                                    1. 1

                                                      I’m a little confused as to why you’re describing microservices here?

                                                      1. 2

                                                        I was describing a single ‘server’ on a single ‘machine’. Sometimes we talk about a server as a physical thing, while other times as a literal program that is a server. The article is basically talking about a server as if it were rooted in hardware – a single-machine server; hence the description of renting a small server with all the specs in said article. I was saying you can break up your service on a single machine if it makes sense, and you can rent a single server to do it. You don’t have the overhead costs and complexity of an actual microservice architecture distributed across the internet, but you keep the benefits as far as modularity goes. The idea of microservices implied in the article, and as generally understood informally, is one that is distributed across physical space, with each service having its own CPU, memory, storage, and network resources, literally or virtually, as in the cloud. Again, I was describing a single machine.

                                                        1. 3

                                                          I see, thanks for explaining

                                                          You don’t have the overhead costs and complexity like an actual microservice distributed across the internet.

                                                          I think microservices on a single machine still come with a lot of added complexity, though. How do you debug/troubleshoot across services? How do you deploy and do orchestration? Etc. Obviously more distribution is even more complicated, but microservices are complicated at any scale.

                                                          1. 2

                                                            Agreed. A locally distributed service is more complicated than a single standalone service. However, there are benefits to the modularity, and it makes it easier for teams to work with less coordination – just don’t break the API. So it’s not extra cost for no gain. And it’s not as bad as a proper microservice architecture at scale, sprawled all over the internet – an order of magnitude (I think several) more complicated than locally hosted microservices. For example, network failures or major lags are far less likely to happen between services on the same local network. And once you get used to an orchestration pattern, it’s pretty easy to implement it going forward. You can use systemd, bespoke scripts, or some other orchestration framework. To each their own, I suppose; it also depends on the product to be delivered.

                                                    2. 1

                                                      I’m sympathetic to this point of view, but I think the real message is hidden here.

                                                      My Workload is Really Bursty

                                                      Cloud away.

                                                      Although it obviously varies, I would say almost every workload I’ve encountered is pretty bursty.

                                                      I think a self-hosted architecture with cloud “accelerators” makes a lot of sense, and people have done that. Although networking is an issue, etc.

                                                      A related issue is that growth is unpredictable, and having cloud resources can be handy there (although the effort of writing more software to use them shouldn’t be underestimated).

                                                      For better or worse, I would say that it doesn’t make sense for many apps to be strictly colo’d now. I suppose you may have internal web UIs that are low-traffic and rarely used, but there are some advantages to hosting those in the cloud too.

                                                      1. 2

                                                        Although it obviously varies, I would say almost every workload I’ve encountered is pretty bursty.

                                                        I am curious about what workloads you encountered, what “pretty bursty” means, and what the scale is. I completely agree with that, but unless you are really big (as in, nearly everyone has heard of your company or product name) or you burst at least 10x or so, you might still have an easy time, as the article says, provisioning infrastructure for peak loads, simply because of the various cost savings.

                                                        A related issue is that growth is unpredictable and having cloud resources can be handy there (although writing more software to use them shouldn’t be underesimated)

                                                        Also, I’ve personally experienced a situation where I should have used something like Hetzner or OVH, because the cloud didn’t allow me to scale when I badly needed to. Specifically, GCS ended up not allowing our quota to be raised, seemingly because they were out of resources, and the contract didn’t allow the company I worked for to just switch to another country, so switching region wasn’t an option. I was really surprised, because to me that’s the one reason where I’d recommend using a cloud service.

                                                        Also, growth often tends to be a lot more predictable than people imagine. Yes, there are exceptions, but there are a couple of things to keep in mind. You actually need to be prepared to scale properly, even when relying on the cloud. I’ve seen more than one company saying that they can easily scale, because cloud, when in reality that wasn’t the case. Luckily they found out using load tests, though.

                                                        Another thing is that, with well-engineered non-cloud services, it should not be too hard to switch or do a hybrid solution in such a situation. If not, your employees probably aren’t ready for the cloud.

                                                        And most importantly, a lot of “big cloud provider” statements are also true for “big hosting/colo providers”.

                                                        1. 1

                                                          I’m thinking about public web services I worked on… they probably had 100K or 500K users (?). There were a few sources of load variation:

                                                          1. natural user cycles – day and night, weekend vs. weekday (although I guess provisioning for peak is fine)
                                                          2. crawlers and spam – can take up to 90% of your load
                                                          3. user growth
                                                          4. “slashdot effect” – reddit / Hacker News

                                                          And also “data science” workloads, which were EXTREMELY bursty – e.g. at Google, running a job with 1000 machines for 3 hours, and then trying again a few days later, etc.


                                                          For a data point on the other end of the spectrum, I also use a CI service right now for open source projects. It doesn’t really make sense to self-host the CI because I push between 1 and 10 commits a day. And for those commits I start up 8 VMs in parallel and try to get them to complete as fast as possible (on GitHub Actions).

                                                          So that is very low load, but still very bursty.

                                                          I should note that I don’t use any cloud VMs at all now :) I just use Dreamhost for my blog, which is static, and it’s great. Honestly a single shared machine that someone else manages and patches is great. And I even have a little dynamic content.

                                                          So I try to avoid cloud VMs, but I also don’t want to manage physical servers. I think “serverless” is what makes sense for most programmers and many programming teams, although the current offerings aren’t great …


                                                          Interesting about GCS not being flexible. There are definitely a lot of downsides to the cloud… I don’t think it is more reliable than a single machine. But if you ask me to administer a physical machine for most of my projects, I’ll also say “no thanks”. And I think hiring people to do that can be a pain, and most people/companies don’t want to be “beholden” to them.