OP is an engineer at AWS and his arguments make sense from the standpoint of a large tech conglomerate with high availability requirements (yes, including us-east-1).
Not all companies are at the stage where throwing bodies at DevOps for the sake of tail latency makes economic sense.
He didn’t really make an argument from the standpoint of a large tech conglomerate; he just said “distributed systems confer these other properties besides scale”. If you care about those properties (whether or not you are a large conglomerate) you should consider distributed systems.
In particular, Kubernetes has matured quite a lot; it’s pretty easy to get a Kubernetes cluster up and running even at small scale, and you get a bunch of stuff for free that you would have to cobble together yourself from disparate tools if you were managing everything on one or two hosts. Additionally, because Kubernetes is standardized, you benefit from a vast pool of personnel who are experienced with it, as well as online articles and StackOverflow posts that target it in particular.
The whole point of the post was that performance isn’t the only benefit of distributed systems.
I feel like this article is bound to confuse and get people talking past each other, because
It doesn’t name any examples of applications / workloads. I’d say it makes a broad comparison between an “ideal distributed system” and an “ideal single machine”, neither of which exists in practice
It presents these 6 things as advantages of distributed systems: availability, durability, latency, specialization, isolation, changes
I think it would be better to
name and categorize the application / workloads
present each property you want as a tradeoff along a spectrum
For example, this is a very broad claim about availability that I think really needs a bunch of caveats about what happens with specific applications in practice:
Distributed systems achieve exponentially better availability at linear cost, in a way that’s almost entirely hands-off.
When I click on the link – https://brooker.co.za/blog/2023/09/08/exponential.html – it acknowledges precisely when this isn’t true: correlated failures.
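For reference, the “exponentially better at linear cost” part is just independent-failure arithmetic – a toy sketch with made-up numbers, and it models none of the correlation discussed below:

```python
# n independent replicas, each available with probability a: cost grows
# linearly with n while unavailability shrinks exponentially.
# (Illustrative numbers only; correlated failures break this assumption.)
a = 0.99
for n in (1, 2, 3, 4):
    print(f"{n} replica(s): cost {n}x, unavailability {(1 - a) ** n:.0e}")
```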
The problem is that cloud systems now have single points of failure that by themselves are more complex than a single machine.
A recent example is Unisuper being down for multiple days because of an administrative error in Google Cloud. That is an example of a distributed system – across multiple data centers, as I understand it – failing like a single machine.
It’s a correlated failure – https://news.ycombinator.com/item?id=40304666
On the other hand, I’m not exactly anti-cloud – I definitely don’t want the situation that sourcehut faced – hosting provider unable to handle DDoS, having to ship physical servers in the mail, and then losing the servers:
https://sourcehut.org/blog/2024-06-04-status-and-plans/
So there is an extremely complex set of tradeoffs here, and I don’t think the article is acknowledging that – I think it is kind of reacting to some people “taking a side” by “taking the other side”
I think we haven’t figured it out, and building distributed systems is still a big pain that requires a ton of work from many people. And often you still get a pretty bad result. I think that is what the “single machine” camp is reacting to.
I do like @trishume’s “Twitter on one machine” post because it’s concrete, talking about a specific application, and so more likely to lead to an illuminating discussion - https://thume.ca/2023/01/02/one-machine-twitter/
On the other hand, I’m not exactly anti-cloud – I definitely don’t want the situation that sourcehut faced – hosting provider unable to handle DDoS, having to ship physical servers in the mail, and then losing the servers:
When I saw sourcehut’s post, I immediately thought about all of the anti-cloud posts that have come up recently. It is a good reminder that there are no easy choices. It’s tradeoffs all the way down.
Yeah unfortunately the thread got deleted - https://lobste.rs/s/6sf46r/state_sourcehut_our_plans_for_future#c_kakzv4
Here was the first part of my comment …
Damn I guess this is a case study in why the cloud became popular. Getting DDoS’d, having to migrate, and then losing physical servers in shipping is honestly crazy. I’m a software person and couldn’t deal with that stress :-/
I wonder which cloud provider/product people would choose for something like sourcehut if you just want servers, not necessarily other services like all the AWS/Google Cloud stuff.
And what is the price premium? I’m sure you pay a lot, but having the outlier “once in 3 years” events handled by professionals seems like a good tradeoff. Then again, there’s no guarantee that if they’re DDoS’d they can handle it in a reasonable way … Not sure how that works with smaller cloud providers.
As perhaps one of the somebodies making the argument that they refer to, I’d like to say that my point has always been that scale and security are the only necessary reasons for having more than 3 computers (which all run the same workload in sync for durability and reliability, hence the workload needs to fit on one machine).
The other reasons the post gives are all very real in terms of practical tradeoffs today, which is why I don’t actually recommend people build single-machine systems. But they’re tradeoffs due to inadequate software rather than due to benefits from additional hardware. Some of this tech already exists inside trading companies, although not all of it.
I unfortunately have never fully written up all the parts of my vision for what the software could look like to address all the other considerations in a way such that 3-machine systems could be faster, easier and simpler than many-machine systems.
But they’re tradeoffs due to inadequate software rather than due to benefits from additional hardware.
I don’t think this is true, at least for availability and durability.
Let’s take availability, for example: there’s a “small number” effect if you have two or three machines where, to survive any failure at all, you need either 2x or 1.5x extra capacity. That caps utilization at 50% or 66%, respectively. Systems with larger numbers of machines can achieve significantly higher utilization (and lower cost), depending on the failure rate of their hardware and the achievable repair rate.
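A quick back-of-the-envelope sketch of that cap (the machine counts below are just illustrative):

```python
def max_utilization(machines: int, tolerated_failures: int = 1) -> float:
    """Fraction of total capacity you can use while still absorbing
    `tolerated_failures` machine losses without dropping load."""
    return (machines - tolerated_failures) / machines

# Surviving a single failure: 2 machines cap you at 50%, 3 at ~67%,
# and the cap keeps rising as the fleet grows.
for n in (2, 3, 10, 100):
    print(f"{n} machines -> {max_utilization(n):.0%} usable")
```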
Durability is even more stark: using reasonable parameters typical of modern storage devices, a 3-of-6 code offers about 8 orders of magnitude more durability than 2-replication at the same overhead, and 3 orders of magnitude better than 3-replication at half the overhead (on storage). That’s at the cost of IOPS, of course, and the independence assumption, but shows that there’s a lot of room for exponential improvement beyond 3 machines.
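And a toy model of the durability comparison, under the same independence assumption; the 2% annual failure rate and one-day repair window are assumed parameters, purely for illustration:

```python
from math import comb

def loss_probability(n: int, tolerated: int, p: float) -> float:
    """P(more than `tolerated` of n devices fail within one repair window),
    assuming independent failures with per-window failure probability p."""
    return sum(comb(n, f) * p**f * (1 - p)**(n - f)
               for f in range(tolerated + 1, n + 1))

p = 0.02 / 365  # assumed: 2% annual failure rate, ~1-day repair window

schemes = {
    "2-replication (2x storage)": (2, 1),        # survives 1 lost copy
    "3-replication (3x storage)": (3, 2),        # survives 2 lost copies
    "3-of-6 erasure code (2x storage)": (6, 3),  # survives 3 lost shards
}
for name, (n, tolerated) in schemes.items():
    print(f"{name}: ~{loss_probability(n, tolerated, p):.1e} loss probability per window")
```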
Now, I would agree that the software and services aren’t there yet (it’s why I’ve spent the last decade of my career on serverless).
Most systems are not mission critical. They don’t require high availability and crazy durability guarantees. Distributed systems incur huge complexity and operational costs that, assuming they actually work, are probably an order of magnitude more expensive for every extra 9 of uptime/reliability. And this saves you from what? Maybe a few minutes of additional downtime a year? I’d rather have a few hours of downtime for a monolith that’s hosted on one machine (maybe with a prod-parallel backup host that I can cut over to if availability actually matters) and that’s easy to debug by reading a few log files, than deal with a microservice-laden enterprise monstrosity that’s partially broken all the time, thanks.
And that’s assuming distributed systems actually do provide more reliability (which I’m not fully convinced of anyway). Jepsen tests regularly show that distributed data stores can’t actually meet their consistency and durability claims under extreme conditions, and databases are designed with more care and testing than your average business app. Regarding performance and utilization, I’m used to seeing very high latencies in distributed systems. Profiling and tuning them is much harder than optimizing a single host. And reading a couple of MBs of data off a co-located SQLite db is going to be much faster than sending a request to a remote store and then playing whisper down the alley with serialized/deserialized JSON messages sent over flaky network connections.
Unless the problem the system solves is dead simple and is clearly better solved by distributing it (e.g. object storage like S3 or CDNs), it’s going to take a lot to convince me.
(sorry this turned out to be long and anti-scaling, when it’s not meant to be. Please don’t take this as don’t scale. Take this as: don’t just assume scaling will fix your issues, because it won’t do that on its own, like the article makes it sound to me)
Yes and no.
This argument is silly and reductive.
The same could be said about the article.
This argument is based on a kernel of truth
The same could be said about the article.
Here is why:
Yes, availability is/can be another feature. However, I’ve witnessed multiple instances where the added complexity was to blame for missing availability targets. More complex systems mean more things to potentially go wrong.
Easy maintenance targets might become hard. Also, scalability is not a zero-or-one thing. For many real-life, non-trivial, not-brand-new applications, being able to deploy something on two, five, ten, a hundred, or thousands of instances are very different things.
Scaling comes with overheads that are often not considered initially. These range from performance issues to deployment times and more.
Disaster recovery can be a lot harder, and with big infrastructure management tools it can be easy to do something wrong. Of course that’s avoidable in day-to-day business, but when you have to hack something together to quickly get things up again (something that will happen eventually), the results of mistakes also scale.
Then there is sometimes a bit of off math when calculating the random likelihood of a physical computer failing. For example, if you use one server as a load balancer in front of two application servers and only consider physical availability (e.g. hard drives dying, network cards having issues, PSUs failing, etc.), you are more likely to be down, because you switch from needing one server to be up (your single application server) to needing the load balancer to be up plus at least one of the application servers. Of course that calculation is different if you assume that what most likely goes down is one application, but if they all run the same code that’s certainly not a given.
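A quick sanity check of that math, assuming (purely for illustration) that every box has the same 99% hardware availability:

```python
a = 0.99  # assumed hardware availability of any one server

single_app_server = a
# The load balancer must be up AND at least one of the two app servers.
lb_plus_two_apps = a * (1 - (1 - a) ** 2)

print(f"single app server:    {single_app_server:.6f}")  # 0.990000
print(f"LB + two app servers: {lb_plus_two_apps:.6f}")   # 0.989901
```

On hardware failures alone the three-box setup comes out slightly worse, because the load balancer becomes a new single point of failure that the redundant application servers can’t compensate for.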
Durability. Making multiple copies of data on multiple machines is the only credible way to make online data highly durable. Single-system approaches like RAID are based on a fundamentally flawed assumption about failure independence. Offline data can be made durable with offline backups, of course.
I think that’s a weak point. “Can be made durable with offline backups, of course” is really not how anyone should ever approach durability. Having proper, tested backups is the absolute minimum, no matter how often you replicate your data otherwise. If you have replicated data, many scenarios for losing that data still leave you with ZERO copies: everything from dropping tables, dropping databases, migrations going wrong, data issues, malicious activity, etc. NEVER rely only on having multiple live/connected copies. This is also true for cloud storage, which is hugely overlooked. Not only do accidents happen where someone or something removes your bucket (with or without malicious intent) – the most recent example I heard of was Google accidentally deleting a big customer’s data. That is Google making the mistake, not someone else. Do not assume that won’t ever happen to your S3!
Utilization. Multi-tenant systems achieve lower costs and higher utilization by packing multiple diverse workloads onto the same resources.
I hear that claim many, many times, and outside of very specific scenarios it is rarely actually the case for real-life systems. It is largely only theoretically correct; unless you only care about theoretical systems, actually do the math / look at the stats. And make sure you look at the complete picture and don’t compare what you have with what you think you could do better.
Latency. By being able to spread short-term spikes of load over a larger pool of resources, distributed systems can reduce tail latencies caused by short-term system overload.
Latency might be the very reason you want to avoid scaling horizontally and instead scale vertically (simply getting a stronger machine) for as long as you can. Even if you do scale horizontally, it might be why you’d prefer a few bigger machines over many smaller ones. It’s also an area where not over-utilizing systems is important – again, even if you don’t scale horizontally. For many real-life applications (not all, though) there are situations where you don’t want to simply auto-scale horizontally, for example because spinning up another instance of your application can take a bit. Usually you can optimize for this, but that’s shifting costs from infrastructure to people/salaries.
This is a general issue I have: it assumes a perfectly set-up infrastructure, which you rarely end up with, and which you want people who really know their stuff working on full time – which means you pay the very cost you were trying to save through utilization. I have yet to find someone who can verifiably say that their total cost (employee/work time plus infrastructure) is lower. Even people making money this way admit that’s usually not the case, and if it is, it’s usually some free-tier, more-or-less single-server setup.
Isolation. Building systems out of multiple components allows components to be optimized for the security properties that matter to them. For example, notice how in the Firecracker paper, Lambda’s architecture combines strongly-isolated components that execute customer code, with multi-tenant components that perform simple data-lookup tasks.
Guess what these things run on just as well: Single servers.
Changes. Perhaps the most universal requirement in systems, and one that is frequently overlooked, is the ability to deal with change. Deploying new code as the business evolves. Patching security issues. Reacting to customer issues. In distributed systems, its typical to take advantage of the high availability mechanisms of the system to allow for safe zero-impact patching and deployments. Single-machine systems are harder to change.
Yes, I agree. Sometimes. Sometimes it also makes changes harder, though – sometimes a lot harder. And sorry, but zero-downtime deployments are not something that requires multiple servers.
However, please also don’t assume there are no bugs. Other tenants that can potentially break in on a public cloud are a real threat, and only a few years ago someone noticed that the underlying physical machines on AWS weren’t patched against some older Intel CPU bugs; Amazon said that a few instances were somehow missed. Don’t assume big companies have some magic protecting them from bugs, security issues, or mistakes. Yes, they are good at their jobs; yes, they probably have people smarter than you; but they also make mistakes, especially because they deal with A LOT MORE complexity and are a much more valuable target. And while they do have a lot of smart people (not all of whom will be working on your product), if they are big they probably also have a lot of dumb people, people who don’t care, who are burned out, or who might hate their employer. So again: don’t expect magic, don’t think only gods work there, and don’t expect no mistakes, bugs, or errors.
And overall: once you do move to multiple servers, you can do it by adding yet another single server. Or a dozen, or a million.
My overall point is that these things are all trade-offs: trade-offs of costs, trade-offs of overheads, trade-offs of complexity, and trade-offs of failure scenarios. Many things are also somewhat application-dependent, and that’s why companies mix single servers with horizontally scaled systems.
Don’t get me wrong: do scale out. This isn’t really a question of one or the other being better. The reason people argue for single servers is that a lot of the time they make more sense. Also, if you design your software well enough, scaling out is something you absolutely can do at a later point, and something you can do incrementally. I’d argue most new software written today needs extremely little work, or only some config changes, as long as you don’t go from one machine to hundreds or thousands. Yes, you might (or might not) need to change the deployment.
But also: maybe actually sit down and see what you really need and what the risks are, and don’t make scaling something that will magically solve all your problems.
If you’re the one responsible for fixing and maintaining and upgrading and securing the large complex distributed system you’re advocating for, then that’s fine. You can maintain it.
I would rather not do those things incurred by that decision. An architect deciding these things isn’t the one doing the work. You might find the people with hands-on experience with the systems the author is recommending don’t enjoy working on them.
If people leave the team, then you have to make sense of it all, and that’s unpleasant.
My hobby is parallelism and multithreading and trying to come up with something new that is intuitive and easy to use. In my hobby - yes I am loading up on complexity. I am experimenting with monads right now. At work, if I’m responsible for it, it’s an ETL pipeline.
I don’t think I enjoy upgrading microservice architectures or Kubernetes.
You might find the people with hands-on experience with the systems the author is recommending don’t enjoy working on them.
I have a lot of hands-on experience with these systems, and I enjoy working on them a great deal.
Here’s a practical example: about 15 years ago I was working on a system where the nights-and-weekends operational load was quite high. When we dived into why that was, what we found was that the (fairly large, partitioned) system was designed so that any single host failure caused an outage for some set of customers. Over the following months we redesigned it, moved to sync replication, automated failover, and got to the point that any single host failure had no customer impact and could be handled at 9AM on Monday. Eventually even that handling was automated. That was a huge quality-of-life improvement for both us and our customers. That was only possible because it’s a distributed, replicated system.
Did you build that functionality yourself?
I am uncomfortable inheriting a large complex system because I didn’t build it.
I enjoy working on systems I built myself.
When someone leaves that environment, lots of knowledge goes away and velocity drops because of the complexity. It may even become legacy :-(
Like most discourse on software scaling, this seems to focus on web services. However, I’ve been working a lot more with Spark recently and I have been wondering if big machines might be better for some workflows. Going over the properties listed in the article:
availability: Less important for a scheduled offline job. Downtime in minutes might not make a difference for a multi-hour job. It’s also fine if spinning up a machine for a job takes minutes.
durability: Less important on the job itself, but still important for the input and output data.
utilization: This seems like a potential blocker. Some languages aren’t equipped to use the resources of a large machine (e.g., python, node). Some services are going to rate limit each IP so you need to scale to multiple machines.
latency: Single jobs don’t care about latency spikes.
specialization: You can pick the right VM for the workload.
isolation: If you are running a job on a single VM then it seems plenty isolated.
changes: In the real world, workloads don’t stay static so the right VM can change over time. Frameworks like Spark are better equipped to deal with that variation over time because they are chunking and distributing the work.
The big win would be decreasing network traffic in the job. You don’t have to think about broadcasting or shuffling data. All the data lives in memory on a single machine. There could also be knock-on savings in memory and CPU, since you are no longer serializing and deserializing the data being transferred.
Obviously, the impact this would have on your job depends on the profile of your job. If it is embarrassingly parallel, you are probably safe with a cluster based approach. If it is really network bound in its operation, I wouldn’t be surprised if a single node was much faster. (If it is network bound in its initialization, you probably have to stick with a cluster in order to load the data from multiple machines.)
Ultimately, if your job is 10x faster, I think you can afford to have less durability since it is much easier to re-run the job.
Now, everything I have said is nice and all, but the problem is you might not know the performance characteristics of your job until you have written it. That means when you are beginning your project you have to make a decision: do I use the general purpose framework or do I start with something specialized?
Premature optimization being the root of all evil, you probably start with the general purpose framework. I think right now, at a certain scale of company/data, that means you pick Spark. Then you write your job, you see the performance characteristics, you recognize an opportunity to improve performance by an order of magnitude by moving to a single machine, and… you are forced to stick with Spark because no one wants this one off weirdo job that runs on a single machine. It’s not worth the maintenance burden. :|
It’s a shame really because dealing with a cluster doesn’t provide the same immediate feedback of working with a single process. I would much prefer to run the-same-job-on-my-laptop-but-bigger, but there are socio-technical constraints that make it infeasible.
** EDIT **
To clarify one point: Yes, you can run Spark jobs locally. However, it feels more like you are running a cluster on your computer. I would like something closer to taking my small process and making it bigger than taking a big process and making it smaller, if that makes sense.
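For the curious, a minimal sketch of what that local mode looks like (the toy job is made up, just to show the shape): you still configure a “cluster”, it just happens to live in one process.

```python
from pyspark.sql import SparkSession

# local[*] runs the whole "cluster" inside a single JVM, using all local cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("same-job-but-on-one-box")
         .getOrCreate())

# Toy job: identical code to what you'd submit to a real cluster, which is
# exactly the "small cluster on my laptop" feeling described above.
df = spark.range(10_000_000).selectExpr("id % 10 AS k", "id AS v")
df.groupBy("k").sum("v").show()

spark.stop()
```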