The interesting thing to me is that his alternative makes more “assumptions” than what he’s comparing it to. He can “assume” a fixed goal and optimize the code for that goal. The “big data systems” are trying to be general-purpose compute platforms and have to pick a model that supports the widest range of possible problems.
At the end of Mike Acton’s now infamous “Data-Oriented Design and C++” talk at CppCon 2014, the second question he is asked (around the 1:11:20 mark on YouTube) is: “All of these examples that you provided about optimizing for the cache line are very extraordinary, it’s very very interesting. Totally agree with that. I find it extremely hard to deal with when we’re dealing with different platforms that we have no idea where the program is going to run, how big the cache line is going to be, or anything like that”
Acton interrupts the question asker “Sorry, I’m going to interrupt you before you get lost… ‘I have so many platforms that I don’t know the characteristics of those platforms’ that’s not true, it may be true that you don’t know, but it isn’t true that there isn’t a finite set of characteristics, finite set of range. It should be within some range, you should know the min, the max, the average of the things you’re likely to be dealing with. You should know the common case. You’re not going to be putting the same solution on a Z80 [8-bit processor launched in 1976] and a Google server farm. It’s unlikely that this gonna solve that range of problem space. So what is the finite space you’re working with? What is the finite set of chip designs you’re going to be working with? What are [their] requirements?… There is a finite set of requirements that you have to have, you need to articulate them, you need to understand what that range is. This idea of general portability is a fool’s errand, honestly.”
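To make the “know your range” idea concrete, here’s a minimal C++ sketch (mine, not from the talk; the 64- and 128-byte figures are assumptions standing in for whatever your actual target list says):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Assumed range for the platforms we actually ship to, not a universal truth.
    constexpr std::size_t kMinCacheLine = 64;   // assumed common case (x86-64, most ARM)
    constexpr std::size_t kMaxCacheLine = 128;  // assumed largest line we care about

    // Keep the hot fields small enough to fit one line on the smallest target.
    struct HotParticle {
        float x, y, z;
        float vx, vy, vz;
        std::uint32_t flags;
        std::uint32_t generation;
    };
    static_assert(sizeof(HotParticle) <= kMinCacheLine,
                  "hot data no longer fits a single cache line on the smallest target");

    // Align the storage to the largest target so hot objects start on a line boundary.
    alignas(kMaxCacheLine) HotParticle particles[1024];

    int main() {
        std::printf("sizeof(HotParticle) = %zu\n", sizeof(HotParticle));
    }

The point is that both ends of the range are explicit, checked constants rather than something the code silently hopes about.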
Acton, Muratori and Blow live in their own little world (which is fine) and are extremely arrogant about it and don’t acknowledge the existence of people who don’t work in their domain (that’s bad, if anyone is wondering).
SQLite does not get to make a lot of assumptions about what it runs on. It certainly can try to optimize for the typical case (not sure what that is, maybe an Android mobile phone?), but even that probably isn’t the majority of its deployments.
Or another example: HotSpot only recently deprecated the 32-bit Windows port: https://openjdk.org/jeps/449.
FWIW, in my day job, I get to make a lot of assumptions: I’m writing code for a specific JVM version. It will only run on cloud servers or developer laptops.[0] The RAM will probably be within a 2-4x range of a median value, etc. So, in fact, I’m probably closer to Acton’s world in that regard than I am to SQLite. Of course, I’m also much more at the whims of actual users: I’m not writing a game where there are plausible constraints on how many of each object will be in memory. On any given instance, I may have 100 users doing X, and 0 doing Y, or vice versa.
[0] Well, excepting the one project that will only run in CI or on end-user laptops on another specific version, but same principle.
This suggests that if you can find a limit somewhere else, you can get both high horizontal scaling and high efficiency. Supposedly the Tigerbeetle database has both, but that could be because they limit all records to accounts and transfers. This means every record fits in exactly 128 bytes.
I am not sure this intends to say what it says, but, no, TigerBeetle is not horizontally scalable in the common sense of the term. TigerBeetle is single-threaded (but very concurrent)! It is still very fast in an absolute sense, for the COST reasons (horizontal scalability is at odds with contention, and accounting is very contentious), but you can’t make it faster by just adding more machines.
TigerBeetle is a cluster of six replicas for fault tolerance & durability reasons, not to make it go faster (excluding a funny theoretical edge case where, due to tail latency tolerance, a six-replica cluster makes progress when the three fastest replicas make progress, which, on average, is faster than the average single replica).
It intends to say the incorrect thing. I was noodling these ideas over in public somewhere and someone said “ah, Tigerbeetle is both efficient AND scales to multiple machines!” So I read more and tried to figure out if that actually meant more machines = more compute. Guess I got it wrong!
Ah, good! I was fearing that our own messaging somewhere is misleading! As a rule of thumb, TigerBeetle’s just weird and easy to misunderstand/misinterpret, so it’s good to double-check third parties!
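Back to the “every record fits in exactly 128 bytes” line in the quote upthread, here’s an illustrative fixed-size record in C++. The fields are invented for the example and are not TigerBeetle’s actual schema; the point is only that the size is pinned at compile time, so records pack densely and batch sizes are exact multiples.

    #include <cstdint>
    #include <cstdio>

    // Illustrative only: a fixed 128-byte ledger record (NOT TigerBeetle's real layout).
    struct alignas(16) Transfer {
        std::uint8_t  id[16];                 // 128-bit identifier
        std::uint8_t  debit_account_id[16];
        std::uint8_t  credit_account_id[16];
        std::uint8_t  user_data[16];
        std::uint64_t amount;
        std::uint64_t timestamp;
        std::uint32_t ledger;
        std::uint16_t code;
        std::uint16_t flags;
        std::uint8_t  reserved[40];           // explicit padding keeps the size pinned
    };
    static_assert(sizeof(Transfer) == 128, "record size must stay exactly 128 bytes");

    int main() {
        std::printf("sizeof(Transfer) = %zu\n", sizeof(Transfer));
    }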
Maybe there’s also a cultural element to this conflict. What if the engineers interested in “efficiency” are different from the engineers interested in “horizontal scaling”?
[…]
The highest-tier of horizontal scaling is usually something large businesses want, and large businesses like problems that can be solved purely with money. Maximizing efficiency requires a lot of knowledge-intensive human labour, so is less appealing as an investment.
This is IMO the biggest contributing factor: increasing resources is a repeatable operation which can be done in a known amount of time, while optimising a piece of code is risky and might not pan out. If the company aims to minimize risk (which is what most companies do), they’ll optimise for lateral scalability in order to decrease the likelihood of being taken down due to a sudden increase in traffic.
In comparison, increasing efficiency might make it possible to save a few tens of thousands in server costs, at the cost of having to hire engineers specialised in low-level optimisations and keeping them happy, because if they leave you might not be able to replace them with fresh grads and keep your per-node efficiency.
This is clearly a losing move on the business side unless efficiency is what you sell (fintech, DB vendors, etc).
If the company aims to minimize risk (which is what most companies do), they’ll optimise for lateral scalability in order to decrease the likelihood of being taken down due to a sudden increase in traffic.
I don’t think it’s that clear-cut? Adding or increasing horizontal scaling also introduces or increases several kinds of performance and correctness risks around network partitions, contention for shared resources, failover strategies, tail latency, concurrency bugs, etc.
I feel like it’s a matter of trading off one set of risks against another, not of one being less risky and the other more risky.
I mostly meant my comment as a general explanation of why laterally scalable architectures are generally favoured by companies, and why more resources and time are spent on adopting technologies that can help you scale laterally vs investing in improving performance.
(Being the devil’s advocate) I’d much rather work on performance optimisation myself.
increasing efficiency might make it possible to save a few tens of thousands in server costs
Your “might” seems excessively skeptical to me. Dan Luu has written several times about making optimizations that saved 10x his salary per year, and this kind of efficiency improvement is why large scale service providers hire engineers that work on the Linux kernel or the Java runtime or develop JITs for Ruby.
This evening one of my friends was talking about optimizing a heavily-used object from 40 bytes to 32 bytes. He works on a proprietary JavaScript runtime, for a service that has a lot of embarrassingly parallel scalability, but it’s at a large enough scale that small efficiencies have big enough multipliers that they are worth the effort.
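As a made-up illustration of that kind of change (hypothetical fields, not his actual object): reordering fields and folding a flag into spare bits takes a 40-byte layout down to 32 bytes, which matters once you multiply it by billions of objects.

    #include <cstdint>
    #include <cstdio>

    // Before: 40 bytes on a typical 64-bit ABI, mostly due to padding around the bool.
    struct Before {
        void*         shape;      // 8
        double        value;      // 8
        std::uint32_t refcount;   // 4
        bool          pinned;     // 1 (+3 bytes padding)
        void*         next;       // 8
        std::uint32_t hash;       // 4 (+4 bytes tail padding)
    };

    // After: same information, fields reordered and the flag folded into spare bits.
    struct After {
        void*         shape;          // 8
        double        value;          // 8
        void*         next;           // 8
        std::uint32_t refcount;       // 4
        std::uint32_t hash_and_flags; // 4: hash in the low 31 bits, "pinned" in the top bit
    };

    int main() {
        std::printf("before: %zu bytes, after: %zu bytes\n", sizeof(Before), sizeof(After));
    }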
Betteridge’s law of headlines is true for this essay.
(Extended version of my comment on this on HN)
(Splitting single-threaded software across multiple threads/processes is sort of a blend of (2) and (3).)
The line between (1) and (3) is kinda arbitrary, at least from an algorithmic perspective, in that there’s a significant overlap between the best approaches at the high end of (1) and at the tightly-coupled end of (3).
The big benefit of (1) is that we (usually) don’t have to make any changes to the software to get a speedup.
You should meet my dearest enemy, NUMA. Also all kinds of things with cache hierarchies in modern multi-core and multi-socket, and SMT (another enemy of mine), and cache coherence limits, and hidden limits like memory bandwidth, and so on and so on.
Machines can’t share cache, RAM, or disk, period.
Sure they can. Shared disk is common (NFS, Lustre, etc). Shared RAM is fairly common, although less so, and not necessarily good for most workloads. Shared CPU caches have existed, but I don’t know of a common system with them. Shared page caches are a very common design pattern in databases (e.g. https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf).
Scaling also means that separate machines can’t reuse resources like database connections.
This is a problem, but mostly a silly problem because of the legacy cost of DB connections. There’s no fundamental reason connections should be expensive or limited at this level.
Distribution also means you have fewer natural correctness guarantees, so you need more administrative overhead to avoid race conditions.
I don’t believe this is true.
Sometimes! There’s a whole body of research about when distributed algorithms require coordination. One example is the CALM theorem (https://arxiv.org/abs/1901.01930), and another is the ways that scalable database systems avoid read coordination (https://brooker.co.za/blog/2025/02/04/versioning.html).
If we know the exact sequence of computations, we can aim to minimize cache misses.
Sure, but it’s not clear why this is possible in a local context and not a distributed one (and, in fact, it may be easier in the distributed context). One example of how it’s easier in the distributed context is snapshotting in MemoryDB (https://brooker.co.za/blog/2024/04/25/memorydb.html).
But then you lose the assumption “recently inserted rows are close together in the index”, which I’ve read can lead to significant slowdowns.
Or significant speed ups because you avoid false sharing!
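For anyone who hasn’t run into it: false sharing is two threads writing different variables that happen to live on the same cache line, so the line bounces between cores. A minimal sketch (the 64-byte line size is an assumption):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    constexpr std::size_t kAssumedCacheLine = 64;  // assumption, not detected at runtime

    // Both counters share one cache line, so the line ping-pongs between cores.
    struct SharedLine {
        std::atomic<std::uint64_t> a{0};
        std::atomic<std::uint64_t> b{0};
    };

    // Each counter gets its own line, so the two writers don't interfere.
    struct PaddedLines {
        alignas(kAssumedCacheLine) std::atomic<std::uint64_t> a{0};
        alignas(kAssumedCacheLine) std::atomic<std::uint64_t> b{0};
    };

    template <typename Counters>
    void hammer(Counters& c) {
        std::thread t1([&] { for (int i = 0; i < 10000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 10000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
    }

    int main() {
        SharedLine shared;
        PaddedLines padded;
        hammer(shared);  // identical work to the call below, but typically noticeably slower
        hammer(padded);
        std::printf("%llu %llu\n",
                    (unsigned long long)(shared.a.load() + shared.b.load()),
                    (unsigned long long)(padded.a.load() + padded.b.load()));
    }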
Maybe there’s also a cultural element to this conflict. What if the engineers interested in “efficiency” are different from the engineers interested in “horizontal scaling”?
This is, of course, a false dichotomy. Distributed systems don’t only (or even primarily, in most cases) exist for scaling, but also for availability, resilience, durability, business continuity, and other concerns.
To make this purely about scalability is naive.
I’m not sure where this fits in, but scaling a volume of tasks conflicts less than scaling individual tasks.
Indeed. Coordination avoidance is the fundamental mechanism of scaling. This is visible in CALM, in Amdahl’s law, and in many of the other frameworks for thinking through this space.
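To put rough numbers on why scaling a volume of independent tasks is so much friendlier than scaling one task: Amdahl’s law caps the single-task speedup by its serial fraction, while independent tasks have no shared serial fraction to coordinate on. A back-of-the-envelope sketch, with a 5% serial fraction as the assumed example:

    #include <cstdio>

    // Amdahl's law: speedup of ONE task whose parallel fraction is p, on n workers.
    double amdahl_speedup(double p, double n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main() {
        const double p = 0.95;  // assumed: 5% of the task is inherently serial
        std::printf("one task,    8 workers: %.1fx\n", amdahl_speedup(p, 8));    // ~5.9x
        std::printf("one task, 1000 workers: %.1fx\n", amdahl_speedup(p, 1000)); // ~19.6x, capped at 20x
        // Many independent tasks: throughput grows roughly linearly with workers,
        // because there is no serial fraction shared across tasks to coordinate on.
        return 0;
    }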
If you have 1,000 machines and need to crunch one big graph, you probably want the most scalable algorithm. If you instead have 50,000 small graphs, you probably want the most efficient algorithm, which you then run on all 1,000 machines
False dichotomy again. Algorithms can be both efficient and scalable, and the true shape of the trade-off between them is both super interesting and an ongoing research area.
I normally enjoy Hillel’s writing and thoughtfulness, but this post seems like a big miss.
Scalability! But at what COST?
At the time I was like “yeah, you’re right, big data IS stupid!” But I think now that we both missed something obvious: with the “scalable” solution, the data scientists didn’t have to write an optimized script for every single query. Optimizing code is hard, adding more machines is easy!
There’s a fallacy here that the data scientists needed to do any optimization. The cliché is “my awk script outperforms your hadoop cluster”. And that’s true a fortiori for a dataset of a few dozen gigabytes that fits in RAM of a machine that’s more affordable than a hadoop cluster even when you also buy a hot spare.
Finally, a decent fucking question
I think this is an interesting question, but can be shown to be “not always” via a counter-example. For instance, the Kademlia distributed hash-table algorithm is equally efficient when N=1 [N := number of nodes] as a normal hash-table (there is no need to look up which node the key is on; it can only ever be on one node). There’s a rough sketch of this at the end of this comment.
The point still stands that many distributed solutions to a problem are slower than single-node algorithms depending on the size of the problem (the most famous being the pagerank example described in the article), but this is generally when comparing running the solution on multiple computers vs. one node, and not when running the distributed algorithm at N=1. Still, a distributed solution only makes sense if the problem can be constrained in some way such that there is a significant parallelizable portion, which may or may not lead to the most efficient single-node solution.
My intuition here is that whether a distributed algorithm at N=1 is as efficient as a single-node algorithm depends on the problem space, and no general relationship exists. Certainly a distributed algorithm with N>1 will be slower than a single-node algorithm when the data size is small, but that is simply due to the (somewhat) constant-factor overhead of coordination.
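Here’s the rough sketch mentioned above. This is not real Kademlia (no k-buckets, no iterative lookups), just the XOR-distance routing idea, enough to show that at N=1 the “which node owns this key” step is a no-op and what’s left is a plain local hash table:

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <optional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy store: route each key to the node whose ID is closest by XOR distance
    // (the Kademlia metric), then do a local hash-table operation on that node.
    struct Node {
        std::uint64_t id;
        std::unordered_map<std::string, std::string> local;
    };

    struct Cluster {
        std::vector<Node> nodes;

        Node& owner(const std::string& key) {
            std::uint64_t k = std::hash<std::string>{}(key);
            Node* best = &nodes.front();
            for (Node& n : nodes)
                if ((n.id ^ k) < (best->id ^ k)) best = &n;
            return *best;  // with one node, this is always that node: no routing cost
        }

        void put(const std::string& key, const std::string& value) { owner(key).local[key] = value; }

        std::optional<std::string> get(const std::string& key) {
            auto& m = owner(key).local;
            auto it = m.find(key);
            if (it == m.end()) return std::nullopt;
            return it->second;
        }
    };

    int main() {
        Cluster single;
        single.nodes.push_back(Node{42, {}});   // N = 1: behaves like an ordinary hash table
        single.put("account:7", "balance=100");
        if (auto v = single.get("account:7")) std::printf("%s\n", v->c_str());
    }

With more nodes the same owner() loop spreads keys across machines while the local get/put is unchanged, which is the sense in which the N=1 case costs nothing extra.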