The answer? A pretty classic situation: the people I was hanging out with were affecting me, and in ways that, when I examined it, I didn’t like very much.
Oh, absolutely. This is, in some way, a very natural human thing: I talk shit about them, and by doing that I show that I’m one of us. And the rest of us laugh and slap me on the back, and buy me a pint, and I’m happy. But social media amps this up. Suddenly us isn’t a group around a table - it’s tens of thousands of people, and they are also there, hearing the things we’re saying about them. Sometimes they deserve it. Sometimes they are wrong. But, way too often, what we’re saying in these public forums is painful and hurtful and unnecessary. And, to Steve’s larger point, missing context that maybe would suggest that they don’t have it so wrong after all.
Other than art projects, has anyone ever deliberately chosen the wrong tool?
Yeah, I actually think this does happen.
Partially it happens because, as a field, we’re still too concerned with aesthetics and opinions, and not yet really mature about thinking about software creation as a production process. And so we pick tools based on the wrong criteria more often than we would like to admit.
Part of this is healthy. It drives the creation of better tools. Like Rust. And Go. Part is very unhealthy, because it means that we’re making decisions with emotion that shouldn’t be made that way, at least not outside of our hobbies (where, as This Old Tony says, “There’s no why in the home shop”, just do whatever makes you happy).
Anyway, the Eternal September comes for us all.
You know, them. The Johnny-come-latelies that haven’t been on the internet since the Usenet days. The Eternal September people who are ruining it for us enlightened oldheads :) Despite how thoughtful and insightful this post is, it just shows that we’re all works in progress, still learning how to communicate in the best ways.
Despite how thoughtful and insightful this post is, it just shows that we’re all works in progress, still learning how to communicate in the best ways.
Haha, part of what took me so long to write this post was that I knew I’d struggle a bit with saying the right things. And yeah, we’re surely all in-progress.
I don’t view the Eternal September as a story about old vs young, I view it as a story of how a small community can enforce norms that are difficult or impossible once the community grows to a certain size. Not a moral judgement in any way. But, maybe others view it in a different light. :)
The Eternal September story is just a classic hubris-nemesis tragedy, and indeed the only lesson I’ve been able to take from it is “keep the community small and vigorously defend your mores and values”… which is why I’m here on Lobsters. I wish I could come up with a less pessimistic interpretation, though.
There’s another reason to make sure that higher-level applications are pleasant in Rust: it means that people can build their entire stack using one technology.
I’ve seen it play out as we built Aurora DSQL - we chose Rust for the new dataplane components, and started off developing other components with other tools. The control plane in Kotlin, operations tools in TypeScript, etc. Standard “right tool for the job” stuff. But, as the team has become more and more familiar and comfortable with Rust, it’s become the way everything is built. A lot of this is because we’ve seen the benefits of Rust, but at least some is because the team just enjoys writing Rust.
Examples of databases that use eventual consistency are
DynamoDB
Sort-of. DynamoDB is interesting here: it’s actually a strongly consistent database (in that it’s fundamentally “CP”) but will allow readers to opt in to eventually consistent reads as a latency optimization. Writes are always synchronously replicated to a quorum of replicas, and strong consistency is always available to readers. In much the same way, Aurora PostgreSQL allows read replicas which are eventually consistent, while still ensuring that all writes are strongly consistent and ordered and written to a quorum.
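For readers who haven’t used that knob: the consistency choice is made per request. Here’s a minimal sketch with the Rust AWS SDK (aws-sdk-dynamodb); the table name and key are made up for illustration, and module/builder names have shifted a bit between SDK versions, so treat this as the shape rather than exact code.

```rust
use aws_sdk_dynamodb::{types::AttributeValue, Client};

// Hypothetical "orders" table with an "id" key, just to show the per-request knob.
async fn read_item(client: &Client, id: &str) -> Result<(), aws_sdk_dynamodb::Error> {
    // Default: eventually consistent read (cheaper, lower latency).
    let _fast = client
        .get_item()
        .table_name("orders")
        .key("id", AttributeValue::S(id.to_string()))
        .send()
        .await?;

    // Opt in to a strongly consistent read for this one request.
    let _strong = client
        .get_item()
        .table_name("orders")
        .key("id", AttributeValue::S(id.to_string()))
        .consistent_read(true)
        .send()
        .await?;

    Ok(())
}
```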
Similarly to Raft, Paxos protocols are (in terms of CAP theorem) “CP”. Which means the cluster can also become unavailable.
The cluster can always become unavailable in some conditions, no matter what your software does. The only difference, really, between “CP” and “AP” systems is whether it is available on the minority side for writes and strongly consistent reading during a network partition. Nothing is stopping a “CP” system from being available and consistent on the majority side during a partition, but it has to be unavailable on the minority side.
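A rough sketch of that distinction, assuming a simple majority-quorum system (not any particular implementation): during a partition, a node keeps serving strongly consistent operations only if its side of the partition still holds a majority.

```rust
/// In a majority-quorum ("CP"-style) system, a node can keep serving strongly
/// consistent operations during a partition only if its side still holds a majority.
fn can_serve_strongly_consistent(reachable_peers: usize, cluster_size: usize) -> bool {
    let reachable_members = reachable_peers + 1; // count this node itself
    reachable_members > cluster_size / 2
}

fn main() {
    // A 5-node cluster partitioned 3 / 2:
    assert!(can_serve_strongly_consistent(2, 5)); // majority side stays available
    assert!(!can_serve_strongly_consistent(1, 5)); // minority side must refuse (or serve stale reads)
    println!("majority side: available; minority side: unavailable for strong operations");
}
```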
since all writes must have a quorum to be accepted (which implies back-and-forth between the leader and followers), performance can be degraded
You don’t mention durability here, which is a key part of the picture if you’re going to decide whether or not to involve multiple servers in a write. In fact, if you weren’t worried about durability, you could invent a modified version of Raft with significantly better average-case latency properties, but with some probability of data loss on single-machine failure.
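To make the latency/durability trade-off concrete, here’s a toy model (not Raft, and not any real system; the round-trip numbers are invented): acknowledging after the leader’s local append alone is fast but can lose data on a single-machine failure, while waiting for a quorum costs you the slowest member of the fastest majority.

```rust
/// Toy latency model (not Raft): when do we acknowledge a write back to the client?
fn ack_latency_ms(local_append_ms: f64, peer_rtts_ms: &mut Vec<f64>, wait_for_quorum: bool) -> f64 {
    if !wait_for_quorum {
        // Acknowledge after the leader's local append only: lowest latency, but a
        // single-machine failure before replication completes can lose the write.
        return local_append_ms;
    }
    // Wait until a majority of the cluster (leader + peers) has the write: we pay
    // for the slowest member of the fastest majority, but survive any single failure.
    peer_rtts_ms.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let cluster_size = peer_rtts_ms.len() + 1;
    let extra_acks_needed = cluster_size / 2; // leader plus this many peers is a majority
    local_append_ms + peer_rtts_ms[extra_acks_needed - 1]
}

fn main() {
    let rtts = vec![0.4, 0.6, 5.0, 9.0]; // 5-node cluster with two slow/distant peers
    println!("ack without quorum: {:.1} ms", ack_latency_ms(0.1, &mut rtts.clone(), false));
    println!("ack after quorum:   {:.1} ms", ack_latency_ms(0.1, &mut rtts.clone(), true));
}
```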
DynamoDB is interesting here […] will allow readers to opt in to eventually consistent reads
Indeed, I missed the part about “read consistency”.
The only difference, really, between “CP” and “AP” systems is whether it is available on the minority side for writes and strongly consistent reading during a network partition.
That is what I actually wanted to say, though my formulation was inaccurate. In the use case of FlowG, we write more often than we read, so “write availability” is more important than “read consistency”. We want to ingest logs as fast as possible, and we trust that the pipeline will store them in the right place and/or trigger the correct webhooks. There are better tools than FlowG to actually view the data (Kibana, for instance), which is why I actually use FlowG to forward logs to other systems (Datadog, Splunk, Elasticsearch, Zapier, you name it) and only store the minimal amount of data in FlowG.
You don’t mention durability here, which is a key part of the picture
Indeed, that’s a miss on my part, thanks.
Thanks a lot for the corrections! I’ll add them as notes to the article tomorrow evening :)

Edits are done, thank you again :)
There are many ways to divide up the field of software. One way is between tasks where “specification first” speeds things up, and tasks where that slows things down. Many systems- and infrastructure-level tasks are in the first bucket, and most UI, frontend, and CRUD tasks are in the latter bucket. Many things are also a mix of the two, with some places where specifications are known up-front, and some where they are discovered during development.
Most of the heat in the world of TDD (which is one flavor of specification first) comes from folks working on these two classes of problems talking past each other. This article seems like another example of that.
This doesn’t match my experience working with junior developers. It just seems to be that new folks always have a lot to learn, and the amount of knowledge that’s needed to be productive is going up over time.
There are reasons to be concerned about the effect that tools have on the way folks learn, and how raising the level of abstraction reduces the need to learn lower-level details, but this post just seems to be overstating things significantly.
(Extended version of my comment on this on HN)

(Splitting single-threaded software across multiple threads/processes is sort of a blend of (2) and (3).)
The line between (1) and (3) is kinda arbitrary, at least from an algorithmic perspective, in that there’s a significant overlap between the best approaches at the high end of (1) and at the tightly-coupled end of (3).
The big benefit of (1) is that we (usually) don’t have to make any changes to the software to get a speedup.
You should meet my dearest enemy, NUMA. Also all kinds of things with cache hierarchies in modern multi-core and multi-socket, and SMT (another enemy of mine), and cache coherence limits, and hidden limits like memory bandwidth, and so on and so on.
Machines can’t share cache, RAM, or disk, period.
Sure they can. Shared disk is common (NFS, Lustre, etc). Shared RAM is fairly common, although less so, and not necessarily good for most workloads. Shared CPU caches have existed, but I don’t know of a common system with them. Shared page caches are a very common design pattern in databases (e.g. https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf).
Scaling also means that separate machines can’t reuse resources like database connections.
This is a problem, but mostly a silly problem because of the legacy cost of DB connections. There’s no fundamental reason connections should be expensive or limited at this level.
Distribution also means you have fewer natural correctness guarantees, so you need more administrative overhead to avoid race conditions.
I don’t believe this is true in general. Sometimes coordination is needed, though! There’s a whole body of research about when distributed algorithms require coordination. One example is the CALM theorem (https://arxiv.org/abs/1901.01930), and another is the ways that scalable database systems avoid read coordination (https://brooker.co.za/blog/2025/02/04/versioning.html).
If we know the exact sequence of computations, we can aim to minimize cache misses.
Sure, but it’s not clear why this is possible in a local context and not a distributed one (and, in fact, it may be easier in the distributed context). One example of how it’s easier in the distributed context is snapshotting in MemoryDB (https://brooker.co.za/blog/2024/04/25/memorydb.html).
But then you lose the assumption “recently inserted rows are close together in the index”, which I’ve read can lead to significant slowdowns.
Or significant speed ups because you avoid false sharing!
Maybe there’s also a cultural element to this conflict. What if the engineers interested in “efficiency” are different from the engineers interested in “horizontal scaling”?
This is, of course, a false dichotomy. Distributed systems don’t only (or even primarily, in most cases) exist for scaling, but availability, resilience, durability, business continuity, and other concerns.
To make this purely about scalability is naive.
I’m not sure where this fits in but scaling a volume of tasks conflicts less than scaling individual tasks
Indeed. Coordination avoidance is the fundamental mechanism of scaling. This is visible in CALM, in Amdahl’s law, and in many of the other frameworks for thinking through this space.
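As a concrete reminder of why the coordinated (serial) fraction of the work dominates at scale, here’s Amdahl’s law as a two-line function; the 95% and 99.9% figures below are just illustrative.

```rust
/// Amdahl's law: speedup from running the parallelizable fraction `p`
/// of the work on `n` workers; the remaining (1 - p) stays serial/coordinated.
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    // Even with 95% of the work parallelizable, 1,000 workers only buy ~20x:
    println!("{:.1}x", amdahl_speedup(0.95, 1000.0));  // ≈ 19.6x
    println!("{:.1}x", amdahl_speedup(0.999, 1000.0)); // ≈ 500.2x
}
```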
If you have 1,000 machines and need to crunch one big graph, you probably want the most scalable algorithm. If you instead have 50,000 small graphs, you probably want the most efficient algorithm, which you then run on all 1,000 machines
False dichotomy again. Algorithms can be both efficient and scalable, and the true shape of the trade-off between them is both super interesting and an ongoing research area.
I normally enjoy Hillel’s writing and thoughtfulness, but this post seems like a big miss.
The Rust community, on the other hand, was very dismissive on Reddit and Lobsters.
I think you’re making a mountain out of a molehill.
The crate itself has seen a fair bit of churn: for instance 0.9 broke backwards compatibility with 0.8.
My sense of this is not very calibrated, so can someone weigh in on this? 0.8 was released more than 4 years ago, and 0.9 was released last week.
About a year ago, it looked like this:
[snip]
Not perfect, but better.
What exactly would be perfect? No dependencies at all? I mean, let’s look at what it’s actually depending on there. rand_chacha and rand_core are part of the same project, so they shouldn’t count as extra dependencies. libc is pretty foundational. getrandom is also part of rust-random and I think it’s nice that they have it separated so that people who just want random numbers from the system can do so without needing rand. cfg-if, I guess this should be part of the standard library or something? ppv-lite86 is an implementation of a cryptographic primitive, which seems fair to outsource.
None of those seem like things you’d want to remove.
There is a single-dependency crate too which can read from the system’s entropy source and that’s getrandom. So there at least could be a world where rand only depends on that.
Ok, I guess that confirms it: you think the perfect version of rand would be one that has its own implementation of ChaCha20 and isn’t split into 3 crates. I really don’t see what difference that would make, except to make the raw count of crates lower.
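For anyone who hasn’t looked at the split, here’s roughly what the two layers buy you. This is a sketch only: the exact function names have moved around between getrandom and rand releases (rand 0.9 renamed some of them), so read this as 0.8-era shape rather than gospel.

```rust
// Just want bytes from the OS entropy source? getrandom alone is enough.
fn os_random_u64() -> Result<u64, getrandom::Error> {
    let mut buf = [0u8; 8];
    getrandom::getrandom(&mut buf)?; // getrandom 0.2-style API
    Ok(u64::from_le_bytes(buf))
}

// Want a fast userspace generator (ChaCha-based), distributions, shuffling, and
// so on? That's the extra machinery rand + rand_chacha + ppv-lite86 provide.
fn userspace_random_u64() -> u64 {
    use rand::Rng;
    rand::thread_rng().gen() // rand 0.8-style API
}
```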
I didn’t bother to look at the other crates, but this one stuck out to me as seeming unlikely:
byteorder clocks in at 3,000 lines of code.
So I went and looked at the code. lib.rs is 1587 lines of tests, 1439 lines of comments, and 644 lines of code. io.rs is 1149 lines of comments and 385 lines of code. So 1029 lines of actual code that gets compiled. It’s highly repetitive, implementing the big endian and little endian read and write logic for a bunch of different types. More macros could reduce the line count, but that wouldn’t reduce the amount of code needing to be compiled nor the size of the machine code.
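To give a feel for why the count lands around a thousand lines — this is not byteorder’s actual code, just the shape of it — each integer width gets a big-endian and a little-endian read (and write), roughly like this:

```rust
// Illustrative only: the same few lines repeated per width and endianness are
// what add up, whether written out by hand or expanded from macros.
pub trait ReadEndian {
    fn read_u16_be(&self) -> u16;
    fn read_u16_le(&self) -> u16;
    fn read_u32_be(&self) -> u32;
    fn read_u32_le(&self) -> u32;
    // ... and so on for u64, u128, the signed types, f32, f64, and the writes
}

impl ReadEndian for [u8] {
    fn read_u16_be(&self) -> u16 {
        u16::from_be_bytes(self[..2].try_into().unwrap())
    }
    fn read_u16_le(&self) -> u16 {
        u16::from_le_bytes(self[..2].try_into().unwrap())
    }
    fn read_u32_be(&self) -> u32 {
        u32::from_be_bytes(self[..4].try_into().unwrap())
    }
    fn read_u32_le(&self) -> u32 {
        u32::from_le_bytes(self[..4].try_into().unwrap())
    }
}
```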
Maybe this points to Rust not having a large enough standard library. Perhaps features like terminal size detection and random number generation should be included. That at least is what people pointed out on Twitter.
Why would you put terminal size detection in the standard library? What? Mmmmmaybe getrandom should be included. The criterion right now is that something should be in the standard library if it has to be in the standard library, or if pretty much every program needs it. But actually doing the work takes time. Safe transmute might render many uses of zerocopy obsolete, but safe transmute isn’t done yet!
Also why would you care what people on Twitter say? Most people I’ve seen who engage in dependency discourse over there seem to be brofluencers who spend more time posting hot takes to social media than they actually spend coding.
Perhaps features like terminal size detection and random number generation should be included.
From a long-term perspective, having random number generation not in the standard library seems like the right bet. rand(3) is broken from an implementation quality perspective (on many, but not all, platforms), from a performance perspective (unnecessary synchronization for most applications), and from an API perspective (at least for multi-threaded and forking programs), and it offers surprisingly weak guarantees. Those things may not have been visible to the original creators of rand, though, because they’re mostly due to newer concerns (multi-core, crypto-related stuff we’ve learned over the decades since, etc).
RNG is hard because of crypto, and simulation and other numeric stuff, and all kinds of other reasons that make it very hard to just have one good RNG that solves all problems well. It’s a deep enough problem that there’s even a ton of academic beef about the right way to do it.

Honestly, I think Rust has gotten this right.
Learning (or starting to learn) Lean, and especially following the Mathematics in Lean book. I really like how Lean allows me to build from the very basics up to powerful formalizations. I’m still learning: one of my goals is to formalize the FLP result in Lean, and I haven’t quite completed that yet. The IDE experience of Lean is also really great.
Completing Andrej Karpathy’s Zero to Hero series. I really liked getting hands-on implementing things at this low level, thinking through the trade-offs, etc. I get a lot more out of reading ML papers now. I also spent some time implementing some of the major vector search algorithms, another very worthwhile exercise.
I hadn’t read Jim Gray’s A Transaction Model until this year, despite reading a lot of things that reference it. Truly a great paper.
On the non-work side, I set out to read some books I’d been recommended but was avoiding because I felt they were cliches. Seven Years in Tibet totally wasn’t the book I was expecting, and was absolutely fascinating (is it true? no idea). I also enjoyed Moneyball and Liar’s Poker, but not The Big Short. My top non-fiction of the year was The Shining Mountain. Most disappointing was Herzog’s Annapurna. He comes across as insufferable.
If you’re reading climbing books you might already be aware of it, but I enjoyed the documentary Touching the Void. It’s a gut wrenching story that thankfully ends well (not a spoiler since both climbers are involved in the media).
The movie is based on a book of the same name, if that’s more your fancy. I haven’t read it myself, but it won awards.

The book is great, and well worth a read just as an adventure story.
This article is missing something important about the nature of queues, and why they are so successfully deployed in so many different places.
In “open” systems[1], especially systems which see traffic that’s highly bursty or seasonal, queues allow systems to successfully serve customer traffic during short periods of excess load by adding “enqueue” to the existing options of “serve now” or “reject”. Highly bursty traffic is common (e.g. on a single server, or single network link), especially at smaller scales, and so this is extremely useful.
On the other hand, if the long-term utilization exceeds 1.0 (or even approaches 1.0 [2]), then they just add latency for no benefit. In other words, if the long-term request arrival rate exceeds the long-term service rate, you’re in trouble and queues are just going to bury you under more trouble.
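To make the “approaching 1.0” point concrete, here’s the classic single-queue (M/M/1) mean-latency formula as a tiny sketch. The 10ms service time and the utilization values are illustrative, not from any real system.

```rust
/// Mean time in system for an M/M/1 queue: service_time / (1 - utilization).
/// Only meaningful for utilization < 1.0; beyond that the queue grows without bound.
fn mean_latency_ms(service_time_ms: f64, utilization: f64) -> f64 {
    assert!(utilization < 1.0);
    service_time_ms / (1.0 - utilization)
}

fn main() {
    for rho in [0.5, 0.9, 0.99] {
        // 10ms of work becomes 20ms, 100ms, then 1,000ms as utilization climbs.
        println!("utilization {:.2} -> {:.0} ms", rho, mean_latency_ms(10.0, rho));
    }
}
```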
In “closed” and “hybrid” systems, you can handle these overloads by asking clients to slow down their request rate (as TCP does). In “open” systems, this doesn’t really help: asking a client to come back later is just an implicit queue, because it doesn’t slow down the rate of new clients entering the system (unless some clients “balk”, or just leave once asked to slow down).
As the article says, these slow down signals work less well when queues are involved, because they delay the point in time when you tell the client to slow down to after the damage is done and you’ve built up a big queue you need to drain. Draining the queue takes time, during which (depending on the queue order), the system isn’t able to do new work for more recent requests, making the overload condition last longer (and, potentially, forever).
I think it’s important to think about “open” vs “closed” and “short-term” vs “long-term” when thinking about the role of queues in distributed systems. What’s good for short-term open systems may be bad for the long-term in closed systems. You also need to know whether you’re optimizing for latency (in which case you want to keep utilization <<1.0) or throughput.

[1] In the sense of https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale
[2] See https://brooker.co.za/blog/2021/08/05/utilization.html
there are many other safety properties involving layout or ordering that can be modeled that don’t have anything to do with memory safety.
Temporal memory safety problems seem to be a subset of temporal safety violations more generally. You can think about “allocate” and “free” as events, and a correct protocol implementation needs to order these events appropriately to maintain safety. You can make the whole concurrent aspect of it go away by defining “free” as “replaces current values with random values” or even “replaces current values with adversarially chosen values”.
If you’re thinking of a TLA+-style model of this system, you could have an action that randomized the contents of any memory marked “free” (which is enabled as long as there is some memory marked “free”). Interestingly, this is much easier to model in TLA+'s “shared memory” world than in, say, P’s “message passing” world.
But the overall point is right, terms here are important, and help folks understand.
The other claim I see in this space a lot is that temporal memory safety only matters because of type confusion (i.e. re-using the same memory across different types). That’s also not true.
It’s worth mentioning that since this thesis was written, EC2 time sync has improved by multiple orders of magnitude: https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/ Databases from Amazon (Aurora Limitless and Aurora DSQL) and others (e.g. Yugabyte) are using these clocks in production to improve read performance and avoid co-ordination between components.
In DSQL, high quality clocks allow us to avoid all cross-replica and cross-shard coordination during reads (while still providing strong consistency and isolation). It’s worth noting that all of these systems use some form of hybrid clock - they depend on the physical clock ranges for some properties, and depend on a logical clock approach for other properties (in DSQL read consistency uses physical clock properties, while isolation, durability, and write ordering depend on logical clock properties).
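For readers who haven’t seen a hybrid clock before, here’s the textbook hybrid logical clock (HLC) update rule as a sketch. This is the general idea only, not DSQL’s actual implementation.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Textbook hybrid logical clock: a physical component (microseconds) plus a
/// logical counter that breaks ties and absorbs small clock disagreements.
#[derive(Default, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    physical_us: u64,
    logical: u32,
}

fn now_us() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_micros() as u64
}

impl Hlc {
    /// Called for a local event, or before sending a message.
    fn tick(&mut self) -> Hlc {
        let pt = now_us();
        if pt > self.physical_us {
            self.physical_us = pt;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Called when receiving a message stamped with the sender's HLC.
    fn receive(&mut self, msg: Hlc) -> Hlc {
        let pt = now_us();
        let new_physical = pt.max(self.physical_us).max(msg.physical_us);
        self.logical = if new_physical == self.physical_us && new_physical == msg.physical_us {
            self.logical.max(msg.logical) + 1
        } else if new_physical == self.physical_us {
            self.logical + 1
        } else if new_physical == msg.physical_us {
            msg.logical + 1
        } else {
            0
        };
        self.physical_us = new_physical;
        *self
    }
}
```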
Maybe I’m old-fashioned but my teeny little local clock tends to average about 800 nanoseconds of drift. Even remotely. I’m below a millisecond. That is on the same network. Even on a distributed Network, if it’s within the same view and it’s not too far apart, I don’t see the reason for all this complexity unless they’re like 50 mi apart and then it gets more interesting. I should note I’m using a pulse per second raspberry pi with GPS.
Doing great clocks at the room/lab scale isn’t particularly difficult (although nanoseconds would be rare, even small things like people standing too near a cable can cause issues at those time scales). What’s difficult is doing it at datacenter, region, and global scale. There’s no common view, cable runs are a lot longer, failures more frequent, etc.
I spent a bunch of time at grad school distributing clocks between radar receivers at the “several steps apart” scale, and then at the “hundreds of km apart” scale, and ran into very similar issues (although there we were also using coherence with common view transmitters).
If you are a hyperscaler, you are not using async/await. For a hyperscaler, the cost of managing their server infrastructure may be in the billions. Async/await is an abstraction. You’ll not want to use it. To really get the most out of your performance, you are better off modifying the kernel to your needs.
I don’t think these people are using async/await, and for good reasons.
I obviously can’t speak for all the hyperscalers, but a lot of folks at AWS sure are using async/await, increasingly alongside io_uring to great effect. There’s always money in performance, of course, but one of the great things about the Rust ecosystem is how easy it makes it to get to “really very good” performance.
As a concrete example, when we were building AWS Lambda’s container cache, we built a prototype with hyper and reqwest and tokio talking regular HTTP. We totally expected to need to replace it with something more custom, but found that even the prototype was saturating 50Gb/s NICs and hitting our latency and tail latency targets, so we just left them as-is.
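For flavor, that kind of prototype is more or less this shape of thing: off-the-shelf tokio and reqwest, nothing exotic. This is an illustrative client-side sketch, not the actual service code, and the endpoint is made up.

```rust
// Illustrative only: fetch a batch of objects concurrently with reqwest + tokio.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    let urls: Vec<String> = (0..32)
        .map(|i| format!("https://cache.example.invalid/chunk/{i}")) // hypothetical endpoint
        .collect();

    let mut handles = Vec::new();
    for url in urls {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            // Each request is its own lightweight task; the runtime multiplexes
            // them over a small number of OS threads.
            let body = client.get(url.as_str()).send().await?.bytes().await?;
            Ok::<usize, reqwest::Error>(body.len())
        }));
    }

    let mut total = 0usize;
    for h in handles {
        total += h.await.expect("task panicked")?;
    }
    println!("fetched {total} bytes");
    Ok(())
}
```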
In reality, your OS scheduler does the exact same thing, and probably better.
I think the reality is that, for high-performance high-request-rate applications, a custom scheduler can do much better than Linux’s general purpose scheduler without a whole lot of effort (mostly because it just knows more about the workload and its goals).
You have to be (1) working at a large organization, (2) be working on a custom web server (3) that is highly I/O bound.
Yeah, for sure. But the async model made the code clearer and simpler than the equivalent threaded code would have been, when taking everything into account (especially the need to avoid metastability problems under overload, which generally precludes a naive thread-per-request implementation).
I do appreciate the tail latency angle. There are some (synthetic) benchmarks that show async/await being superior here. (Of course this depends on the async runtime too.) On the other hand, it seems to me too niche a requirement to justify async/await, assuming overload is not a very common situation. I am assuming async/await is a worse experience here. And reading your comment, you did not have that experience.
Services of all sizes need to protect against overload. My understanding is that async/await enables the server to handle as much load as the CPU(s) can handle without having to tune arbitrary numbers (e.g. thread pool size), and allowing the network stack to apply natural backpressure if the load increases beyond that point.
Edit to add: If it seems that designing a service to gracefully handle or prevent overload is a niche concern, perhaps that’s because we tend to throw more hardware at our services than they really ought to need. Or maybe you’ve been lucky enough that you haven’t yet made a mistake in your client software that caused an avoidable overload of your service. I’ve done that, working on a SaaS application for a tiny company.
When using tokio (and this goes for most async runtimes) it is actually not recommended to use async/await for CPU-bound workloads. The docs recommend using spawn_blocking which ends up in a thread pool with a fixed size: https://docs.rs/tokio/latest/tokio/#cpu-bound-tasks-and-blocking-code
When using tokio (and this goes for most async runtimes) it is actually not recommended to use async/await for CPU-bound workloads.
No, that’s about hogging up a bunch of CPU time without yielding. It doesn’t apply if you have tasks that use a lot of CPU in total but yield often, or simply have so many tasks that you saturate your CPU. I’m pretty sure the latter is what @mwcampbell was referring to with “enables the server to handle as much load as the CPU(s) can handle”.
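A small sketch of the distinction being drawn here (illustrative, with made-up work functions): long, non-yielding CPU work goes to spawn_blocking so it doesn’t starve the runtime’s worker threads, while lots of short tasks can saturate the CPUs on the normal runtime just fine.

```rust
use tokio::task;

fn long_noninterruptible_work(data: &[u8]) -> u64 {
    // Stand-in for a big hash/compression/etc. that never awaits.
    data.iter().fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(u64::from(*b)))
}

fn short_unit_of_work(i: u64) -> u64 {
    i.wrapping_mul(2654435761) // stand-in for a small piece of CPU work
}

#[tokio::main]
async fn main() {
    // (1) Long CPU-bound call: run it on the blocking pool so async workers stay free.
    let big = vec![7u8; 8 << 20];
    let digest = task::spawn_blocking(move || long_noninterruptible_work(&big))
        .await
        .expect("blocking task panicked");

    // (2) Many short tasks: these complete (and therefore yield) quickly, so the
    // normal runtime happily drives as many of them as the CPUs can handle.
    let handles: Vec<_> = (0..1024)
        .map(|i| task::spawn(async move { short_unit_of_work(i) }))
        .collect();
    let mut sum = 0u64;
    for h in handles {
        sum = sum.wrapping_add(h.await.expect("task panicked"));
    }
    println!("digest={digest} sum={sum}");
}
```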
Tail latency is extremely important and undervalued. This is why GC languages are unpopular in the limit, for example — managing tail latencies under memory pressure is very difficult.
edit: of all groups of lay people, I think gamers have come to understand this the best. Gamers are quite rightly obsessed with what they call “1% lows” and “0.1% lows”.
As an AWS user, I can say you can saturate S3 GetObject calls with async/await pretty easily as well, to the point where there are a few GitHub issues about it. https://github.com/awslabs/aws-sdk-rust/issues/1136 <– essentially you have to hold your concurrency to between 50 and 200 depending on where you’re situated wrt S3.
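One common way to hold concurrency in that range is a tokio Semaphore around the GetObject calls, sketched below with the Rust SDK. The bucket, keys, limit of 100, and the use of the anyhow crate are all assumptions for illustration, not a recommendation of specific values.

```rust
use std::sync::Arc;
use aws_sdk_s3::Client;
use tokio::sync::Semaphore;

// Hypothetical helper: fetch many keys with a bounded number of in-flight GETs.
async fn fetch_all(client: &Client, bucket: &str, keys: Vec<String>) -> anyhow::Result<u64> {
    let max_in_flight = 100; // somewhere in the 50-200 range discussed above
    let limit = Arc::new(Semaphore::new(max_in_flight));

    let mut handles = Vec::new();
    for key in keys {
        let permit = limit.clone().acquire_owned().await?; // waits if we're at the limit
        let (client, bucket) = (client.clone(), bucket.to_string());
        handles.push(tokio::spawn(async move {
            let _permit = permit; // released when this request finishes
            let object = client.get_object().bucket(bucket).key(key).send().await?;
            let body = object.body.collect().await?;
            Ok::<u64, anyhow::Error>(body.into_bytes().len() as u64)
        }));
    }

    let mut total = 0;
    for h in handles {
        total += h.await??;
    }
    Ok(total)
}
```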
What I find so frustrating about this paper is the fact it doesn’t engage with the long history of replacing TCP in the datacenter, the set of already-deployed solutions, and what we’ve learned from them. Part of that is that the solutions are mostly proprietary (but not secret, e.g. https://ieeexplore.ieee.org/document/9167399), but I can’t explain all of it.
Then it says things like:
Every aspect of TCP’s design is wrong: there is no part worth keeping.
Every aspect? Except the aspects of the design you’ve chosen to keep in your proposed solution?
I feel like if a student presented this at a conference it would get very poorly received, but this is getting a lot of airtime because a famous name is attached to it.
Traditional relational databases are pretty great, but when your use-case is well-served by them depends a great deal not only on the use-case, but also on what you consider to be ‘well served’, and what you consider to be a traditional relational database.
Here are a couple of non-scalability reasons to think about why “traditional” might not be the right approach (or, at least, to question the definition of “traditional”):
Databases tend to be extremely sensitive to cache hit rate for performance. This is sometimes good: high locality workloads can lead to extremely good performance. It’s also a sharp edge. Failing over often means losing the cache, which can take minutes or even hours to rehydrate on a new machine, causing latency issues and possibly metastable failure issues upstream. You can mitigate this by keeping failover candidates’ caches warm (like Aurora), but is that still “traditional”?

Changes in the size or composition of your working set can cause sudden changes in performance, perhaps over multiple orders of magnitude. The size and composition of your working set may not be under your control, and may even be under the control of attackers. You can mitigate this by being able to scale cache in-place, but is that still “traditional”? You can also mitigate it by buying a lot of memory, which is an OK solution but not always the most economical (you’ve just built an ad-hoc in-memory database).
Databases have historically been large and complex attack surfaces, and handle a whole lot of ad-hoc data. Because of this, it’s a good idea to keep them patched and up-to-date, and ideally keep relatively current with new major versions. But patching and upgrading relational databases without risk requires a lot of expertise if you’re doing it yourself. You can mitigate this by picking a database service that manages upgrades for you, but is that still “traditional”? How hard upgrading is tends to be proportional to how many database features you depend on, and proportional to how painful a long period of downtime can be.
Same story with durability. Can you lose data? Should you use a service that provides PITR, backup management, and so on? What’s your ransomware strategy? What’s your logical corruption (“oops, I forgot the WHERE on a DELETE”) strategy? And so on.
Good system designers have a lot to consider in the data space, and would do well to ignore thought-terminating ideas like “traditional”. It may well be that the needs of your business are well met by a simple solution, or that you can accept some operational debt now to pay in future. Or, it may well be that something on the managed service spectrum offers you a ton of operational value. At the very least, consider availability, security, durability, predictability of performance, and (yes) scalability when you pick.
This was a rare but brutal example of how writing non-thread-safe code can cripple your systems.
I’ve come to strongly dislike Java’s “silently allow badness” approach to thread safety, and appreciate Rust’s much more explicit approach (with things like Send+Sync). I introduced a HashMap concurrency bug that sounds a whole lot like the one the poster is talking about into a Java codebase in my first year at AWS (so 2009-ish). Code compiled, code ran fine in test, code ran fine on low-scale machines, code blew up in production (latently!). Ugh.
There are some nice ways Java improved on the thread safety status quo over C and C++ (like a proper memory model and a rich standard library of good concurrent algorithms and primitives), some of which have made their way into those languages. But the fundamental approach is still way too easy to get wrong.
Go is an interesting case here because in one sense it doesn’t improve over Java. It’ll let you make all the mistakes. But, if you follow the conventions for structuring multi-threaded code you can very effectively avoid mistakes. So it’s a kind of half solution, but a very useful one.
It’s easy to sit back and say “oh, I simply wouldn’t have written those bugs”, but that’s a naive view. For example, it assumes that every piece of code you write is either fully thread safe or will never be refactored into a multithreaded context. Without the compiler helping you with that, I don’t know how that’s generally possible. It also assumes that you’re willing to spend a bunch of brainpower keeping track of things that a compiler can keep track of for you, and that seems inefficient at best. Computers are great at “apply these rules everywhere” in a way that humans just aren’t, and I’ve never understood the reluctance of programmers to take advantage of that.
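For anyone who hasn’t seen what “the compiler keeping track of it” looks like in practice, here’s a small sketch: the unsynchronized version of this sharing simply doesn’t compile in Rust (HashMap isn’t magically thread-safe, and RefCell<HashMap> isn’t Sync), so you get pushed toward something explicit like Arc<Mutex<...>>.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // A plain HashMap can't be mutated from two threads at once: the borrow
    // checker won't let two threads hold &mut, and RefCell<HashMap> is !Sync,
    // so the "silently racy" version of this code is rejected at compile time.
    let counts: Arc<Mutex<HashMap<String, u64>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let counts = Arc::clone(&counts);
            thread::spawn(move || {
                let mut map = counts.lock().unwrap();
                *map.entry(format!("worker-{}", i % 2)).or_insert(0) += 1;
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:?}", counts.lock().unwrap());
}
```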
The solution of automatically terminating random instances felt like a terrible engineering practice.
Doesn’t sound like a terrible engineering practice to me. Sounds like a pragmatic solution making the most of the tools available in the environment, which is ultimately what good engineering looks like. You don’t want to gather too many work-arounds like this, because each adds complexity to a system and makes it harder to reason about, but as a short-term mitigation it’s nothing to be embarrassed about at all.
Why not just reboot them? Terminating was faster.
I’ve really come to appreciate the approach of cycling capacity and replacing it with freshly imaged new capacity continuously. There’s an efficiency trade-off on the frequency, but if you pick a good frequency it effectively eliminates whole classes of resource exhaustion problems in cloud systems. There’s a sense of craftsmanship in “our systems shouldn’t have those resource exhaustion bugs in the first place” which isn’t entirely wrong, but does distract from spending the same resources preventing more impactful bugs (like correctness bugs).
The interesting thing about the post is that this was also a correctness bug from what it sounds like, and the explanation for why it’s OK to have that out in production for so long isn’t clear from the post. But it’s not generally true that resource exhaustion bugs are correctness bugs, and hitting them with the heavy hammer of replacing machines (or VMs or MicroVMs) works very well.
Sorry for the random comment, but that brought back a moment quite a while ago (it could’ve been 2007 or 2008): I was called for a performance problem on an ATG e-commerce website and in a couple of hours on the first day I found that they had a droplet which was instantiated on multiple threads, and all of those threads were calling HashMap.containsKey() on a variable also modified by multiple threads… Nothing like that to achieve wizard-like status in the eyes of your client.
I’ve come to strongly dislike Java’s “silently allow badness” approach to thread safety
You’re far from the first. Thread safety is a problem older than most people realize (definitely older than Java). Actors are a very good tried-and-true method for sharing data without race conditions if you don’t have Rust’s ownership/trait system, and you also get the “just restart it” behavior out of the box. Only downside is, of course, everything is runtime defined and you don’t get any help from the compiler in regards to coordinating access like you can with Rust.
Java itself has the third-party Akka framework, but Kotlin and Clojure also have their own built-in systems for highly parallel share-nothing workloads (I know Kotlin has actors, Clojure doesn’t but it’s because it doesn’t need them, but I’m not sure what they use instead.) Of course, everyone should be using anything but Java if targeting the JVM, there’s not really a good excuse to use Java directly anymore (well, there never was, but Sun’s marketing department made sure we got stuck with Java anyway.) Kotlin’s basically done to Java what Rust has done to C++.
Shared-nothing patterns are nice when you can get them, but most real-world systems end up needing to share state (distributed and durable state like work queues and databases, local and ephemeral state like connections and caches, etc). The thread safety bugs come in with that shared state. If you’re in a domain where there’s no shared state, then great, take advantage of that. If you’re in a domain that can be modeled with actors or CSP or whatever, great, take advantage of that. If you’re in a domain where shared state can all be made somebody else’s problem through an interface (e.g. an RDBMS through an ACID transaction interface), then take advantage of that.
But that’s not all applications, and fairly few systems programs, and so lower-level primitives for state sharing are important and useful.
well, there never was, but Sun’s marketing department made sure we got stuck with Java anyway
Java became popular at a time when most of the alternatives were significantly worse, lacking at least one of: memory safety, a real memory model, a rich standard library, a wide variety of open-source and proprietary libraries, built-in monitoring and observability, a sane cross-platform story, or decent performance. There are tons of better choices today (on and off the JVM), but the idea that Java’s only popular because Sun bamboozled us isn’t right.
Kotlin has actors, Clojure doesn’t but it’s because it doesn’t need them, but I’m not sure what they use instead.
Because Clojure data structures are immutable and persistent, concurrency is a lot less painful to deal with than it is in some other languages. I wouldn’t say it is painless but it is pretty good.
If you want something more CSP-like, Clojure also has core.async, which provides lightweight threads that communicate over channels. It looks quite a lot like Go: the lightweight threads are even declared using “go blocks”.
So if I understand this right, the intended usage is that you model the expected behavior, you instrument your real code to produce the traces and then you deploy it or fuzz it to get a lot of traces you can check for linearizability violations?
In practice what’s the easiest way to capture the traces? Just log some JSON? Derive it from otel spans?
You do need to ensure that whatever system records the history of events is itself Linearizable (hand-waving a bit here; you can actually get away with slightly weaker properties). That could be a single-node program which records invocations and completions in a Linearizable in-memory structure. Or you could journal them (synchronously!) to a Linearizable database before invocation and after completion, in production.
What you can’t do is, say, stamp log lines or OTEL events with local node clocks–those aren’t necessarily synchronized, and the checker might tell you your program violated Linearizability when it didn’t. You also can’t send them to a database asynchronously, because the message recording the invocation might be delayed and arrive at the database after the operation was actually performed. You also can’t use a Serializable or Causally consistent database, because it might fail to record your events in real-time order. A Sequential log is OK, but the scope of the Sequential object has to be the entire log; otherwise you could (you guessed it!) get reorderings.
Another thing to be aware of is that if you log to the same datastore that’s under test, you could influence the results. For example, if your data store has an issue with stale reads in read-only transactions, and you add an “insert log line” between every read transaction down the same connection, you could see the bug become invisible.
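Here’s a minimal sketch of the “single-node program which records invocations and completions in a Linearizable in-memory structure” idea mentioned above: one mutex-protected log that every client appends to, where the lock (not any clock) provides the order. A real harness would also record operation arguments, indices, and so on. The important discipline is calling invoke() synchronously before issuing the operation, and complete() only after it has returned.

```rust
use std::sync::Mutex;

// Every client invocation/completion is appended to one mutex-protected log.
// The position in the log is what gives us a single, real-time-consistent
// order of events to feed to a checker.
#[derive(Debug)]
enum Event {
    Invoke { client: usize, op: String },
    Complete { client: usize, result: String },
}

#[derive(Default)]
struct History {
    events: Mutex<Vec<Event>>,
}

impl History {
    fn invoke(&self, client: usize, op: &str) {
        self.events.lock().unwrap().push(Event::Invoke {
            client,
            op: op.to_string(),
        });
    }

    fn complete(&self, client: usize, result: &str) {
        self.events.lock().unwrap().push(Event::Complete {
            client,
            result: result.to_string(),
        });
    }
}
```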
I mean… no? Imagine you detected a Linearizability violation in a history recorded using timestamps on, say, Linux machines. From this you could conclude that either a.) the system being measured violated Linearizability, or b.) the history-collecting system had, at some time, poor synchronization; or the VM paused; or the OS paused; or the application paused; and so on. It doesn’t allow you to make a provable claim of correctness or incorrectness.
If you’re talking about something like EC2 timesync, where clock errors are ~50us, and your storage system is remote and best-case a couple hundred microseconds, and you’re careful about which end of the clock bound to pick, you can still get good results.
NTP in general, probably not, unless you’re very careful about the way you set it up and have great time infrastructure.
would it work to use lamport clocks? or stamp messages with the ones they’ve seen and then topologically sort (functionally the same thing but different)?
Sort of. You can use messages between nodes and a strategy like Lamport clocks to establish a partial order which is consistent (because messages always flow forwards in time) with the real-time order. However, you’d fail to detect violations of Linearizability over timescales shorter than the message propagation interval.
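For reference, here’s a minimal Lamport clock. The point is that the only orderings it captures are the ones carried by messages, which is why violations that play out faster than messages propagate can slip through.

```rust
/// Minimal Lamport clock sketch: two operations that overlap in real time but
/// never communicate (directly or transitively) may be ordered either way.
#[derive(Default)]
struct LamportClock {
    counter: u64,
}

impl LamportClock {
    fn local_event(&mut self) -> u64 {
        self.counter += 1;
        self.counter
    }

    fn send(&mut self) -> u64 {
        self.local_event() // timestamp attached to the outgoing message
    }

    fn receive(&mut self, msg_timestamp: u64) -> u64 {
        self.counter = self.counter.max(msg_timestamp) + 1;
        self.counter
    }
}
```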
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs? are there bugs that are still systematically more likely to be caught or can only be caught that way? (it seems not very nice because it sacrifices scalability)
IIUC, the difference between sequential consistency and linearisability is effectively that linearisability is consistent with communication ‘outside the system’ where seqcst doesn’t have to be. and lamport clocks or causal ordering should not miss sequential consistency violations. but if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too? (i think this is similar to using a linearisable logging system, except that you don’t record all the ordering relationships—so you don’t get to full linearisability—only some of them, but what you do record you get in a distributed way)
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs?
Yep! It’s actually impossible to measure precisely. Jepsen does this by putting all clients on a single node, allowing fast connections between clients and servers, and (optionally) adding latency between servers; this lets it see errors “faster than light”, so to speak.
(it seems not very nice because it sacrifices scalability)
In general, I wouldn’t worry about scalability. Linearizability checking is NP-hard and relies on total histories–both of those are already bad for scalability. Even if that weren’t the case, you’ll get the most bang for your buck testing in a controlled environment where you can induce failures, rather than in production, where failures are (I hope!?) comparatively infrequent.
Lamport clocks or causal ordering should not miss sequential consistency violations.
I’m not sure what this means exactly, but I should note that Causal is, in general, weaker than Sequential.
But if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too?
Sure, you could do that, but then the property you’re measuring is something like Causal or Sequential, rather than Linearizable.
Yes you’ve got it! If the project is in Go you could just write the fuzzer and test harness in Go. Otherwise yes you’d want to write some intermediate format from your system and then write a Go program to convert it for Porcupine.
People are asking what formal methods Amazon is using. My understanding (as a non-Amazon person) is that the answer is: “a bit of everything”. They have been hiring people with expertise in automated theorem provers (the main Z3 author moved to Amazon), but also verified-programming systems like Dafny (the main author moved to Amazon), proof assistants, several of them (Hol Light, Coq, Lean), modelling tools like TLA+, etc. A bit of everything. From the outside it feels like they have many different (sub)teams interested in formalization problems or many different sub-problems that justify them, and people have a lot of freedom to pick the approach they think is best as long as it delivers results. They are also funding collaboration with academia to try out other approaches, new tools, etc.
There are more details, for example, in this PDF document which comes from one of those teams, and mentions again many different tools for formal verification.
Assuming you’re referring to Leonardo de Moura, he was involved in the early days from 2012-2014 (self-describes as the main architect) but since then he moved onto developing Lean and Nikolaj Bjørner has been the principal developer along with Lev Nachmanson and Christoph Wintersteiger.
A historical note: Z3 is older than that. The classic paper on it places its first release in 2007, and it’s been leading the SMT competition for a very long time. I expect that development must have started somewhere between 2003 and 2005, but don’t quote me on that. Work on Lean started in 2013 so that must be roughly when Leo de Moura started working on it.
Yes, that’s pretty accurate. Bit of everything, whatever we need to solve the business problem.
In addition, we’ve got some internal tools that aren’t available externally (yet), and an increasing investment in combining AR techniques with neural/AI/LLM/whatever you wanna call it techniques. Missing from your list on the modelling tools side is P (https://p-org.github.io/P/whatisP/), which is also primarily developed at Amazon right now, and is widely used internally for distributed systems stuff.
From the outside it feels like they have many different (sub)teams interested in formalization problems or many different sub-problems that justify them, and people have a lot of freedom to pick the approach they think is best as long as it delivers results.
This has, historically, been AWS’s approach to this kind of “researchy” work. We’ve really tried to optimize for technology transfer, making sure the benefits of the work end up in production and benefiting customers, over some other concerns. I think that’s been very successful on the whole, although isn’t without its trade-offs compared to other models.
Edits are done, thank you again :)
There are many ways to divide up the field of software. One way is between tasks where “specification first” speeds things up, and tasks where that slows things down. Many systems- and infrastructure-level tasks are in the first bucket, and most UI, frontend, and crud tasks are in the latter bucket. Many things are also a mix of the two, with some places where specifications are known up-front, and some where they are discovered during development.
Most of the heat in the world of TDD (which is one flavor of specification first) comes from folks working on these two classes of problems talking past each other. This article seems like another example of that.
This doesn’t match my experience working with junior developers. It just seems to be that new folks always have a lot to learn, and the amount of knowledge that’s needed to be productive is going up over time.
There are reasons to be concerned about the effect that tools have on the way folks learn, and how raising the level of abstraction reduces the need to learn lower-level details, but this post just seems to be overstating things significantly.
(Extended version of my comment on this on HN)
The line between (1) and (3) is kinda arbitrary, at least from an algorithmic perspective, in that there’s a significant overlap between the best approaches at the high end of (1) and at the tightly-coupled end of (3).
You should meet my dearest enemy, NUMA. Also all kinds of things with cache hierarchies in modern multi-core and multi-socket, and SMT (another enemy of mine), and cache coherence limits, and hidden limits like memory bandwidth, and so on and so on.
Sure they can. Shared disk is common (NFS, Lustre, etc). Shared RAM is fairly common, although less so, and not necessarily good for most workloads. Shared CPU caches have existed, but I don’t know of a common system with them. Shared page caches are a very common design pattern in databases (e.g. https://www.cidrdb.org/cidr2023/papers/p50-ziegler.pdf).
This is a problem, but mostly a silly problem because of the legacy cost of DB connections. There’s no fundamental reason connections should be expensive or limited at this level.
Sometimes! There’s a whole body of research about when distributed algorithms require coordination. One example is the CALM theorem (https://arxiv.org/abs/1901.01930), and another is the ways that scalable database systems avoid read coordination (https://brooker.co.za/blog/2025/02/04/versioning.html).
I don’t believe this is true.
Sure, but it’s not clear why this is possible in a local context and not a distributed one (and, in fact, in may be easier in the distributed context). One example of how it’s easier in the distributed context is snapshotting in MemoryDB (https://brooker.co.za/blog/2024/04/25/memorydb.html).
Or significant speed ups because you avoid false sharing!
This is, of course, a false dichotomy. Distributed systems don’t only (or even primarily, in most cases) exist for scaling, but availability, resilience, durability, business continuity, and other concerns.
To make this purely about scalability is naive.
Indeed. Coordination avoidance is the fundamental mechanism of scaling. This is visible in CALM, in Amdahl’s law, and in many of the other frameworks for thinking through this space.
False dichotomy again. Algorithms can be both efficient and scalable, and the true shape of the trade-off between them is both super interesting and an ongoing research area.
I normally enjoy Hillel’s writing and thoughtfulness, but this post seems like a big miss.
I think you’re making a mountain out of a molehill.
My sense of this is not very calibrated, so can someone weigh in on this? 0.8 was released more than 4 years ago, and 0.9 was released last week.
What exactly would be perfect? No dependencies at all? I mean, let’s look at what it’s actually depending on there.
rand_chachaandrand_coreare part of the same project, so they shouldn’t count as extra dependencies.libcis pretty foundational.getrandomis also part of rust-random and I think it’s nice that they have it separated so that people who just want random numbers from the system can do so without needingrand.cfg-if, I guess this should be part of the standard library or something?ppv-lite86is an implementation of a cryptographic primitive, which seems fair to outsource.None of those seem like things you’d want to remove.
Ok, I guess that confirms that, you think the perfect version of
randwould be one that has its own implementation of ChaCha20 and which isn’t split into 3 crates. I really don’t see what difference that would make, except make the raw count of crates lower.I didn’t bother to look at the other crates, but this one stuck out to me as seeming unlikely:
So I went and looked at the code.
lib.rsis 1587 lines of tests, 1439 lines of comments, and 644 lines of code.io.rsis 1149 lines of comments and 385 lines of code. So 1029 lines of actual code that gets compiled. It’s highly repetitive, implementing the big endian and little endian read and write logic for a bunch of different types. More macros could reduce the line count, but that wouldn’t reduce the amount of code needing to be compiled nor the size of the machine code.Why would you put terminal size detection in the standard library? What? Mmmmmaybe getrandom should be included. The criteria right now is that something should be in the standard library if it has to be in the standard library, or if pretty much every program needs it. But actually doing the work takes time. Safe transmute might render many uses of zerocopy obsolete, but safe transmute isn’t done yet!
Also why would you care what people on Twitter say? Most people I’ve seen who engage in dependency discourse over there seem to be brofluencers who spend more time posting hot takes to social media than they actually spend coding.
From a long-term perspective, having random number generation not in the standard library seems like the right bet.
rand(3)is broken, both from an implementation quality perspective (on many, but not all, platforms), from a performance perspective (unnecessary synchronization for most applications), and from an API perspective (at least for multi-threaded and forking programs), and surprisingly weak guarantees. Those things may not have been visible to the original creators ofrand, though, because they’re mostly due to newer concerns (multi-core, crypto-related stuff we’ve learned over the decades since, etc).RNG is hard because of crypto, and simulation and other numeric stuff, and all kinds of other reasons that make it very hard to just have one good RNG that solves all problems well. It’s a deep enough problem that there’s even a ton of academic beef about the right way to do it.
Honestly, I think Rust has gotten this right.
Learning (or starting to learn) lean and especially following the mathematics in lean book. I really like how Lean allows me to build from the very basics up to powerful formalizations. I’m still learning: one of my goals is to formalize the FLP result in Lean, and I haven’t quite completed that yet. The IDE experience of Lean is also really great.
Completing Andrej Karpathy's Zero to Hero series. I really liked getting hands-on implementing things at this low level, thinking through the trade-offs, etc. I get a lot more out of reading ML papers now. I also spent some time implementing some of the major vector search algorithms, another very worthwhile exercise.
I hadn’t read Jim Gray’s A Transaction Model until this year, despite reading a lot of things that reference it. Truly a great paper.
On the non-work side, I set out to read some books I’d been recommended but was avoiding because I felt they were cliches. Seven Years in Tibet totally wasn’t the book I was expecting, and was absolutely fascinating (is it true? no idea). I also enjoyed Moneyball and Liar’s Poker, but not The Big Short. My top non-fiction of the year was The Shining Mountain. Most disappointing was Herzog’s Annapurna. He comes across as insufferable.
If you're reading climbing books you might already be aware of it, but I enjoyed the documentary Touching the Void. It's a gut-wrenching story that thankfully ends well (not a spoiler since both climbers are involved in the media).
The movie is based on a book of the same name if that's more your fancy; I haven't read it myself, but it won awards.
The book is great, and well worth a read just as an adventure story.
This article is missing something important about the nature of queues, and why they are so successfully deployed in so many different places.
In “open” systems[1], especially systems which see traffic that's highly bursty or seasonal, queues allow systems to successfully serve customer traffic during short periods of excess load by adding “enqueue” to the existing options of “serve now” or “reject”. Highly bursty traffic is common (e.g. on a single server, or single network link), especially at smaller scales, and so this is extremely useful.
On the other hand, if the long-term utilization exceeds 1.0 (or even approaches 1.0 [2]), then they just add latency for no benefit. In other words, if the long-term request arrival rate exceeds the long-term service rate, you’re in trouble and queues are just going to bury you under more trouble.
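To put a rough number on that intuition (my toy example, not a claim about any particular system): in a simple M/M/1 model the mean time in system is 1/(mu - lambda), which explodes as utilization approaches 1.0.

```rust
// Toy M/M/1 illustration: W = 1 / (mu - lambda). As rho = lambda/mu -> 1.0,
// the queue stops absorbing bursts and just adds latency.
fn main() {
    let mu = 1000.0; // service rate, requests per second
    for rho in [0.5, 0.8, 0.9, 0.95, 0.99] {
        let lambda = rho * mu; // arrival rate
        let w = 1.0 / (mu - lambda); // mean time in system, in seconds
        println!("utilization {rho:.2} -> mean latency {:.1} ms", w * 1000.0);
    }
}
```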
In “closed” and “hybrid” systems, you can handle these overloads by asking clients to slow down their request rate (as TCP does). In “open” systems, this doesn’t really help: asking a client to come back later is just an implicit queue, because it doesn’t slow down the rate of new clients entering the system (unless some clients “balk”, or just leave once asked to slow down).
As the article says, these slow down signals work less well when queues are involved, because they delay the point in time when you tell the client to slow down to after the damage is done and you’ve built up a big queue you need to drain. Draining the queue takes time, during which (depending on the queue order), the system isn’t able to do new work for more recent requests, making the overload condition last longer (and, potentially, forever).
I think it’s important to think about “open” vs “closed” and “short-term” vs “long-term” when thinking about the role of queues in distributed systems. What’s good for short-term open systems may be bad for the long-term in closed systems. You also need to know whether you’re optimizing for latency (in which case you want to keep utilization <<1.0) or throughput.
[1] In the sense of https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale
[2] See https://brooker.co.za/blog/2021/08/05/utilization.html
Temporal memory safety problems seem to be a subset of temporal safety violations more generally. You can think about “allocate” and “free” as events, and a correct protocol implementation needs to order these events appropriately to implement safety correctly. You can make the whole concurrent aspect of it go away by defining “free” as “replaces current values with random values” or even “replaces current values with adversarially chosen values”.
If you're thinking of a TLA+-style model of this system, you could have an action that randomized the contents of any memory marked “free” (which is enabled as long as there is some memory marked “free”). Interestingly, this is much easier to model in TLA+'s “shared memory” world than in, say, P's “message passing” world.
But the overall point is right: terms here are important, and they help folks understand.
The other claim I see in this space a lot is that temporal memory safety only matters because of type confusion (i.e. re-using the same memory across different types). That’s also not true.
Another great blog from Murat.
It’s worth mentioning that since this thesis was written, EC2 time sync has improved by multiple orders of magnitude: https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/ Databases from Amazon (Aurora Limitless and Aurora DSQL) and others (e.g. Yugabyte) are using these clocks in production to improve read performance and avoid co-ordination between components.
In DSQL, high-quality clocks allow us to avoid all cross-replica and cross-shard coordination during reads (while still providing strong consistency and isolation). It's worth noting that all of these systems use some form of hybrid clock - they depend on the physical clock ranges for some properties, and depend on a logical clock approach for other properties (in DSQL read consistency uses physical clock properties, while isolation, durability, and write ordering depend on logical clock properties).
For a classic (1991!) look at the topic, check out Barbara Liskov’s Practical Uses of Synchronized Clocks in Distributed Systems (https://dl.acm.org/doi/pdf/10.1145/112600.112601).
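For readers who haven't bumped into hybrid clocks before, here's a minimal sketch of one well-known flavor, the hybrid logical clock (HLC) of Kulkarni et al. This is just the textbook update rule, not how DSQL or any of the systems above actually implement their clocks.

```rust
// Sketch of a Hybrid Logical Clock: track the max physical time observed
// plus a logical counter, so timestamps stay close to wall-clock time while
// still respecting causality (receive > send) even under clock skew.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Hlc {
    l: u64, // max physical time seen, e.g. microseconds
    c: u32, // logical counter to order events within the same physical tick
}

impl Hlc {
    fn new() -> Self {
        Hlc { l: 0, c: 0 }
    }

    // Local or send event: advance to the physical clock if it moved forward,
    // otherwise bump the logical counter.
    fn tick(&mut self, physical_now: u64) -> Hlc {
        if physical_now > self.l {
            self.l = physical_now;
            self.c = 0;
        } else {
            self.c += 1;
        }
        *self
    }

    // Receive event: merge the sender's timestamp so our new timestamp is
    // greater than both what we and the sender have seen.
    fn update(&mut self, msg: Hlc, physical_now: u64) -> Hlc {
        let l_new = physical_now.max(self.l).max(msg.l);
        self.c = if l_new == self.l && l_new == msg.l {
            self.c.max(msg.c) + 1
        } else if l_new == self.l {
            self.c + 1
        } else if l_new == msg.l {
            msg.c + 1
        } else {
            0
        };
        self.l = l_new;
        *self
    }
}

fn main() {
    let mut a = Hlc::new();
    let mut b = Hlc::new();
    let t1 = a.tick(1_000);     // event on node A at physical time 1000
    let t2 = b.update(t1, 999); // B receives A's message; B's clock lags
    assert!(t2 > t1);           // causal order preserved despite skew
    println!("t1={t1:?} t2={t2:?}");
}
```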
Maybe I'm old-fashioned, but my teeny little local clock tends to average about 800 nanoseconds of drift. Even remotely I'm below a millisecond, and that's on the same network. Even on a distributed network, if it's within the same view and it's not too far apart, I don't see the reason for all this complexity unless they're like 50 mi apart, and then it gets more interesting. I should note I'm using a pulse-per-second Raspberry Pi with GPS.
Doing great clocks at the room/lab scale isn’t particularly difficult (although nanoseconds would be rare, even small things like people standing too near a cable can cause issues at those time scales). What’s difficult is doing it at datacenter, region, and global scale. There’s no common view, cable runs are a lot longer, failures more frequent, etc.
I spent a bunch of time at grad school distributing clocks between radar receivers at the “several steps apart” scale, and then at the “hundreds of km apart” scale, and ran into very similar issues (although there we were also using coherence with common view transmitters).
I obviously can't speak for all the hyperscalers, but a lot of folks at AWS sure are using `async`/`await`, increasingly alongside `io_uring`, to great effect. There's always money in performance, of course, but one of the great things about the Rust ecosystem is how easy it makes it to get to “really very good” performance.

As a concrete example, when we were building AWS Lambda's container cache, we built a prototype with `hyper` and `reqwest` and `tokio` talking regular HTTP. We totally expected to need to replace it with something more custom, but found that even the prototype was saturating 50Gb/s NICs and hitting our latency and tail latency targets, so we just left them as-is.

I think the reality is that, for high-performance high-request-rate applications, a custom scheduler can do much better than Linux's general-purpose scheduler without a whole lot of effort (mostly because it just knows more about the workload and its goals).
This doesn’t seem right either.
Do you think you would have been able to saturate the NIC using the traditional OS threads model (no async/await)?
Yeah, for sure. But the `async` model made the code clearer and simpler than the equivalent threaded code would have been, when taking everything into account (especially the need to avoid metastability problems under overload, which generally precludes a naive thread-per-request implementation).

I do appreciate the tail latency angle. There are some (synthetic) benchmarks that show async/await being superior here. (Of course this depends on the async runtime too.) On the other hand, it seems to me too niche a requirement to justify async/await, assuming overload is not a very common situation. I am assuming async/await is a worse experience here, though reading your comment, you did not have that experience.
Services of all sizes need to protect against overload. My understanding is that async/await enables the server to handle as much load as the CPU(s) can handle without having to tune arbitrary numbers (e.g. thread pool size), and allowing the network stack to apply natural backpressure if the load increases beyond that point.
Edit to add: If it seems that designing a service to gracefully handle or prevent overload is a niche concern, perhaps that’s because we tend to throw more hardware at our services than they really ought to need. Or maybe you’ve been lucky enough that you haven’t yet made a mistake in your client software that caused an avoidable overload of your service. I’ve done that, working on a SaaS application for a tiny company.
When using tokio (and this goes for most async runtimes), it is actually not recommended to use async/await for CPU-bound workloads. The docs recommend using `spawn_blocking`, which ends up in a thread pool with a fixed size: https://docs.rs/tokio/latest/tokio/#cpu-bound-tasks-and-blocking-code

No, that's about hogging up a bunch of CPU time without yielding. It doesn't apply if you have tasks that use a lot of CPU in total but yield often, or simply have so many tasks that you saturate your CPU. I'm pretty sure the latter is what @mwcampbell was referring to with “enables the server to handle as much load as the CPU(s) can handle”.
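A minimal sketch of that distinction with tokio (illustrative workloads, not a benchmark): long, non-yielding CPU work goes to `spawn_blocking`, while lots of small tasks that yield can happily keep the CPUs busy on the async workers.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Long, non-yielding CPU work: move it off the async worker threads so it
    // doesn't stall other tasks on the runtime.
    let digest = tokio::task::spawn_blocking(|| {
        (0u64..50_000_000).fold(0u64, |acc, x| acc.wrapping_add(x * x))
    })
    .await
    .unwrap();

    // Many small tasks that await frequently are fine on the async runtime;
    // in aggregate they can still drive the CPUs as hard as you like.
    let mut handles = Vec::new();
    for i in 0..1_000 {
        handles.push(tokio::spawn(async move {
            tokio::time::sleep(Duration::from_millis(1)).await; // yields
            i * 2
        }));
    }
    let mut sum = 0;
    for h in handles {
        sum += h.await.unwrap();
    }

    println!("digest={digest}, sum={sum}");
}
```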
Tail latency is extremely important and undervalued. This is why GC languages are unpopular in the limit, for example — managing tail latencies under memory pressure is very difficult.
edit: of all groups of lay people, I think gamers have come to understand this the best. Gamers are quite rightly obsessed with what they call “1% lows” and “0.1% lows”.
As an AWS user, I can say you can saturate S3 GetObject calls with async/await pretty easily as well, to the point where there are a few GitHub issues about it. https://github.com/awslabs/aws-sdk-rust/issues/1136 <- essentially you have to hold your concurrency to between 50-200 depending on where you're situated wrt S3.
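The usual mitigation is a concurrency limiter in front of the calls. Here's a minimal sketch with tokio's `Semaphore`; `get_object` is a hypothetical stand-in, not the aws-sdk-rust API.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Hypothetical stand-in for an S3 GetObject call through whichever SDK you use.
async fn get_object(key: String) -> Vec<u8> {
    tokio::time::sleep(Duration::from_millis(20)).await;
    key.into_bytes()
}

#[tokio::main]
async fn main() {
    // Cap in-flight requests; ~50-200 is the range mentioned above, tuned for
    // where you sit relative to S3.
    let limit = Arc::new(Semaphore::new(100));
    let mut handles = Vec::new();

    for i in 0..10_000 {
        let limit = Arc::clone(&limit);
        handles.push(tokio::spawn(async move {
            // The permit is held for the duration of the request, then released.
            let _permit = limit.acquire_owned().await.unwrap();
            get_object(format!("key-{i}")).await.len()
        }));
    }

    let mut total = 0usize;
    for h in handles {
        total += h.await.unwrap();
    }
    println!("fetched {total} bytes");
}
```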
What I find so frustrating about this paper is the fact it doesn’t engage with the long history of replacing TCP in the datacenter, the set of already-deployed solutions, and what we’ve learned from them. Part of that is that the solutions are mostly proprietary (but not secret, e.g. https://ieeexplore.ieee.org/document/9167399), but I can’t explain all of it.
Then it says things like:
Every aspect? Except the aspects of the design you’ve chosen to keep in your proposed solution?
I feel like if a student presented this at a conference it would be very poorly received, but this is getting a lot of airtime because a famous name is attached to it.
Yeah, given that the paper is really about how to do RPCs as efficiently as possible, maybe another framing would have been better.
I wrote up some thoughts on the PRFAQ too: https://brooker.co.za/blog/2024/11/14/lambda-ten-years.html
Writing docs like these PRFAQs is a really hard thing to do well, and I think this one is especially well written and clear.
The only thing wrong with this statement is the ‘this is a personal opinion’ part.
Well, that's a big *if*, isn't it?

Traditional relational databases are pretty great, but whether your use-case is well-served by them depends a great deal not only on the use-case, but also on what you consider to be ‘well served’, and what you consider to be a traditional relational database.
Here are a couple of non-scalability reasons to think about why “traditional” might not be the right approach (or, at least, to question the definition of “traditional”):
Databases tend to be extremely sensitive to cache hit rate for performance. This is sometimes good: high-locality workloads can lead to extremely good performance. It's also a sharp edge. Failing over often means losing the cache, which can take minutes or even hours to rehydrate on a new machine, causing latency issues and possibly metastable failure issues upstream. You can mitigate this by keeping failover candidates' caches warm (like Aurora), but is that still “traditional”?

Changes in the size or composition of your working set can cause sudden changes in performance, perhaps over multiple orders of magnitude. The size and composition of your working set may not be under your control, and may even be under the control of attackers. You can mitigate this by being able to scale cache in-place, but is that still “traditional”? You can also mitigate it by buying a lot of memory, which is an OK solution but not always the most economical (you've just built an ad-hoc in-memory database).
Databases have historically been large and complex attack surfaces, and handle a whole lot of ad-hoc data. Because of this, it’s a good idea to keep them patched and up-to-date, and ideally keep relatively current with new major versions. But patching and upgrading relational databases without risk requires a lot of expertise if you’re doing it yourself. You can mitigate this by picking a database service that manages upgrades for you, but is that still “traditional”? How hard upgrading is tends to be proportional to how many database features you depend on, and proportional to how painful a long period of downtime can be.
Same story with durability. Can you lose data? Should you use a service that provides PITR and backup management and so on? What's your ransomware strategy? What's your logical corruption (“oops, I forgot the WHERE on a DELETE”) strategy? And so on.
Good system designers have a lot to consider in the data space, and would do well to ignore thought-terminating ideas like “traditional”. It may well be that the needs of your business are well met by a simple solution, or that you can accept some operational debt now and pay it down in future. Or, it may well be that something on the managed service spectrum offers you a ton of operational value. At the very least, consider availability, security, durability, predictability of performance, and (yes) scalability when you pick.
Honestly? No, not really. Everything you described is just part of having a database, relational or not.
I've come to strongly dislike Java's “silently allow badness” approach to thread safety, and appreciate Rust's much more explicit approach (with things like `Send` + `Sync`). I introduced a `HashMap` concurrency bug that sounds a whole lot like the one the poster is talking about into a Java codebase in my first year at AWS (so 2009-ish). Code compiled, code ran fine in test, code ran fine on low-scale machines, code blew up in production (latently!). Ugh.

There are some nice ways Java improved on the thread safety status quo over C and C++ (like a proper memory model and a rich standard library of good concurrent algorithms and primitives), some of which have since made their way into those languages. But the fundamental approach is still way too easy to get wrong.
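For contrast, a minimal sketch of what Rust makes you write to share a map across threads; drop the `Mutex` (or the `Arc`) and it simply doesn't compile, which is essentially the bug above caught at build time.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // A bare HashMap can't be mutated from multiple threads; Send/Sync and the
    // borrow checker force the sharing and locking to be explicit.
    let map = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..4)
        .map(|i| {
            let map = Arc::clone(&map);
            thread::spawn(move || {
                map.lock().unwrap().insert(i, i * i);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    println!("{:?}", map.lock().unwrap());
}
```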
Go is an interesting case here because in one sense it doesn’t improve over Java. It’ll let you make all the mistakes. But, if you follow the conventions for structuring multi-threaded code you can very effectively avoid mistakes. So it’s a kind of half solution, but a very useful one.
It's easy to sit back and say “oh, I simply wouldn't have written those bugs”, but that's a naive view. For example, it assumes that every piece of code you write is either fully thread safe or will never be refactored into a multithreaded context. Without the compiler helping you with that, I don't know how that's generally possible. It also assumes that you're willing to spend a bunch of brainpower keeping track of things that a compiler can keep track of for you, and that seems inefficient at best. Computers are great at “apply these rules everywhere” in a way that humans just aren't, and I've never understood the reluctance of programmers to take advantage of that.
Doesn't sound like a terrible engineering practice to me. Sounds like a pragmatic solution making the most of the tools available in the environment, which is ultimately what good engineering looks like. You don't want to gather too many work-arounds like this, because each adds complexity to a system and makes it harder to reason about, but as a short-term mitigation it's nothing to be embarrassed about at all.
I’ve really come to appreciate the approach of cycling capacity and replacing it with freshly imaged new capacity continuously. There’s an efficiency trade-off on the frequency, but if you pick a good frequency it effectively eliminates whole classes of resource exhaustion problems in cloud systems. There’s a sense of craftsmanship in “our systems shouldn’t have those resource exhaustion bugs in the first place” which isn’t entirely wrong, but does distract from spending the same resources preventing more impactful bugs (like correctness bugs).
The interesting thing about the post is that, from what it sounds like, this was also a correctness bug, and the explanation for why it's OK to have that out in production for so long isn't clear from the post. But it's not generally true that resource exhaustion bugs are correctness bugs, and hitting them with the heavy hammer of replacing machines (or VMs or MicroVMs) works very well.
Sorry for the random comment, but that brought back a moment quite a while ago (it could’ve been 2007 or 2008): I was called for a performance problem on an ATG e-commerce website and in a couple of hours on the first day I found that they had a droplet which was instantiated on multiple threads, and all of those threads were calling HashMap.containsKey() on a variable also modified by multiple threads… Nothing like that to achieve wizard-like status in the eyes of your client.
You're far from the first. Thread safety is a problem older than most people realize (definitely older than Java). Actors are a very good, tried-and-true method for sharing data without race conditions if you don't have Rust's ownership/trait system, and you also get the “just restart it” behavior out of the box. The only downside is, of course, that everything is runtime-defined and you don't get any help from the compiler in coordinating access like you can with Rust.
Java itself has the third-party Akka framework, but Kotlin and Clojure also have their own built-in systems for highly parallel share-nothing workloads (I know Kotlin has actors; Clojure doesn't, but that's because it doesn't need them, though I'm not sure what it uses instead). Of course, everyone should be using anything but Java if targeting the JVM; there's not really a good excuse to use Java directly anymore (well, there never was, but Sun's marketing department made sure we got stuck with Java anyway). Kotlin's basically done to Java what Rust has done to C++.
Shared-nothing patterns are nice when you can get them, but most real-world systems end up needing to share state (distributed and durable state like work queues and databases, local and ephemeral state like connections and caches, etc). The thread safety bugs come in with that shared state. If you're in a domain where there's no shared state, then great, take advantage of that. If you're in a domain that can be modeled with actors or CSP or whatever, great, take advantage of that. If you're in a domain where shared state can all be made somebody else's problem through an interface (e.g. an RDBMS through an ACID transaction interface), then take advantage of that.
But that’s not all applications, and fairly few systems programs, and so lower-level primitives for state sharing are important and useful.
Java became popular at a time when most of the alternatives were significantly worse, lacking at least one of: memory safety, a real memory model, a rich standard library, a wide variety of open-source and proprietary libraries, built-in monitoring and observability, a sane cross-platform story, or decent performance. There are tons of better choices today (on and off the JVM), but the idea that Java's only popular because Sun bamboozled us isn't right.
Because Clojure data structures are immutable and persistent, concurrency is a lot less painful to deal with than it is in some other languages. I wouldn’t say it is painless but it is pretty good.
If you want something more CSP-like, Clojure also has core.async, which provides lightweight threads that communicate over channels. It looks quite a lot like Go: the lightweight threads are even declared using “go blocks”.
(Edit: removed conflation of actors and channels)
Nit: channels and actors are notably different.
Mixing Metaphors: Actors as Channels and Channels as Actors, Simon Fowler, Sam Lindley, and Philip Wadler, 2017: https://simonjf.com/writing/acca.pdf
“Channel- and actor-based programming languages are both used in practice, but the two are often confused”
Good point! I think I read that paper when it first came out and then gradually slid back into conflating the two. I’ll edit my comment.
Akka is one thing (it is btw written in Scala and not Java, but you can use it from Java).
The other solution is immutability / functional programming, which comes at a performance cost though.
So if I understand this right, the intended usage is that you model the expected behavior, you instrument your real code to produce the traces and then you deploy it or fuzz it to get a lot of traces you can check for linearizability violations?
In practice what’s the easiest way to capture the traces? Just log some JSON? Derive it from otel spans?
You do need to ensure that whatever system records the history of events is itself Linearizable (hand-waving a bit here; you can actually get away with slightly weaker properties). That could be a single-node program which records invocations and completions in a Linearizable in-memory structure. Or you could journal them (synchronously!) to a Linearizable database before invocation and after completion, in production.
What you can't do is, say, stamp log lines or OTEL events with local node clocks–those aren't necessarily synchronized, and the checker might tell you your program violated Linearizability when it didn't. You also can't send them to a database asynchronously, because the message recording the invocation might be delayed and arrive at the database after the operation was actually performed. You also can't use a Serializable or Causally consistent database, because it might fail to record your events in real-time order. A Sequential log is OK, but the scope of the Sequential object has to be the entire log; otherwise you could (you guessed it!) get reorderings.
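To make the single-node option concrete, here's a minimal sketch (names are illustrative, not from any particular tool) of a recorder where every invocation and completion goes through one mutex-protected, append-only log, so the recorded order is itself linearizable.

```rust
use std::sync::{Arc, Mutex};
use std::time::Instant;

#[derive(Debug)]
enum Event {
    Invoke { client: usize, op: String, at: Instant },
    Complete { client: usize, result: String, at: Instant },
}

// One lock around an append-only log gives a total order on recorded events
// that is consistent with real time on this node.
#[derive(Clone)]
struct History(Arc<Mutex<Vec<Event>>>);

impl History {
    fn new() -> Self {
        History(Arc::new(Mutex::new(Vec::new())))
    }
    fn invoke(&self, client: usize, op: &str) {
        self.0.lock().unwrap().push(Event::Invoke {
            client,
            op: op.to_string(),
            at: Instant::now(),
        });
    }
    fn complete(&self, client: usize, result: &str) {
        self.0.lock().unwrap().push(Event::Complete {
            client,
            result: result.to_string(),
            at: Instant::now(),
        });
    }
}

fn main() {
    let h = History::new();
    // Record the invocation before issuing the operation, and the completion
    // only after the response has arrived.
    h.invoke(0, "put(k, 1)");
    // ... call the system under test here ...
    h.complete(0, "ok");
    println!("{:?}", h.0.lock().unwrap());
}
```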
Another thing to be aware of is that if you log to the same datastore that’s under test, you could influence the results. For example, if your data store has an issue with stale reads in read-only transactions, and you add an “insert log line” between every read transaction down the same connection, you could see the bug become invisible.
In practice is it enough to synchronize the clocks using something like ntp?
I mean… no? Imagine you detected a Linearizability violation in a history recorded using timestamps on, say, Linux machines. From this you could conclude that either a.) the system being measured violated Linearizability, or b.) the history-collecting system had, at some time, poor synchronization; or the VM paused; or the OS paused; or the application paused; and so on. It doesn’t allow you to make a provable claim of correctness or incorrectness.
Maybe?
If you’re talking about something like EC2 timesync, where clock errors are ~50us, and your storage system is remote and best-case a couple hundred microseconds, and you’re careful about which end of the clock bound to pick, you can still get good results.
NTP in general, probably not, unless you’re very careful about the way you set it up and have great time infrastructure.
would it work to use lamport clocks? or stamp messages with the ones they’ve seen and then topologically sort (functionally the same thing but different)?
Sort of. You can use messages between nodes and a strategy like Lamport clocks to establish a partial order which is consistent (because messages always flow forwards in time) with the real-time order. However, you’d fail to detect violations of Linearizability over timescales shorter than the message propagation interval.
hrm—forgive me if these are stupid questions, but
wouldn’t any really linearisable logging scheme induce synchronisation between nodes, potentially also causing you to fail to detect some bugs? are there bugs that are still systematically more likely to be caught or can only be caught that way? (it seems not very nice because it sacrifices scalability)
IIUC, the difference between sequential consistency and linearisability is effectively that linearisability is consistent with communication ‘outside the system’ where seqcst doesn’t have to be. and lamport clocks or causal ordering should not miss sequential consistency violations. but if there are any classes of ‘external communication’ that we’re interested in in some context, couldn’t we try to bring those into the testing/logging system too? (i think this is similar to using a linearisable logging system, except that you don’t record all the ordering relationships—so you don’t get to full linearisability—only some of them, but what you do record you get in a distributed way)
Yep! It’s actually impossible to measure precisely. Jepsen does this by putting all clients on a single node, allowing fast connections between clients and servers, and (optionally) adding latency between servers; this lets it see errors “faster than light”, so to speak.
In general, I wouldn’t worry about scalability. Linearizability checking is NP-hard and relies on total histories–both of those are already bad for scalability. Even if that weren’t the case, you’ll get the most bang for your buck testing in a controlled environment where you can induce failures, rather than in production, where failures are (I hope!?) comparatively infrequent.
I’m not sure what this means exactly, but I should note that Causal is, in general, weaker than Sequential.
Sure, you could do that, but then the property you’re measuring is something like Causal or Sequential, rather than Linearizable.
Yes you’ve got it! If the project is in Go you could just write the fuzzer and test harness in Go. Otherwise yes you’d want to write some intermediate format from your system and then write a Go program to convert it for Porcupine.
People are asking what formal methods Amazon is using. My understanding (as a non-Amazon person) is that the answer is: “a bit of everything”. They have been hiring people with expertise in automated theorem provers (the main Z3 author moved to Amazon), but also in verified-programming systems like Dafny (the main author moved to Amazon), proof assistants, several of them (HOL Light, Coq, Lean), modelling tools like TLA+, etc. A bit of everything. From the outside it feels like they have many different (sub)teams interested in formalization problems, or many different sub-problems that justify them, and people have a lot of freedom to pick the approach they think is best as long as it delivers results. They are also funding collaboration with academia to try out other approaches, new tools, etc.
There are more details, for example, in this PDF document, which comes from one of those teams and mentions again many different tools for formal verification.
By the way, the main HOL Light author (John Harrison) also moved to Amazon.
Assuming you’re referring to Leonardo de Moura, he was involved in the early days from 2012-2014 (self-describes as the main architect) but since then he moved onto developing Lean and Nikolaj Bjørner has been the principal developer along with Lev Nachmanson and Christoph Wintersteiger.
A historical note: Z3 is older than that. The classic paper on it places its first release in 2007, and it’s been leading the SMT competition for a very long time. I expect that development must have started somewhere between 2003 and 2005, but don’t quote me on that. Work on Lean started in 2013 so that must be roughly when Leo de Moura started working on it.
Yes, that’s pretty accurate. Bit of everything, whatever we need to solve the business problem.
In addition, we've got some internal tools that aren't available externally (yet), and an increasing investment in combining automated reasoning (AR) techniques with neural/AI/LLM/whatever you wanna call it techniques. Missing from your list on the modelling tools side is P (https://p-org.github.io/P/whatisP/), which is also primarily developed at Amazon right now, and is widely used internally for distributed systems stuff.
This has, historically, been AWS’s approach to this kind of “researchy” work. We’ve really tried to optimize for technology transfer, making sure the benefits of the work end up in production and benefiting customers, over some other concerns. I think that’s been very successful on the whole, although isn’t without its trade-offs compared to other models.