It’s nice that more people are leaning into deterministic simulation while building correctness-critical distributed systems.
It’s not clear to me what they mean by strong consistency for a system that replicates multiple concurrent uncoordinated modifications on different devices; it would be nice if they went into that claim a bit more.
yeah, the deterministic simulation is my favorite tech in the whole project. it’s caught all types of bugs, from simple logic errors to complicated race conditions that we would have never thought to test. I think there’s some interesting work out there to bring more of this “test case generation” style of testing to a larger audience…
It’s not clear to me what they mean by strong consistency for a system that replicates multiple concurrent uncoordinated modifications on different devices; it would be nice if they went into that claim a bit more.
ah, sorry this wasn’t worded too clearly. we broke the sync protocol down into two subproblems: 1) syncing a view of the remote filesystem to the clients and 2) allowing clients to propose new changes to the remote filesystem. then, the idea is that we’d solve these two problems with strong consistency guarantees, and then we’d use these protocols for building a more operational transform flavored protocol on top.
we took this approach since protocol-level inconsistencies were very common with sync engine classic’s protocol. we spent a ton of time debugging how a client’s view of the remote filesystem got into a bizarre state or why they sent up a nonsensical filesystem modification. so, it’d be possible to build a serializable system on our core protocol, even though we don’t, and that strength at the lowest layer is still really useful.
yeah, the deterministic simulation is my favorite tech in the whole project. it’s caught all types of bugs, from simple logic errors to complicated race conditions that we would have never thought to test. I think there’s some interesting work out there to bring more of this “test case generation” style of testing to a larger audience…
I’ve been digging into a whole bunch of approaches as to how people do deterministic simulation. I’m really curious—how does your approach work? Can you provide some sort of gist/code example as to how those components are structured?
ah, I don’t have a good code sample handy (but we’ll prepare one for our testing blog post). but here’s the main idea –
we write almost all of our logic on a single thread, using futures to multiplex concurrent operations on that thread. then, we make sure all of the code on that thread is deterministic given fixed inputs. there are lots of ways code can sneak in a dependency on a global random number generator or on time.
we define traits for the interfaces between the control thread and other threads, and we mock out external time behind a trait too.
then, we wrap each real component in a mock component that pauses all requests and puts them into a wait queue.
now, instead of just calling .wait on the control thread future, we poll it until it blocks (i.e. returns Async::NotReady). this means that the control thread can’t make any progress until some future it’s depending on completes. then, we can look at the wait queues and pseudorandomly unpause some subset of them and then poll the control thread again. we repeat this process until the test completes.
all of these scheduling decisions are made pseudorandomly from a fixed RNG seed that’s determined at the beginning of the test run. we can also use this seed for injecting errors, generating initial conditions, and “agitating” the system by simulating other concurrent events. the best part is that once we find a failure, we’re guaranteed that we can reproduce it given its original seed.
in fact, we actually don’t even log in CI at all. we run millions of seeds every day and then if CI finds a failure, it just prints the seed and we then run it locally to debug.
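roughly, the loop described above might look like the sketch below. everything here is a hypothetical stand-in rather than the actual sync engine code: a simplified Poll enum plays the role of futures 0.1’s Async, and the trait names and rand-based seeding are just one way to structure it.

```rust
// Hypothetical sketch of the seeded scheduling loop described above; not the
// real sync engine code. `Poll` stands in for futures 0.1's Async::NotReady.
use rand::{rngs::StdRng, seq::SliceRandom, Rng, SeedableRng};

enum Poll {
    Ready,    // the test scenario has run to completion
    NotReady, // blocked on some mocked request
}

/// The control thread's future: all protocol logic runs here, deterministically.
trait ControlFuture {
    fn poll(&mut self) -> Poll;
}

/// A mock wrapper around a real component (network, disk, timers) that parks
/// every request in a wait queue instead of completing it immediately.
trait MockComponent {
    fn pending(&self) -> usize;
    /// Complete (or fail) one queued request, unblocking the control future.
    fn unpause_one(&mut self, inject_error: bool);
}

fn run_simulation(seed: u64, control: &mut dyn ControlFuture, mocks: &mut [Box<dyn MockComponent>]) {
    // Every scheduling and fault-injection decision flows from this one seed,
    // so a failure can be reproduced exactly by re-running with the same seed.
    let mut rng = StdRng::seed_from_u64(seed);
    loop {
        // Drive the control thread until it can't make progress on its own.
        if let Poll::Ready = control.poll() {
            return;
        }
        // It blocked: pseudorandomly pick a mock with queued requests and
        // unpause one of them, occasionally injecting an error.
        let waiting: Vec<usize> = (0..mocks.len()).filter(|&i| mocks[i].pending() > 0).collect();
        let &i = waiting
            .choose(&mut rng)
            .expect("control future blocked with no pending mock requests");
        mocks[i].unpause_one(rng.gen_bool(0.01));
    }
}
```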
There are so many beautiful properties of a system that is amenable to discrete event simulation.
You can use the seed to generate a schedule of events that happen in the system. When invariants are violated, you can shrink this generated history to the minimal set that reproduces the violation, like how quickcheck does its shrinking (I usually just use quickcheck for generating histories though). This produces minimized histories that are usually a lot simpler to debug, as causality is less blurred by having hundreds of irrelevant concurrent events in-flight.
Importantly, separating the RNG from the generated schedule allows you to improve your schedule generators while keeping around the actual histories that previously found bugs and reusing them as regression tests. Otherwise, every time you improve your generator you destroy all of your regression tests, because the seeds no longer generate the same things.
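For a rough illustration of what that looks like with the quickcheck crate (assuming its 1.x API; the Event type and the stubbed invariant check here are invented):

```rust
// Sketch of generating histories with quickcheck and letting it shrink failing
// ones. The event type and invariant check are made up for illustration.
use quickcheck::{quickcheck, Arbitrary, Gen};

#[derive(Clone, Debug)]
enum Event {
    ClientWrite { key: u8, value: u8 },
    DeliverOneMessage,
    CrashNode(u8),
}

impl Arbitrary for Event {
    fn arbitrary(g: &mut Gen) -> Self {
        match u8::arbitrary(g) % 3 {
            0 => Event::ClientWrite { key: u8::arbitrary(g), value: u8::arbitrary(g) },
            1 => Event::DeliverOneMessage,
            _ => Event::CrashNode(u8::arbitrary(g) % 3),
        }
    }
    // No custom shrinker needed: Vec<Event>'s built-in shrinking drops and
    // shortens events, which is exactly what minimizes a failing history.
}

fn prop_invariants_hold(history: Vec<Event>) -> bool {
    // Replay the generated history through the deterministic simulation and
    // check the system's invariants; stubbed out here.
    let _ = &history;
    true
}

fn main() {
    // quickcheck generates Vec<Event> histories and, on failure, shrinks the
    // history itself rather than an RNG seed. The minimized history can be
    // kept verbatim as a regression test even after the generator changes.
    quickcheck(prop_invariants_hold as fn(Vec<Event>) -> bool);
}
```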
Instead of approaching it from a brute force exploration of the interleaving+fault space, it’s often far more efficient, in bugs found per instruction executed, to start with what has to go right for a desired invariant-respecting workload and then perturb this history to a desired tree depth (fault tolerance degree). Lineage Driven Fault Injection can be trivially applied to systems that are simulator friendly, allowing bugs to be sussed out several orders of magnitude more cheaply than via brute force exploration.
This approach can be millions of times faster than black-box approaches like jepsen: tests that would have taken jepsen days or weeks (usually with far less thorough coverage) run on an engineer’s laptop in a second or two.
Simulation is the only way to build distributed systems that work. I wrote another possible simulation recipe here but there are many possible ways of doing it, and different systems will benefit from more or less complexity in this layer.
thanks for the pointer for the molly paper, looks really interesting.
here’s another idea I was playing with a few months ago: instead of viewing the input to the test as a random seed, think of it as an infinite tape of random bits. then, the path taken through the program is pretty closely related to different bits in the input. for example, sampling whether to inject an error for a request is directly controlled by a bit somewhere in the input tape.
this is then amenable to symbolic execution based fuzzing, where the fuzzer watches the program execution and tries to synthesize random tapes that lead to interesting program states. we actually got this up and working, and it found some interesting bugs really quickly. for example, when populating our initial condition, we’d sample two random u64s and insert them into a hashmap, asserting that there wasn’t a collision. the symbolic executor was able to reverse the hash function and generate a tape with two duplicate integers in the right places within minutes!
but, we didn’t actually find any real bugs with that approach during my limited experiments. I think the heuristics involved are just too expensive, and running more black box random search in the same time is just as effective. however, we do spend time tuning our distributions to get good program coverage, and perhaps with a more white box approach that wouldn’t be necessary.
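a tiny illustrative sketch of the tape idea, with hypothetical names: decisions consume bytes from an explicit tape instead of an RNG, so mutating the input steers individual decisions directly.

```rust
// Illustrative sketch of the "input as a tape of random bits" idea; the names
// are hypothetical. Flipping a byte in the input flips one specific decision
// (e.g. one particular error injection).
struct Tape<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Tape<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Tape { bytes, pos: 0 }
    }

    /// Read one byte; past the end, act like an endless tape of zeros.
    fn next_byte(&mut self) -> u8 {
        let b = self.bytes.get(self.pos).copied().unwrap_or(0);
        self.pos += 1;
        b
    }

    /// One decision bit, e.g. "inject an error for this request?"
    fn coin(&mut self) -> bool {
        self.next_byte() & 1 == 1
    }

    /// A bounded choice, e.g. "which pending request gets unpaused next?"
    fn choose(&mut self, n: usize) -> usize {
        self.next_byte() as usize % n.max(1)
    }
}

fn main() {
    // The scheduler from earlier would take a Tape instead of an RNG and call
    // coin()/choose() wherever it previously sampled randomness.
    let input = [0b0000_0001u8, 7, 255];
    let mut tape = Tape::new(&input);
    assert!(tape.coin());
    assert_eq!(tape.choose(4), 3); // 7 % 4
    assert_eq!(tape.choose(10), 5); // 255 % 10
}
```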
I’ve also had disappointing results when combining fuzzing techniques with discrete event simulation of complex systems. My approach has been to have libfuzzer (via cargo fuzz) generate a byte sequence, with every 2-4 bytes serving as a seed for an RNG that generates one scheduled event in a history of external inputs. This works extremely well for relatively small codebases; a burst of Rust projects saw a lot of nice bug discovery when this crate was later released. But the approach never took off for my use in sled, where it dropped the throughput so much that the coverage wasn’t able to stumble on introduced bugs anywhere close to as fast as just running quickcheck uninstrumented.
I’ve been meaning to dive into Andreas Zeller’s Fuzzing Book to gain some insights into how I might be able to better apply this technique, because I share your belief that it feels like it SHOULD be an amazing approach.
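For reference, the rough shape of that harness under cargo fuzz looks something like the following; the System type and event generation are stand-ins for illustration, not sled’s actual fuzz target.

```rust
// fuzz/fuzz_targets/history.rs -- rough shape of the approach described above;
// the System type and event generation are hypothetical stand-ins.
#![no_main]
use libfuzzer_sys::fuzz_target;
use rand::{rngs::StdRng, Rng, SeedableRng};

#[derive(Default)]
struct System {
    ops_applied: u64,
}

fn apply_random_event(system: &mut System, rng: &mut StdRng) {
    // A real harness would pick a message delivery, crash, write, etc. here.
    system.ops_applied += rng.gen_range(0..4u64);
}

fn check_invariants(system: &System) {
    assert!(system.ops_applied < u64::MAX);
}

fuzz_target!(|data: &[u8]| {
    let mut system = System::default();
    // Every 4-byte chunk of fuzzer input seeds the RNG for one scheduled event,
    // so libfuzzer's byte-level mutations map onto edits of the event history.
    for chunk in data.chunks(4) {
        let mut seed = [0u8; 4];
        seed[..chunk.len()].copy_from_slice(chunk);
        let mut rng = StdRng::seed_from_u64(u32::from_le_bytes(seed) as u64);
        apply_random_event(&mut system, &mut rng);
    }
    check_invariants(&system);
});
```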
I don’t know anything about this project, but I do work on a system that has this property, I guess. There are lots of approaches but for me it’s just an exercise in designing the components carefully.
First, you want to draw strict borders between the core protocol or domain or business logic, and the way the world interacts with that core. This is a tenet of the Clean Architecture or the Hexagonal Architecture or whatever: the core stuff should be pure and only know about its own domain objects; it shouldn’t know anything about HTTP or databases or even the concept of physical time. Modeling time as a dependency in this way takes practice, it’s as much art as it is science, and it depends a lot on your language.
Second, you want to make it so that if the core is just sitting there with no input, it doesn’t change state. That means no timers or autonomous action. Everything should be driven by external input. This can be synchronous function calls — IMO this is the best model — but it can also work with an actor-style message passing paradigm. There are tricks to this. For example, if your protocol needs X to happen every Y, then you can model that as a function X that you require your callers to call every Y.
Once you have the first step, you can implement in-memory versions of all of your dependencies, and therefore build totally in-memory topologies however you like. Once you have the second step, you have determinism that you can verify, and, if you’re able to model time abstractly rather than concretely, you can run your domain components as fast as your computer can go. With the two combined you can simulate whatever condition you want.
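As a small sketch of the time-as-input, no-autonomous-action idea (the heartbeat protocol and names here are invented purely for illustration):

```rust
// Sketch of modeling time as an input and banning autonomous action; the
// heartbeat protocol here is invented purely to illustrate the shape.
use std::time::Duration;

/// What the core wants the outside world to do; the core never does I/O itself.
enum Output {
    SendHeartbeat,
    Nothing,
}

/// The core never reads a clock. It is handed "now" by whoever calls it.
struct Core {
    last_heartbeat_at: Duration,
    heartbeat_every: Duration,
}

impl Core {
    /// "X must happen every Y" becomes: a method the caller is required to
    /// invoke every Y in production, and whenever it likes in tests.
    fn tick(&mut self, now: Duration) -> Output {
        if now.saturating_sub(self.last_heartbeat_at) >= self.heartbeat_every {
            self.last_heartbeat_at = now;
            Output::SendHeartbeat
        } else {
            Output::Nothing
        }
    }
}

fn main() {
    let mut core = Core {
        last_heartbeat_at: Duration::ZERO,
        heartbeat_every: Duration::from_secs(1),
    };
    // A test can march synthetic time forward as fast as it likes.
    assert!(matches!(core.tick(Duration::from_millis(10)), Output::Nothing));
    assert!(matches!(core.tick(Duration::from_secs(5)), Output::SendHeartbeat));
}
```

In production a thin shell would call tick from a real timer and perform the returned Output against the real network; in a test, the harness calls it with whatever synthetic now it wants.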
I hope this makes sense. I’m sure /u/spacejam has a slightly? majorly? different approach to the challenge.
this is spot on! we still have periodic timers in our system, but we hook them into our simulation of external time. there’s some special casing to avoid scheduling a timer wakeup when there are other queued futures, but it more-or-less just works.
I totally agree that if you can make your core business logic a pure function, it dramatically improves your ability to understand your system. What you said about time is also possible for most things that flip into nondeterminism in production:
- random number generators can be deterministic when run in testing (see the sketch after this list)
- threads/goroutines can be executed in deterministic interleavings under test with the use of linux realtime priorities / instrumented scheduling / ptrace / gdb scripts etc…
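A sketch for the first bullet: pass the RNG in as a parameter, so production code gets OS-seeded randomness while tests pin a seed (the backoff function is an invented example).

```rust
// The logic takes its RNG as a parameter instead of reaching for a global one,
// so tests can pin a seed; pick_backoff_ms is an invented example.
use rand::{rngs::StdRng, Rng, SeedableRng};

fn pick_backoff_ms(rng: &mut impl Rng) -> u64 {
    rng.gen_range(10..100u64)
}

fn main() {
    // Production: OS-seeded, nondeterministic.
    let _prod = pick_backoff_ms(&mut rand::thread_rng());

    // Test/simulation: fully determined by the seed, so runs are reproducible.
    let a = pick_backoff_ms(&mut StdRng::seed_from_u64(42));
    let b = pick_backoff_ms(&mut StdRng::seed_from_u64(42));
    assert_eq!(a, b);
}
```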
Determinism can be imposed in test on things that need a bit of nondeterminism in production to better take advantage of hardware parallelism. You lose some generality: code compiled with instrumented scheduler yields, running in a deterministic schedule to find bugs, will trigger different cache coherency traffic and can mask some bugs if you’re relying on interesting atomic barriers for correctness, since the instrumented yield basically shoves a sequentially consistent barrier at every yield point and causes some reordering bugs to disappear. But that’s sort of out-of-scope.
There are some fun techniques that allow you to test things with single-threaded discrete event simulation but get performance via parallelism without introducing behavior that differs from the single threaded version:
- software transactional memory
- embedded databases that support serializable transactions
It seems trivial now but I remember trying dropbox just after launch and being awestruck at the simplicity of it. I loved the way they used the humble computer directory as the central part of the system. Made it so that people instantly knew how to use it.
hey, author here! happy to answer any questions here about the post, project, or rust :)
Did you guys look into TLA+ or similar systems to aid design shakedown? Seems like the perfect fit to complement your fuzz-style tests.
hey yeah we did. here’s a thread on HN with more details: https://news.ycombinator.com/item?id=22607270
Any tips on where to get started on this?
the threads on this post are a good place to start: https://lobste.rs/s/ob6a8z/rewriting_heart_our_sync_engine#c_8zixa2 and https://lobste.rs/s/ob6a8z/rewriting_heart_our_sync_engine#c_ab2ysi. we also have a next blog post on testing currently in review :)
Thank you! I’m looking forward to the next blog post too :)
here’s another pointer for interesting papers if you’re continuing down this path: https://people.eecs.berkeley.edu/~ksen/cs29415fall16.html?rnd=1584484104661#!ks-body=home.md
I’ve kind of put it on the backburner for now, but it’s good to hear that you’ve reached similar conclusions :)
Nice to hear I’m not totally off base :)
Using Rust’s async/await to deterministically drive asynchronous operations is pretty clever.