It strikes me that writing these distributed storage doohickeys without too many interesting bugs is incredibly difficult.
One thing I wonder about: if you ran the same kind of fault injection tests with ASAN- or UBSAN-compiled Scylla binaries, would that unearth potential crash/corruption bugs in rarely exercised code paths?
Simulation makes many of these distributed race conditions pop out in just a few milliseconds after hitting save. If you begin your distributed system codebase with the knowledge that race conditions need to be avoided, it’s actually pretty straightforward to tease them out before you ever open a pull request that introduces them. This technique, also known as discrete event simulation, is more common in fields where safety is prioritized, like automotive software. FoundationDB is a prominent user of it, and their track record speaks to the results.
But most distributed systems creators build their systems in a fundamentally bug-prone way that obscures their race conditions from them, and they tend to put off race condition testing. That leaves them with no alternative other than these black-box techniques, which are far, far less efficient in both CPU and human terms, yet nevertheless almost always yield paydirt because of the poor engineering practices the codebase started with. Kyle will continue to have success for a long time :)
I am not an expert in this space. It reads to me like you are suggesting that people establish a light interface that abstracts over the networking details. This interface (for example: receive and tick functions) allows people to write unit, property, fuzz, etc. tests that leverage it to exercise the system. This sounds very reasonable.
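If I’m understanding right, that interface could be as small as something like this (a rough Go sketch on my part; the `Node` and `Msg` names are just made up for illustration, not anything from the article):

```go
package sim

import "time"

// Msg is an opaque message between nodes; a real system would
// define concrete payload types instead of raw bytes.
type Msg struct {
	From, To string
	Payload  []byte
}

// Node is the narrow seam the tests drive. The production binary
// wires it to real sockets and timers; tests call it directly.
type Node interface {
	// Receive handles one inbound message and returns any
	// outbound messages the node wants delivered.
	Receive(m Msg) []Msg

	// Tick advances the node's notion of time; timeouts and
	// periodic work hang off this instead of wall-clock timers.
	Tick(now time.Time) []Msg
}
```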
On a somewhat related note, it frustrates me to no end that very few web frameworks allow me to inject a request and get back out a response without going over the network.
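For what it’s worth, Go’s standard library is one place that gets this right: net/http/httptest lets you hand a request straight to a handler and inspect the recorded response, no listener or port involved. For example:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

func main() {
	// Any http.Handler works here; no listener, no port, no network.
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "hello, %s", r.URL.Query().Get("name"))
	})

	req := httptest.NewRequest("GET", "/greet?name=world", nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req) // inject the request, capture the response

	fmt.Println(rec.Code, rec.Body.String()) // 200 hello, world
}
```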
It reads to me like you are suggesting that people establish a light interface that abstracts over the networking details.
Kind of — though I’m not sure “abstracts over” is the best way to describe it, or even that “simulation” or “simulator” is the best way to describe what’s actually being suggested. I think the lede is buried here:
write your code in a way that can be deterministically tested on top of a simulator
The important thing is to implement each component of your system as an isolated and fully deterministic state machine. Nothing should be autonomous; everything should be driven by external input. This requires treating the network as a totally separate layer — something that provides input to, and receives output from, the system. It also requires treating time itself as a dependency, which can have a significant impact on design choices. For example, you can’t have a component that does something every N seconds; you have to model that as e.g. a function that should be called every N seconds. Likewise, you can’t use an HTTP client with an N-second request timeout; you have to e.g. use requests that are explicitly canceled by the caller once they determine a timeout period has elapsed.
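To make the time part concrete, here’s a rough sketch of what I mean in Go (all the names are invented for illustration, nothing from a real codebase). Time only ever enters as an argument, so a test harness can march the clock forward however it likes:

```go
package sim

import "time"

// Heartbeater is the "called every N seconds" style: it never starts
// its own timer, it just answers "given that it is now t, what should
// happen?".
type Heartbeater struct {
	Interval time.Duration
	lastSent time.Time
}

// Tick is driven by whoever owns the clock: a real scheduler in
// production, the test harness in simulation.
func (h *Heartbeater) Tick(now time.Time) (sendHeartbeat bool) {
	if now.Sub(h.lastSent) >= h.Interval {
		h.lastSent = now
		return true
	}
	return false
}

// Similarly, a request carries an explicit deadline and the caller
// decides when "now" is, rather than a client owning a hidden timeout.
type PendingRequest struct {
	Deadline time.Time
}

func (r PendingRequest) TimedOut(now time.Time) bool {
	return !now.Before(r.Deadline)
}
```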
This doesn’t require your system to follow the actor model, where components interact by sending and receiving messages. I think the specific framing of the article kind of obscures that.
If every component in the system is a side-effect-free state machine, then everything is also perfectly isolated. That means you can construct arbitrarily complex topologies, drive them with whatever sequence of operations you can construct, and verify system behavior, all in-memory and for cheap. This is, I guess, what’s meant by simulation. But, at least to me, that’s the obvious consequence of the design; it’s the design that’s worth focusing on.
I need to rewrite my simulation post to make a few aspects more clear, I think. Determinism is orthogonal to simulability. You get the race condition detection from some function that “scrambles” events that are concurrent. This can be accomplished with a fairly simple network abstraction, is on the whole far cheaper to implement than determinism, and is far, far cheaper to retrofit into a legacy system than full determinism, yet it’s what yields bug discovery paydirt.
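A sketch of the kind of “scrambler” I have in mind, in Go (illustrative only; the names are made up, and a real one would also drop, duplicate, and delay events, inject partitions, and so on):

```go
package sim

import "math/rand"

// Event is whatever the network abstraction buffers between nodes.
type Event struct {
	To      string
	Payload []byte
}

// ScrambleNet buffers in-flight events and hands them back in a
// randomized order, so concurrent deliveries get reordered on every
// run. This is the cheap piece: it does not require the rest of the
// system to be deterministic.
type ScrambleNet struct {
	rng     *rand.Rand
	pending []Event
}

func NewScrambleNet(seed int64) *ScrambleNet {
	return &ScrambleNet{rng: rand.New(rand.NewSource(seed))}
}

func (n *ScrambleNet) Send(e Event) { n.pending = append(n.pending, e) }

// NextBatch returns the currently concurrent events in shuffled order;
// the test loop delivers them to nodes and collects whatever the nodes
// send back in response.
func (n *ScrambleNet) NextBatch() []Event {
	batch := n.pending
	n.pending = nil
	n.rng.Shuffle(len(batch), func(i, j int) {
		batch[i], batch[j] = batch[j], batch[i]
	})
	return batch
}
```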
Determinism is what lets you get a non-flaky regression test, in some cases “for free”, which counts as a mechanically generated test. For this property to hold, your deterministic business logic must interact with your deterministic testing in such a way that modifying your business logic or tests does not change the “history” that the test induces. A common example of a deterministic test that nevertheless fails to hold this nice property is one that uses a simple random number seed as the deterministic replay key. When business logic or test logic changes, the seed drives different behavior. The seed is still nice for deterministic debugging before the code changes, but a passing run with that same seed afterwards doesn’t verify that the bug has been addressed, because the seed may simply no longer exercise the history that triggered it.
For something that preserves history a bit better, I tend to opt for tests that generate a sequence of actions, rather than a simple replay seed. This is still not enough to guarantee that your deterministic test is actually a regression test, but it’s a step in the right direction.
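Concretely, what I save from a failing run is the generated list of steps rather than just a seed, roughly along these lines (illustrative Go; the types and helpers are invented):

```go
package sim

import "math/rand"

// Action is one step a test applies to the system under test.
type Action struct {
	Kind string // e.g. "write", "read", "partition", "heal"
	Node string
	Key  string
}

// GenerateActions uses randomness only to *produce* the sequence.
// The sequence itself is what gets saved alongside a failing test,
// so it keeps its meaning even after the generator or the business
// logic changes.
func GenerateActions(seed int64, n int) []Action {
	rng := rand.New(rand.NewSource(seed))
	kinds := []string{"write", "read", "partition", "heal"}
	nodes := []string{"n1", "n2", "n3"}
	keys := []string{"a", "b", "c", "d", "e"}
	actions := make([]Action, n)
	for i := range actions {
		actions[i] = Action{
			Kind: kinds[rng.Intn(len(kinds))],
			Node: nodes[rng.Intn(len(nodes))],
			Key:  keys[rng.Intn(len(keys))],
		}
	}
	return actions
}

// Replay applies a saved sequence verbatim; apply is whatever drives
// the simulated cluster and checks invariants.
func Replay(actions []Action, apply func(Action) error) error {
	for _, a := range actions {
		if err := apply(a); err != nil {
			return err
		}
	}
	return nil
}
```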
So, simulation shuffles concurrent events during test, causing many interesting bugs to jump out very quickly. Determinism gives you replay for helping with understanding the issue. Sequence-of-events tests are nice for preserving history and improving the regression-test nature of a mechanically generated test. These combined have been known in engineering for decades as “discrete event simulation”.
True. But reading your post makes me think neither determinism nor simulability (woof, what a word) is actually the thing to focus on. They both feel like obvious emergent properties of a system that models all of its dependencies explicitly. That feels like the important thing — externalizing and injecting dependencies.
Great writeup as always. Thank you @aphyr. ♥️