These release notes got long, so here are some messages to take away:
- trap - SIGINT behaves differently, a good bug found and root-caused by John Soo
- The echo builtin in YSH is too strict, which will lead to changes

I’d also like to emphasize the call for ideas about deterministic testing of concurrent systems, where one component is the Unix kernel.
This seems like it should have been done before but I don’t know of any work on it. I’m tired of flaky tests.
I know there is a POSIX file system model, although it’s not clear to me if it’s actually practical for testing – https://www.cl.cam.ac.uk/~pes20/rems/
But what about process scheduling and signals? The shell gets nondeterministic waitpid(-1) and signal input from the kernel. Both the shell and the kernel maintain process state.
It seems to me that the Unix kernel is the OG concurrent system and there should be a solution for this :) I suggested maybe a User Mode Linux or NetBSD rump kernel – maybe they can run with a deterministic seed or something like that? And inject faults?
Has anyone done that?
AFAIK things like the scheduler or the signal delivery infrastructure are inherently non-deterministic, in the sense that their canonical APIs simply don’t offer enough control to consumers to provide deterministic behavior.
If you want determinism against these kinds of things, my experience is that you generally have to orchestrate it yourself, in your application. Typically that’s done with an intermediating abstraction for each such dependency, which would have the real (non-deterministic) implementation in prod, and an alternative (deterministic) implementation or adapter in tests.
That abstraction will usually need an API which is different, and richer, than the API provided by the actual system. For example, scheduling process A and then process B does not guarantee that process A will start running first; similarly, sending signal X and then signal Y to a process does not guarantee that it will receive those signals in that order. A deterministic abstraction would need to do extra work to ensure invariants like those. And that extra work doesn’t seem to generalize very well.
edit: I glanced at the issues related to determinism and test flakes, and it seems like many of them are related to process substitution, often in pipelines. Processes created in this way —
cmd <(foo) <(bar) | baz <(quux) ...
— are inherently nondeterministic (ref), which means you can’t really make assertions about any specific output or output ordering. IMO this is best solved by fixing the test itself – if you need X to occur before Y, then you probably want to create some kind of execution fencing with e.g. wait.
Yeah, I think the number of syscalls a shell makes is small enough that we could actually make a little simulator for it
These kinds of things are pretty nice to do in Python too
I was thinking something like the sqlite VFS abstraction, but it’s a bit harder than that. The problem is fork(), and the fact that important stuff happens between fork() and exec(). Our tests might need something funky like a notion of what happens in a child process
I think something like that could get us 50% of the way there
But signals and process groups could require more of a kernel than I’d be willing to write, not to mention the mechanics of it :)
But I found some very recent work today that seems like it’s being used for real, and is open source, so I’ve put it in my queue to check out
https://lobste.rs/s/fzsxay/oils_0_15_0_big_contributions_more#c_0um7by
rr can generate deterministic replays of concurrent programs, but I don’t know if it’s deterministic to begin with; could be something to look into in any event. (Probably impractical to be deterministic at the level of instructions, but maybe not for syscalls, so it or something like it might do just fine for your purposes.)
It seems to me that, if you confine yourself to one seed for testing, then you are liable to overfit to it.
I just found this very recent work today:
Hermit: Deterministic Linux for Controlled Testing and Software Bug-finding
https://news.ycombinator.com/item?id=33708867
https://developers.facebook.com/blog/post/2022/11/22/hermit-deterministic-linux-testing/
The comments talk about rr, which I know about but haven’t used for anything real. My experience is that separate processes are hard for almost all debuggers, but we’ll see.

The idea would be to use different seeds, but given a single seed, you can reproduce the bug
This is a great 2014 talk about this strategy for a database:
“Testing Distributed Systems w/ Deterministic Simulation” by Will Wilson
https://www.youtube.com/watch?v=4fFDFbi3toc&t=1s
I think Tiger Beetle is doing some of that now, maybe @matklad can tell us about it :)
https://antithesis.com/
I’ve seen this, and it looks very interesting/credible… Although does it do anything for the kernel specifically? Or is it more about concurrent systems on top of the kernel? i.e., I thought some of the simulation required some kind of sqlite VFS thing, not something that works with any piece of code