For testing Byzantine faults, it’s important to keep results actionable. https://github.com/madthanu/alice does a nice job of introducing realistic filesystem semantics for crash testing. You may need to run it on the Ubuntu version they recommend, as their patched strace is a little out of date. Other than that, it’s a pretty simple tool to use!
Jepsen is a distributed systems test first and foremost, but yes, for single-node faults, tools like Alice are nice. I actually spent a week or so on a research project involving filesystem-level faults, but it didn’t produce useful results within the time I had available.
Recovery correctness has a ton of implications for distributed systems. It’s vital for leader election in broadcast-replicated systems with progress-related invariants, like the ones Raft needs to enforce. I’ve also come across several popular (purportedly linearizable) distributed databases that will do things like start serving before their recovery process completes, returning stale reads just behind the in-progress recovery scan of the WAL. You’ll find gold if you look.
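To make the stale-read problem concrete, here’s a minimal Python sketch. `ToyStore` and its methods are hypothetical names made up for illustration, not code from any real database; it just assumes the WAL is an ordered list of (key, value) writes. The point is only that reads are refused until the replay has finished, instead of being answered from whatever the scan has reached so far.

```python
import threading


class ToyStore:
    """Toy key-value store that replays a write-ahead log (WAL) on startup."""

    def __init__(self, wal):
        self._wal = list(wal)  # (key, value) entries, oldest first
        self._data = {}
        self._recovered = threading.Event()

    def recover(self):
        # Replay every WAL entry in order; later writes overwrite earlier ones.
        for key, value in self._wal:
            self._data[key] = value
        # Only now has the in-memory state caught up with the log.
        self._recovered.set()

    def read(self, key):
        # Refuse to serve until replay is complete, rather than return
        # a value from a partially replayed log.
        if not self._recovered.is_set():
            raise RuntimeError("recovery in progress")
        return self._data.get(key)
```

The buggy systems are roughly what you’d get by dropping the check in `read`: a request that lands mid-replay sees an older value for a key whose latest write is still further along in the log.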
Wow, thank you, that sounds like a really interesting research opportunity!