It’s a nice piece. The only deficiency is toward the end on formal methods. The SPIN model checker has probably found more obscure protocol bugs, from system to hardware level, than about anything else. TLA+ has similarities to it with it finding all kinds of protocol bugs in projects that used it. They and/or papers/articles covering industrial use should probably be more prominent in references than the Coq work. The Coq stuff is relatively recent, hard to use, and barely used. Readers might be mislead as to what they should try. Answer is straight-forward: TLA+ due to easy tutorials + cost/benefits + tooling ecosystem; SPIN next due to proven use in industry w/ next-bast cost/benefit & tooling available; Coq-based stuff only if they know Coq, the protocol is relatively simple, and they have plenty of time.
EDIT to add references:
Thanks! Yeah, I agree, the post doesn’t really explain the model checking stuff. I was focusing more on checking implementations for correctness (there’s the formality gap when model checking, because people don’t actually test the implementation – e.g. Paxos has been verified in a model checker, but there are lots of buggy implementations out there).
Thanks for pointing out the SPIN model checker, I hadn’t read much about it before. And yeah, I totally agree, people should try model checking first. Verifying realistic implementations using proof assistants is impractical right now (the Verdi guys spent several person-years on Verdi and verifying Raft, and that was a really impressive engineering effort).
Another thing people might find interesting is how Amazon has made use of lightweight model checking in verifying their systems: http://lamport.azurewebsites.net/tla/formal-methods-amazon.pdf
Oh, yeah, Amazon is one of the main motivators I was thinking about. Their report generated a surge of new interest in TLA+. I also like the example they gave where the model-checker easily caught an error that would’ve taken 30+ steps in testing to reproduce. That really drives the value home.
Far as your concern about model-checking vs other methods, I submitted a link here before that confirms that which you might find interesting. It also is sort of a guide on what to look for with non-formal methods since those areas are known spots for trouble.
Oh neat, looks like an interesting read. Thanks!
Out of curiosity, when you evaluated Knossos performance, were you testing using histories with crashed clients, or happy-path histories where client operations succeed relatively quickly? Knossos makes some choices which optimize for the former, and I think the P-compositionality paper focuses on the latter, but it’s been a few years since I’ve been down in those papers guts. May need to revisit those assumptions if they were slower for your workload.
To make the comparison more fair to Knossos, I tested histories where you can’t take advantage of P-compositionality. (in short, P-compositionality is when a history is linearizabile iff all subhistories in a partitioned history are linearizable - e.g. with a map, you can partition by keys and check the subhistories independently, and that’s a lot faster)
I used test data from Jepsen etcd tests: https://github.com/anishathalye/porcupine/blob/master/test_data/jepsen
Here’s the quick-and-dirty benchmarking code I used: https://gist.github.com/anishathalye/a315a31d57cad6013f57d2eb262443f5 (basically, just timing knossos.core/linearizable-prefix).
Even where Knossos is slow, e.g. etcd_002.log and etcd_099.log (timed out after > 2 days). Porcupine seems to do fine, taking hundreds of milliseconds on a single core to check the histories.
Out of the ~100 tests, filtering for ones that Knossos finished in < 1 hour, we have a speedup of 1000x on Knossos’s fastest test (etcd_016.log) and a speedup of 20,000x on Knossos’s slowest test (etcd_040.log). And for the ones that timed out (because I didn’t want to run the tests for way too long), e.g. (etcd_099.log), Porcupine had a speedup of > 10^6.
I haven’t had time to look into Knossos’s implementation in detail and figure out exactly where Porcupine’s speedups are coming from, but that would be cool to do at some point. Jepsen/Knossos are obviously a way more complete solution for testing distributed systems, and it would be cool to speed up the linearizability checking aspect.
Ohhhhhh! Yeah, you’re using the original algorithm–that’s definitely slower. Try (knossos.linear/analysis model history) instead–that’s based on the JIT-linearization algorithm from Lowe’s paper, plus some additional optimizations–instead of performing union-find for compaction, we pre-compile the state space into a graph of pointer arrays, which turns the search into immutable pointer-chasing instead of running model code. There are… certain cases where the knossos.core algorithm is preferable (it offers better parallelism) but linear should be significantly faster. Still not good though; I’d like to sit down and figure out some alternative strategies.
And yeah, as you note, we don’t do P-compositionality in Knossos–that’s handled by Jepsen, which performs the decomposition into independent keys for maps, sets, etc, then hands individual histories to Knossos. Would be nice to fold into Knossos later though!
Last thing, if you wind up packaging this for distribution, I’d like to offer a hook in Jepsen so we can pass histories to it. If you’d like to define some sort of serialization format (JSON, tsv, EDN, protobufs, etc) for passing histories in and getting analysis results back, I can wrap that up in a Jepsen checker as an alternative strategy. :)
Oops, I didn’t know that. I redid the benchmarking with (knossos.linear/analysis model history), running Knossos on 6 cores and Porcupine on 1 core.
The benchmark results did change: Knossos completed every check significantly faster. On some tests, the new algorithm performed significantly better: e.g. on etcd_040.log, Porcupine has a speedup of 278x, as opposed to a speedup of 20,000x when comparing against the original algorithm (knossos.core/linearizable-prefix).
Porcupine still ran faster on all tests, though; the following is a summary of the results (over the ~100 Jepsen etcd tests):
Min speedup: 8.1x on etcd_002.log
Max speedup: 21,219x on etcd_067.log
Median speedup: 1000x
Ooh, that sounds cool! I’ll probably end up packaging this for distribution in a couple weeks, and I’ll definitely reach out to you once I have an API for communicating with Porcupine.