    I quickly realized that without extensive unit test coverage (which I was not interested in writing for a learning experience), it was nearly impossible to determine whether I had achieved real correctness.

    Unit tests don’t help very much for this kind of thing. When I write consensus algorithms, I build them on a simulator that chooses seedable pseudorandom delivery times (or drops) for messages sent from any node to any other, iterating over the next scheduled message until a desired budget is exhausted.

    It’s sooooo powerful to build distributed algorithms around an interface like this (in any programming language):

    fn receive(
        current_time,
        incoming_message,
        from_address
    ) -> [(outgoing_message, to_address)];
    
    fn tick(current_time) -> [(outgoing_message, to_address)];
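    To make this concrete, here is a minimal sketch of a participant implementing that interface in Python (EchoNode and every name here are illustrative, not from any real system): it acks every message and heartbeats its peers on tick. Crucially, it performs no I/O and reads no clock of its own, so its outputs are a pure function of its inputs and local state.

```python
from dataclasses import dataclass, field

@dataclass
class EchoNode:
    """Illustrative participant: acks every message, heartbeats on tick."""
    address: str
    peers: list = field(default_factory=list)

    def receive(self, current_time, incoming_message, from_address):
        # Deterministic: output depends only on arguments and local state.
        return [(("ack", incoming_message), from_address)]

    def tick(self, current_time):
        # Timer-driven logic (leader elections, retries, ...) lives here.
        return [(("heartbeat", current_time), p) for p in self.peers]
```

    Because the node never touches the network or a wall clock directly, the same type can be wired to a real transport in production and to the simulator in tests.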
    

    My favorite recipe:

    • use quickcheck to generate a cluster with a set of weather events (partitions at certain times, slowness, probability of drops, etc…) that have a source address, destination address, start time, and end time
    • generate a set of participants that implement the above interface
    • generate a set of incoming client requests from a special client address, to a certain instance at a certain time
    • fuzzily call tick on the participants to get them to run any leader election logic etc…
    • for all outgoing messages (the return value of both functions in the interface) choose whether to drop the message based on the generated partition/weather set, and if not, choose a delivery time in the future. throw the message into a priority queue keyed on the delivery time
    • iterate over the priority queue, zipped with fuzzy+lossy tick intervals, calling the receive function for participants receiving a message and the tick function for participants whose tick interval has elapsed
    • after a certain time is up, look at the responses that clients received. these can be treated just like the history of a concurrent single-node system calling a concurrent data structure, in terms of simple linearizability testing! this is so powerful! you can very easily tell whether your system is linearizable by permuting the different requests (bounding the permutations by real-time order: a response observed before another request was sent must come first) and trying to find a serial ordering in which all of the observed client responses could have been produced. if no such ordering exists, the system is not linearizable.
    • you can step through this in logical time based on generated delivery times. this lets us run the simulation many thousands of times faster than we could run it in a “real cluster”
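    The scheduling steps above can be sketched as one loop. This is a toy version, assuming nodes expose the receive/tick interface; `simulate`, `drop_prob`, and the rest are made-up names, and a real simulator would track weather events per source/destination pair and time window rather than one global drop probability:

```python
import heapq
import random

def simulate(nodes, client_requests, seed=0, drop_prob=0.05,
             max_delay=10.0, tick_interval=5.0, end_time=1000.0):
    """nodes: {address: participant with receive()/tick()}.
    client_requests: [(send_time, message, to_address)].
    Returns responses observed at the special "client" address."""
    rng = random.Random(seed)  # seedable: any failure replays exactly
    queue, counter = [], 0     # priority queue keyed on delivery time

    def send(now, msg, frm, to):
        nonlocal counter
        if rng.random() < drop_prob:
            return             # weather: the network ate this one
        heapq.heappush(queue, (now + rng.uniform(0.0, max_delay),
                               counter, msg, frm, to))
        counter += 1

    for t, msg, to in client_requests:
        send(t, msg, "client", to)

    responses, now, next_tick = [], 0.0, 0.0
    while now < end_time and (queue or next_tick < end_time):
        if queue and queue[0][0] <= next_tick:
            # Deliver the next scheduled message in logical time.
            now, _, msg, frm, to = heapq.heappop(queue)
            if to == "client":
                responses.append((now, frm, msg))
                continue
            for out_msg, out_to in nodes[to].receive(now, msg, frm):
                send(now, out_msg, to, out_to)
        else:
            # Fuzzy tick interval: fire every participant's timers.
            now = next_tick
            next_tick += rng.uniform(0.5, 1.5) * tick_interval
            for addr, node in nodes.items():
                for out_msg, out_to in node.tick(now):
                    send(now, out_msg, addr, out_to)
    return responses
```

    Note that time only ever advances by jumping to the next scheduled event, which is why this runs so much faster than wall-clock time.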

    A Jepsen test takes 5 minutes to do a single run (after a month or more spent setting up the testing infrastructure and writing the code). We can do 10,000 of these simulated tests per second, and get a far gnarlier weather history. This lets engineers building the system run the simulation as part of a standard test suite that they execute locally before opening a PR, and then, when a desired invariant is violated, get a clear cluster trace of how things happened that should not have.
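    The linearizability check from the recipe can likewise be sketched as a brute-force search, which is fine for the short histories a single simulated run produces (`linearizable`, the history dicts, and the model callables are all illustrative names; serious checkers use much cleverer algorithms than full permutation):

```python
from itertools import permutations

def linearizable(history, model_factory):
    """history: operations with 'invoke'/'respond' times, an 'op' callable
    that applies the operation to a model, and the 'observed' result.
    Searches for a serial order, consistent with real-time precedence,
    under which a fresh model reproduces every observed response."""
    n = len(history)
    for perm in permutations(range(n)):
        # Real-time bound: if a responded before b was invoked, a must
        # come first in any valid serialization; reject flipped pairs.
        if any(history[perm[j]]["respond"] < history[perm[i]]["invoke"]
               for i in range(n) for j in range(i + 1, n)):
            continue
        model = model_factory()
        if all(history[i]["op"](model) == history[i]["observed"]
               for i in perm):
            return True  # found a legal serial order matching observations
    return False
```

    The same check works unchanged whether the history came from a simulated cluster or from threads hammering a concurrent data structure on one machine.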