A careful investigation of the raw history reveals that just prior to the read of 4, process 98 attempted to write 4, and received a failure code :unavailable:
Great writeup as always @aphyr!
Out of curiosity, what sort of pricing do you give for this sort of work, and is it per-time-period or per-product or per-defect found or what?
Thank you.
I usually charge for dedicated weeks of time, and we keep going as long as the client feels the work is fruitful, but it varies. Some work I try to do for free, or with a pre-arranged rate in installments.
Have you done CockroachDB yet?
Yep! Review is here: https://jepsen.io/analyses/cockroachdb-beta-20160829 (full disclosure: I work at cockroach)
Yet another great read. Thanks Kyle.
I have what feels like a really stupid question:
Isn’t it always legal for any DB client to claim that an operation failed when it actually succeeded? e.g. the op may have succeeded but the acknowledgement from the DB server got lost. I feel like I missed something obvious.
Yes, this is a subtle question! Two-generals implies that we cannot determine whether or not a message was received in a finite number of messages. This means that any request-response pattern has three possible outcomes: a definite success, a definite failure, or an indeterminate result (e.g. crash, timeout, …) which could be either successful or not; we can’t say. Jepsen is careful to treat each of these cases appropriately. In this particular case, the database returned a definite failure error, rather than an indeterminate one. When databases say something definitely didn’t happen, we hold them to it.
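The three outcomes described above can be sketched in a few lines. This is a minimal illustration in Python, not Jepsen's own code: `Outcome`, `DefiniteFailure`, and `classify` are hypothetical names, and the assumption is that the database documents one specific error as meaning "not applied".

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"      # definitely happened
    FAIL = "fail"  # definitely did not happen
    INFO = "info"  # indeterminate: may or may not have happened


class DefiniteFailure(Exception):
    """Hypothetical error that the database documents as 'not applied'."""


def classify(request):
    """Run a zero-argument callable and map what happened to an Outcome."""
    try:
        request()
        return Outcome.OK
    except DefiniteFailure:
        # The database itself asserted that the operation did not take place.
        return Outcome.FAIL
    except Exception:
        # Timeouts, dropped connections, unrecognized errors: we cannot tell.
        return Outcome.INFO
```

Only the error that is explicitly recognized maps to a definite failure; everything else falls through to the indeterminate case, which is the conservative choice.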
Thank you! ❤
So if it had returned an error code for which the documentation said “your update might have been applied despite this error. We can’t tell for sure” then you would have made the linearizability testing treat that as indeterminate rather than success or failure, and that history would have been considered legal?
Yes, exactly. By default, Jepsen treats all errors as indeterminate ones, and we allow both possibilities. You have to explicitly tell Jepsen that a certain error is a definite failure, and then it won’t consider that operation as one that could have happened, when it goes to check for correctness. :)
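To make the difference concrete, here is a deliberately simplified sketch in Python. It is not a linearizability checker and not how Jepsen is implemented: it ignores ordering entirely and only asks whether each value read could have come from some write that might have happened. The function names and the history layout are assumptions for illustration; the write of 3 and the reader's process number are made up, while process 98's write of 4 and the read of 4 come from the case discussed above.

```python
def possible_values(history, initial=None):
    """Values a read could observe, ignoring the order of operations."""
    values = {initial}
    for op in history:
        # Acknowledged and indeterminate writes may have taken effect;
        # definite failures are excluded from consideration.
        if op["f"] == "write" and op["type"] in ("ok", "info"):
            values.add(op["value"])
    return values


def unexplained_reads(history, initial=None):
    """Successful reads whose value no possible write can account for."""
    allowed = possible_values(history, initial)
    return [
        op for op in history
        if op["f"] == "read" and op["type"] == "ok" and op["value"] not in allowed
    ]


def history_with(write_4_type):
    """Build a tiny history where process 98's write of 4 has the given type."""
    return [
        {"process": 1, "f": "write", "type": "ok", "value": 3},  # hypothetical
        {"process": 98, "f": "write", "type": write_4_type, "value": 4},
        {"process": 2, "f": "read", "type": "ok", "value": 4},   # hypothetical reader
    ]


# Definite failure: nothing can explain the read of 4, so it is flagged.
flagged = unexplained_reads(history_with("fail"))

# Indeterminate: the write may have happened, so the read is acceptable.
accepted = unexplained_reads(history_with("info"))
```

With the write marked as a definite failure, `flagged` contains the read of 4; with the same write marked indeterminate, `accepted` is empty.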
Thank you. ♥