Gods, Jepsen is inspiring. How many people out there get paid to do high-quality, detailed, independent research on industrial software and then publish it for free? Those are some fuckin’ life goals right there.
The deep insight here is that the client and server together provide the guarantees the user sees.
Two consequences of this insight:
Users don’t really care if a bug happens at the client or the server. A bug is a bug. You have to fix issues at both layers.
You can solve problems using code at both the client and the server. The right split of logic makes the whole system simpler, faster, and more robust. The key boundary is the network, with its chaotic failure modes (request failure, re-ordering, duplication, corruption).
Your application starts at the call to the client library.
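Those network failure modes are easy to underestimate. Here is a toy sketch (plain Python, not the etcd/jetcd API; all names invented) of the core hazard the post discusses: a lost response plus a naive client retry turns one submitted operation into two executions, while a client-supplied idempotency key lets the server deduplicate — the fix spans both layers.

```python
import uuid

class Server:
    """Toy server: applies increments, deduplicating by request id."""
    def __init__(self):
        self.counter = 0
        self.seen = set()  # request ids already applied

    def apply(self, request_id, amount):
        if request_id in self.seen:   # duplicate delivery: already done
            return self.counter
        self.seen.add(request_id)
        self.counter += amount
        return self.counter

def naive_client(server, amount, lose_first_response=True):
    """Retries with a FRESH id per attempt -> duplicate execution."""
    server.apply(uuid.uuid4().hex, amount)        # response "lost"...
    if lose_first_response:
        server.apply(uuid.uuid4().hex, amount)    # ...so retry blindly

def idempotent_client(server, amount, lose_first_response=True):
    """Reuses ONE id across retries -> the server absorbs the retry."""
    request_id = uuid.uuid4().hex
    server.apply(request_id, amount)
    if lose_first_response:
        server.apply(request_id, amount)

s1 = Server()
naive_client(s1, 1)
print(s1.counter)  # 2: the single submitted increment ran twice

s2 = Server()
idempotent_client(s2, 1)
print(s2.counter)  # 1: the retry was deduplicated
```

The user never sees which half of the system held the bug or the fix — exactly the point: client and server are one system.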
This is, as always from Jepsen / @aphyr, great work. And it strongly makes its central point that
“From the perspective of an application, the client and servers together are the system. Safety errors in the client library can be indistinguishable from those in servers.”
If I might be so bold as to suggest one possible improvement for future posts: I found the diagrams in this post less useful than in most Jepsen posts. This time the problem isn’t in any of the arrows (quite a few bugs found by Jepsen are ultimately fixed by adding a lock to enforce a dependency, i.e. an “arrow”); instead (spoiler!) it’s in the assumption that a transaction submitted to the library will be executed at most once. That might be better visualized in a different fashion (perhaps just a list of the actually-executed serial transactions?)
(Aside: my commiseration on the situation described in footnote 1. That must have been frustrating to all involved!)
I thought about this one a bit–you can imagine drawing the “true serial execution” including retried transactions, but if you do that, the anomaly actually disappears! These diagrams make it obvious that the user-observable history actually does have a circular flow of information. I thought about including alternate versions of the diagrams, but decided that the prose explanation was clear enough on its own.
E.g., rather than setting up a complicated multiregion HA load balancer, consider whether the client library could just try multiple endpoints.
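As a concrete (purely hypothetical) sketch of that idea — endpoint names and helpers below are invented for illustration, not any real client library’s API — the client can hold the endpoint list and fail over itself:

```python
# Client-side failover: try each endpoint in turn instead of
# fronting the servers with an external HA load balancer.
class EndpointExhaustedError(Exception):
    """Raised when every endpoint failed; carries (endpoint, error) pairs."""
    pass

def call_with_failover(endpoints, send):
    """Attempt send(endpoint) against each endpoint in order; return the
    first success, or raise once every endpoint has failed."""
    errors = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as e:
            errors.append((endpoint, e))  # record failure, try the next one
    raise EndpointExhaustedError(errors)

# Simulated transport: only the third region is reachable.
def fake_send(endpoint):
    if endpoint != "https://etcd-c.example:2379":
        raise ConnectionError(f"{endpoint} unreachable")
    return "ok"

result = call_with_failover(
    ["https://etcd-a.example:2379",
     "https://etcd-b.example:2379",
     "https://etcd-c.example:2379"],
    fake_send,
)
print(result)  # ok
```

One caveat that ties back to the post: blindly failing over a non-idempotent request after an ambiguous failure (timeout rather than connection refusal) reintroduces the duplicate-execution hazard, so the retry policy still has to account for at-most-once semantics.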
These discussions always bring to mind the classic paper “End-to-End Arguments in System Design.”
And users are equally happy to see a problem solved regardless of whether it is the client or server that fixed it. :)
Footnote 1 is interesting: https://jepsen.io/analyses/jetcd-0.8.2#fnref1
To my knowledge, is this the first time a client hasn’t wanted to give official permission to publish the findings?
In any case, will that lead to changes in the contract negotiations for future efforts?
Clients routinely engage Jepsen for private analysis work; that’s just fine and the work is often never published. This was an odd case. Jepsen was asked to take a few days to re-run an existing test suite against the then-current version of etcd. We didn’t have budget or time for a full public analysis, but I volunteered to do that work–polishing the test suite, investigating anomalies, collecting exemplars, writing a report, editing, etc.–on my own time, after the engagement. The client was enthusiastic, but once I finished the report, the client’s legal department wouldn’t authorize the acknowledgement.
I’m not exactly sure what to change–in 9 years of doing this work it’s only happened once! One thing I’m considering is that when Jepsen takes on a client that asks for analysis of third-party systems, we include explicit contract terms giving permission to publicly acknowledge the company in any possible written report as having funded research into that system. That also avoids any sense of a system having been “tainted” by having some privately-funded contribution to its test suite.
In broader terms, I’m not sure what constitutes a meaningful contribution requiring a funding acknowledgement. This is generally simple: a single company, usually the vendor, funds the development or extension of a test suite and the time to investigate and write the report. That company gets acknowledged as the funder. But in a broader sense, something like 50-100 clients have funded improvements to Jepsen’s core libraries. At some point those contributions become diffuse–the library is a general-purpose commons and the clients generally have no idea what databases I’ll go on to test. Database-specific test suites also become commons, over time: the etcd test suite includes years of independent work by Jepsen, work funded by Stripe, work funded by the Linux Foundation, and work funded by an undisclosed client. My sense is that the purpose of a funding disclosure is to acknowledge why an investigation happened, rather than every past contributor, and that suggests Jepsen ought to disclose funding more on the basis of who paid for the investigatory work, rather than all past contributors to the software used in that investigation.
Lots to think about.
Thanks for the long answer and for humoring my curiosity! Makes sense to treat it that way and make it visible.
That is interesting behavior, especially from a CNCF (https://cncf.io/) company, which you’d hope would be more open to security and reliability concerns.
Stale bug policies continue to be a plague on the industry.