1. 70
  1. 9


    Readers who found the distributed locking discussion interesting would also like [1]. A common pattern I’ve seen in AWS is using the DynamoDB lock client [2] just as Aphyr suggests, not to guarantee mutual exclusion but more as a best-effort to limit concurrent operations on some resource, and then delegating to the storage layer or some other service to guarantee idempotency and safety. In one instance over the past few months I witnessed an internal service that misused the DynamoDB lock client and sacrificed liveness in favor of safety, it just stopped doing work when another issue occurred.

    Disclaimer: I work for AWS but my opinions do not reflect my employer’s. I’m not an expert at distributed systems and always learning.

    [1] https://aws.amazon.com/builders-library/leader-election-in-distributed-systems/

    [2] https://github.com/awslabs/dynamodb-lock-client/blob/master/README.md
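
    To make that pattern concrete, here’s a rough sketch, loosely following the call shapes in the lock client’s README [2] (assuming a recent release built on the AWS SDK v2 DynamoDbClient). The table name, lease and heartbeat values, and the performIdempotentWrite helper are hypothetical; the point is that the lock only narrows concurrency, the conditional write at the storage layer is what actually provides safety, and failing to acquire the lock should mean back off and retry, not stop doing work.

        // Best-effort lock plus idempotent storage write, as described above.
        // Sketch only: table names, durations, and the helper are hypothetical.
        import com.amazonaws.services.dynamodbv2.AcquireLockOptions;
        import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClient;
        import com.amazonaws.services.dynamodbv2.AmazonDynamoDBLockClientOptions;
        import com.amazonaws.services.dynamodbv2.LockItem;
        import software.amazon.awssdk.services.dynamodb.DynamoDbClient;

        import java.util.Optional;
        import java.util.concurrent.TimeUnit;

        public final class BestEffortLockSketch {

            public static void process(DynamoDbClient dynamo, String resourceId)
                    throws InterruptedException {
                // Leases expire on their own, so a crashed holder can't block
                // others forever; a background thread heartbeats while we hold it.
                AmazonDynamoDBLockClient lockClient = new AmazonDynamoDBLockClient(
                        AmazonDynamoDBLockClientOptions.builder(dynamo, "lockTable")
                                .withTimeUnit(TimeUnit.SECONDS)
                                .withLeaseDuration(20L)
                                .withHeartbeatPeriod(5L)
                                .withCreateHeartbeatBackgroundThread(true)
                                .build());

                Optional<LockItem> lock = lockClient.tryAcquireLock(
                        AcquireLockOptions.builder(resourceId).build());

                if (lock.isPresent()) {
                    try {
                        // The lock only limits duplicate work; it can't guarantee
                        // mutual exclusion (the lease may expire while we're paused).
                        // Safety must come from the storage layer, e.g. a conditional
                        // write keyed on an operation id, so a repeated or concurrent
                        // attempt is a no-op.
                        performIdempotentWrite(dynamo, resourceId);
                    } finally {
                        lockClient.releaseLock(lock.get());
                    }
                } else {
                    // Someone else is probably working on this resource. Don't treat
                    // this as fatal and stop doing work entirely (that's the liveness
                    // failure I mentioned); back off and retry later instead.
                }
                // A real implementation would also close the lock client on shutdown.
            }

            // Hypothetical helper: e.g. a DynamoDB PutItem with a condition
            // expression, so the write applies at most once no matter how many
            // lock holders race.
            private static void performIdempotentWrite(DynamoDbClient dynamo,
                                                       String resourceId) {
                // ...
            }
        }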

    1. 6

      thanks for doing this aphyr! interesting as always

      1. 2

        The date is today’s date, but with the year 2019. Is this a year old?

        1. 11

          This is what I get for writing “2019-TODO” in October and only updating the TODO part before release!

          Should be fixed momentarily, I’ve just been waiting on gcloud to pick up the changes. Takes forever.

          1. 1

            A glimpse into how long it takes to work on these! Very curious about this now, actually…

            1. 35

              Each report generally involves 2-3 months of my full-time work before the report is “finished”, and from that point, there’s an up-to-three-month delay before publication, which gives folks a chance to make bugfixes, inform users of risks, test and cut releases, etc. I check in a day or two prior to publication to make last-minute updates–“this issue is resolved by version foo”, etc.

              There’s generally 1-2 days of research and documentation review, about a week to write a basic “it turns on” test harness for a new database, and another 2-16 weeks of exploration, refinement, new workloads, failure modes, etc. Throughout this process I’m in close contact with the client, showing them my findings, asking for help understanding expected behavior, suggesting documentation fixes or possible algorithmic issues, debating whether something is “really broken”, etc.

              I spend about a week actually writing each report–sometimes multiple weeks, depending on the complexity. Sometimes I write at the end based on my “lab notebook”–really, a text logfile. Other times, I try to write throughout the process. Concurrent with the writeup, there’s a lot of work involved in collecting type specimens for each error, condensing those specimens to something you can actually understand in the paper, writing up bug tracker tickets, figuring out which of the 20-odd development builds I tested included/fixed which bugs, etc.

              Client comments can add additional weeks of back-and-forth; with some clients I just get a “looks good, ship it!”, and with others, we go through literally dozens of drafts fine-tuning language and debating how to visualize or label different issues. I review my own work for structure, language flow, and typos with at least three passes, spoken out loud. For a ~10 page paper, that’s multiple hours, possibly more than a day, of work each pass.

              And despite my own proofreading, as well as having a half-dozen client reviewers, and occasionally peers, we still sometimes miss obvious things like “what year is it”! ;-)

              1. 1

                I’ve always wondered about your process to get these out. What percentage of the systems you test are open as opposed to proprietary systems about which you don’t post online?

                1. 1

                  Good question! I was worried, when I started out, that folks would try to weasel their way into keeping analyses from the public eye, and wrote a policy specifically to address this: https://jepsen.io/ethics.

                  Whether a system is open vs proprietary is a different question from whether I release a public analysis. I test completely open-source systems and completely closed-source ones alike, and they both get the same treatment at release time: test suite, examples, and the report are free for everyone.

                  Basically, when folks ask for Jepsen work, they choose whether they’d like an actual public analysis or not. If they say yes, it gets published (assuming I can physically do the work, get paid, etc.). Almost everyone chooses public. I think the last private Jepsen-analysis work I did was… one week in 2017?

                  In addition, sometimes I do internal consulting, like reading docs or talking to engineers about their plans for a DB or infra. There are also classes, trainings, and internal tech talks, most of which I don’t talk about publicly. I don’t think that stuff matters as much as my analysis work though–it’s generally not of public interest that I taught at FooCorp. :)

                  1. 1

                    Thanks aphyr! Nice to know that the percentage is quite heavy on the open side.

                    Since you mentioned trainings, I would love to be in one of your distsys classes. I’m not sure my employer would be willing to sponsor a class, but if I could attend one of the classes you organize in the open or with employers, that’d be great! :)

                    1. 1

                      They’re welcome to talk to me about running a class–I don’t have any open sessions planned right now though.