
    Worth mentioning: this might work some of the time, but in an asynchronous (i.e. real) network it can be unsafe. The described locking scheme does not actually ensure that no two processes hold the lock at the same time, and even if it did, it would not ensure that side effects, such as writes to block storage, are safe. Martin Kleppmann has a terrific overview of why “distributed locks” generally don’t do what people think, and what to do instead: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

      Thanks for this feedback. Actually, my design already takes this problem into account.

      Kleppmann warns that a process can freeze for an arbitrary amount of time due to garbage collection, network problems, CPU starvation, etc. My design mitigates this in two ways:

      1. By recommending a long TTL, on the order of minutes. Kleppmann cites an incident at GitHub where packets were delayed for 90 seconds; the default TTL I recommend is 5 minutes.
      2. By refreshing the lock from time to time and checking its health periodically. The refresh and health-check intervals must be short enough to prevent the very problems Kleppmann describes (see the sketch below).

      It’s described in section “Long-running operations”.
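      To make the refresh/health-check loop concrete, here is a minimal sketch. The `LockClient` interface, its method names, and the 10-second check interval are just stand-ins for illustration, not the actual API or defaults from my design:

      ```python
      import threading
      import time

      class LockClient:
          """Stand-in for the real lock client; method names are illustrative."""
          def acquire(self, ttl_seconds: float) -> bool: raise NotImplementedError
          def refresh(self, ttl_seconds: float) -> bool: raise NotImplementedError  # False if the lock was lost
          def release(self) -> None: raise NotImplementedError

      def run_with_lock(lock, work, ttl_seconds=300.0, check_interval=10.0):
          """Run `work` while holding the lock, refreshing and health-checking it.

          `work` receives a callable `should_abort`; it must check it at safe
          points and stop producing side effects once it returns True.
          """
          if not lock.acquire(ttl_seconds):
              raise RuntimeError("could not acquire lock")

          lost = threading.Event()

          def refresher():
              # Refresh well before the TTL expires; if a refresh fails, the
              # lock may have expired or been taken over, so signal the work
              # to abort before it performs further side effects.
              while not lost.wait(check_interval):
                  if not lock.refresh(ttl_seconds):
                      lost.set()
                      return

          threading.Thread(target=refresher, daemon=True).start()
          try:
              work(should_abort=lost.is_set)
          finally:
              lost.set()
              lock.release()
      ```

      The point of the long TTL plus a much shorter check interval is that short stalls never let the lock expire, while a failed refresh is noticed quickly and the work stops.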

      Kleppmann proposes fencing tokens, which do guarantee that such problems can’t occur, but they require support from every system the lock holder touches. My design doesn’t provide as strong a guarantee, but the chance of a conflict can be made arbitrarily close to zero by tuning the various timing settings (TTL, refresh interval).
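      For comparison, the fencing-token idea looks roughly like this. This is a toy, single-process illustration of Kleppmann’s scheme, not part of my design; the class names are made up:

      ```python
      import threading

      class FencedLockService:
          """Toy lock service that issues a strictly increasing token per grant.

          A real implementation would live in a consensus-backed service and
          only grant the lock to a new holder after the old one expires.
          """
          def __init__(self):
              self._mu = threading.Lock()
              self._token = 0

          def acquire(self) -> int:
              with self._mu:
                  self._token += 1
                  return self._token

      class FencedStorage:
          """Storage that rejects writes carrying a token older than one it has seen."""
          def __init__(self):
              self._mu = threading.Lock()
              self._highest_token = 0
              self.data = None

          def write(self, token: int, value) -> bool:
              with self._mu:
                  if token < self._highest_token:
                      return False  # stale holder (e.g. woke up after a long pause): reject
                  self._highest_token = token
                  self.data = value
                  return True

      # Client 1 acquires and then stalls; client 2 acquires later and writes;
      # client 1's late write is rejected because its token is stale.
      lock, store = FencedLockService(), FencedStorage()
      t1 = lock.acquire()
      t2 = lock.acquire()
      assert store.write(t2, "from client 2")
      assert not store.write(t1, "late write from client 1")
      ```

      This only works because the storage layer checks the token on every write, which is exactly the “sufficient support by all systems” my design cannot assume.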