1. 14
  1.  

  2. 8

    Redlock assumes a semi-synchronous system model where different processes can count time at more or less the same “speed”.

    This is flawed reasoning. See https://aphyr.com/posts/299-the-trouble-with-timestamps.

    1. 1

      You’re correct that it is a huge problem in distributed systems, but the reasoning itself is not flawed - just something that needs to be known and compensated for.

      Most systems that depend on distributed nodes keeping roughly the same time handle the problem by using timeframes much longer than would be affected by small deviations in clock rate or synchronization. With properly configured NTP, for example, it’s generally pretty safe to assume your clocks are in sync to within a couple of milliseconds on a LAN, so if you build into the protocol a safety factor much greater than any reasonable deviation, you can achieve good (but not perfect) safety. For example: acquire a lock for 30 seconds, from timestamp A to B; after A, wait 5 seconds; do your work to prepare to commit; but abort if fewer than 5 seconds remain before B. Unless another system on the network is so far out of sync that it believes the lock has already expired, the system should be safe.
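
      As a rough sketch, the 30-seconds/5-seconds example might look like the following in Python. Everything here is hypothetical - the callback names, the default TTL, and the safety margin are illustrative, and actual lock acquisition (Redlock or otherwise) is elided; the point is only the “abort unless a full safety margin remains” check:

```python
import time

def do_locked_work(acquire_time, prepare, commit, abort,
                   ttl=30.0, safety=5.0):
    """Run prepare/commit under a lock held since `acquire_time`
    (a time.monotonic() reading taken when the lock was granted),
    aborting unless at least `safety` seconds remain before the
    lock's `ttl` expires.  The 30 s / 5 s defaults mirror the
    numbers in the comment above; `safety` must dwarf any plausible
    clock skew between nodes for this to be safe in practice.
    """
    deadline = acquire_time + ttl
    time.sleep(safety)                  # wait out the early-skew window
    state = prepare()                   # do the work, but don't commit yet
    if time.monotonic() > deadline - safety:
        abort(state)                    # too close to expiry: bail out
        return False
    commit(state)
    return True
```

      The key design choice is that `safety` is much larger than any clock deviation you expect to see, so a commit only happens well inside the window every node agrees the lock is held.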

      It does mean that your system is vulnerable to clock/network/config issues, though, so those need to be monitored. I once saw a system mistakenly configured to use some random external NTP source (which should’ve been blocked by the network config anyway, but wasn’t) that was out of sync with the rest of the network by tens of seconds. One of the nice things about NTP is that it gives you an estimate of how far out of sync you are - such nodes should remove themselves from service if they’re approaching the safety bounds or their skew is unknown due to failures to sync.
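
      The self-removal policy described might be sketched as below. The function name and the 50 ms bound are made up; the offset estimate would come from the local NTP daemon (e.g. `ntpq -c rv` or `chronyc tracking` output), and that parsing is elided:

```python
def may_serve(estimated_offset_ms, sync_ok, max_skew_ms=50.0):
    """Decide whether this node may keep serving lock-protected work.

    `estimated_offset_ms` is the clock offset the local NTP daemon
    reports; `sync_ok` is False when recent sync attempts have failed,
    which means the real skew is unknown.  `max_skew_ms` is a
    hypothetical safety bound well inside the protocol's margin.
    """
    if not sync_ok:
        return False                    # unknown skew: remove from service
    return abs(estimated_offset_ms) <= max_skew_ms
```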

      1. 6

        While your response might be true, Martin’s analysis covered it fairly well by saying that if mutual exclusion is important for the correctness of your system, Redlock is not a good choice. Saying things “should” be safe and hoping that your CPU doesn’t hit a bug one day and start telling the time wrong is an insufficient guarantee. It still means that Redlock is broken (or at least that’s the claim) in the face of a known possible error mode, an error mode with known solutions.

    2. 6

      I want to mention again that, what is strange about all this, is that it is assumed that you always must have a way to handle the fact that mutual exclusion is violated. Actually, if you have such a system to avoid problems during race conditions, you probably don’t need a distributed lock at all, or at least you don’t need a lock with strong guarantees, but just a weak lock to avoid, most of the time, concurrent accesses for performance reasons.

      I’m not sure I understand this paragraph. He is saying that there is no way to handle race conditions, and if you have a way to handle them you don’t need a distributed lock, but isn’t that what a distributed lock is attempting to solve?

      Despite starting off by talking about the exactness of mathematics, the end result seems to boil down to this statement:

      Can a process count relative time with a fixed percentage of maximum error? I think this is a sounding YES, and is simpler to reply yes to this than to: “can a process write a log without corrupting it”?

      And the reasoning behind this comes down to “Just don’t make mistakes”. What about hardware issues? What if you do make a mistake? That is why they are mistakes, after all. I think this reasoning is pretty weak and there is a lot of evidence to show that you simply cannot depend on it.

      1. 1

        I’m not sure I understand this paragraph. He is saying that there is no way to handle race conditions, and if you have a way to handle them you don’t need a distributed lock, but isn’t that what a distributed lock is attempting to solve?

        I think you mostly get it - what he’s trying to say is that there are other ways to handle race conditions than distributed locking, but if you’re using those, then you don’t need a distributed lock in the first place, or at least don’t need it to be strongly correct, since your other mechanism provides correctness.

        And the reasoning behind this comes down to “Just don’t make mistakes”. What about hardware issues? What if you do make a mistake? That is why they are mistakes, after all.

        When designing distributed systems, the system’s ability to function does often depend on “don’t make mistakes”, so that much is certainly true. The other part of it, though, is “detect and deal with mistakes”, and that’s something that is possible in many distributed systems. For example, let’s say you have a hardware fault which causes the clock to run incredibly slowly - this can be detected when NTP attempts to step the clock (out of sync by more than 128 ms, so it gives up on slewing and just resets the clock outright).

        It is certainly true that undetected Byzantine failures are a danger in distributed systems, but that’s pretty much always going to be the case. See, for example, the “FLP” paper: http://cs-www.cs.yale.edu/homes/arvind/cs425/doc/fischer.pdf The important thing about results like FLP is that while the possibility of nontermination cannot be eliminated, that doesn’t mean it’s impossible to build systems that make correct progress most of the time. Paxos and other consensus protocols may fail to accept some proposal for a variety of reasons - but they will never accept an invalid proposal, so in the general case proposals can simply be retried until the system recovers, although sometimes that recovery requires detecting and removing misbehaving nodes by some external means. As with the “CAP theorem”, distributed systems have properties around “liveness” (the ability to make progress) and “safety” (the guarantee of never accepting incorrect messages). FLP proves that in an asynchronous system we can’t guarantee liveness, but we can still be both safe and live when we’re not “making mistakes”.
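
        The retry-for-liveness idea can be sketched like this (names are hypothetical; `propose` stands in for one round of whatever consensus protocol you run). Rejections never violate safety - an invalid proposal is simply not accepted - so retrying with backoff is harmless, and FLP only says we cannot bound how many retries liveness will cost:

```python
import time

def propose_until_accepted(propose, max_attempts=8):
    """Retry a (hypothetical) consensus proposal until it is accepted.

    `propose()` returns True when the cluster accepts the proposal
    and False on a rejection or timeout.  A False result costs us
    liveness for one round but never safety, so we back off and retry.
    """
    for attempt in range(max_attempts):
        if propose():
            return True                              # accepted: progress made
        time.sleep(min(0.001 * 2 ** attempt, 0.05))  # exponential backoff
    return False                                     # liveness lost, for now
```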

        1. 1

          When designing distributed systems, the system’s ability to function does often depend on “don’t make mistakes”, so that much is certainly true

          Of course, nothing works if you make terrible mistakes, but the goal, IME, of any distributed system is to make the system less fragile. The issue with Redlock, it seems, is that it makes the whole system more fragile.