1. 15
  1.  

  2. 9

    I have to admit, I did raise an eyebrow at the idea that TigerBeetle’s correctness relies on well-behaved wall clocks during VM migration, in a database which also explicitly claims strict serializability. Viewstamped Replication is an established consensus protocol, so that’s not a red flag per se, but we know from Lamport 2002 that consensus requires (in general) at least two network delays. However, TigerBeetle claims “We also can’t afford to shut down only because of a partial network outage”, which raises some questions! Do they intend to provide total availability? We know it’s impossible to provide both strict serializability and total availability in an asynchronous network, but in a semi-sync model, maybe they can get away with something close, à la CockroachDB, where inconsistency is limited to a narrow window (depending on clock behavior) before nodes shut themselves down. Or maybe they’re only aiming for majority-available (which is what I’d expect from VR; it’s been a decade or so since I’ve read the paper and I only dimly remember the details), in which case… it’s fine, and the clocks are… just an optimization to improve liveness? I dunno, I’ve got questions here!

    1. 11

      Thanks for the awesome questions, @aphyr!

      The clocks are… just an optimization for the financial domain of the state machine, not for the consensus protocol.

      As per the post, under “Why does TigerBeetle need clock fault-tolerance”, we explain that “the timestamps of our financial transactions must be accurate and comparable across different financial systems”. This has to do with financial regulation around auditing in some jurisdictions, not with total order in a distributed systems context.

      As per the talk linked to in the post, we simply need a mechanism to know when PTP or NTP is broken so we can shut down as an extra safety mechanism; that’s all. Detecting when the clock sync service is broken (specifically, an unaligned partition, so that the hierarchical clock sync doesn’t work but the TigerBeetle cluster is still up and running) is in no way required by TigerBeetle for strict serializability; it’s pure defense-in-depth for the financial domain, to avoid running into bad financial timestamps.

      We also tried to make it very clear in the talk itself that leader leases are “dangerous, something we would never recommend”. And on the home page, right up front (because it’s important to us) we have: “No stale reads”. If you ever do a Jepsen report on TigerBeetle, this is something I guarantee you will never find! :)

      You might also recall that I touched on this in our original email discussion on the subject back on the 2nd July, when I suggested that CLOCK_BOOTTIME might be better for Jepsen to recommend over CLOCK_MONOTONIC going forward, and where I wrote:

      “If you’re curious why we were looking into all of this for TigerBeetle: We don’t plan to risk stale reads by doing anything dangerous like leader leases, but we do want a fault-tolerant clock with upper and lower bounds because […] these monotonic transaction timestamps are used […] for timing out financial two-phase commit payments that need to be rolled back if another bank’s payments system fails (we can’t lock people’s liquidity for years on end).”

      You can imagine TigerBeetle’s state machine (not the consensus protocol itself) as processing two-phase payments that look a lot like a credit card transaction with a two-phase auth/accept flow: you want the leader’s timestamps to be close enough to true time that the financial transaction either goes through or ultimately gets rolled back after roughly N seconds.
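
      To make the timeout concrete, here’s a rough sketch in Go (with made-up names; TigerBeetle itself is written in Zig, and its real state machine is more involved) of how pending two-phase transfers could be rolled back once the leader’s replicated timestamp passes their timeout:

        package main

        import (
            "fmt"
            "time"
        )

        // PendingTransfer models the first phase of a two-phase payment:
        // funds are reserved at the leader-assigned timestamp Reserved and
        // must be accepted before Timeout elapses, or the reservation is
        // rolled back and the liquidity released.
        type PendingTransfer struct {
            ID       uint64
            Amount   uint64
            Reserved uint64 // leader-assigned timestamp, in nanoseconds
            Timeout  uint64 // how long the reservation may stay open, in nanoseconds
        }

        // expirePending rolls back reservations whose timeout has elapsed,
        // judged against the leader's current timestamp `now`. Because `now`
        // is assigned by the leader and replicated through consensus, every
        // replica expires the same transfers at the same point in the log.
        func expirePending(pending map[uint64]PendingTransfer, now uint64, rollback func(PendingTransfer)) {
            for id, t := range pending {
                if now >= t.Reserved+t.Timeout {
                    rollback(t)
                    delete(pending, id)
                }
            }
        }

        func main() {
            pending := map[uint64]PendingTransfer{
                1: {ID: 1, Amount: 500, Reserved: 0, Timeout: uint64(30 * time.Second)},
            }
            // The leader's timestamp advances past the 30-second timeout.
            expirePending(pending, uint64(31*time.Second), func(t PendingTransfer) {
                fmt.Printf("rolling back transfer %d, releasing %d units of reserved liquidity\n", t.ID, t.Amount)
            })
        }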

      Hope that clarifies everything (and gets you excited about TigerBeetle’s safety)!

      P.S. We’re launching a $20K consensus challenge in early September. All the early invite details are over here and I hope to see you at the live event, where we’ll have back-to-back interviews with Brian Oki and James Cowling: https://www.tigerbeetle.com/20k-challenge

      1. 3

        This has to do with financial regulation around auditing in some jurisdictions

        Aha, thank you, that makes it clear. You’re still looking at potentially multi-second clock errors, right? It’s just that they’re limited to the window when the VM is paused, instead of persisting indefinitely?

        that leader leases are “dangerous, something we would never recommend”

        You know, I’m delighted you say this, because this is something I’ve discussed with other vendors and have encountered strong resistance to. Many of my clients have felt either that the hazard didn’t exist or that it was too improbable to worry about. I still don’t have any quantitative evidence to point to which suggests how frequently clock desync and/or skew occurs, or what kinds of user-facing impact resulted. I think CockroachDB choosing to up their default clock skew threshold to 250 (500?) ms was suggestive that something was going wrong in customer environments, and I’ve had private discussions to the effect of “yeah, we routinely see 100+ ms offsets in a single datacenter”, but that’s not something I can really cite. If y’all have done any sort of measurement to this effect, I’d love to read about it.

        in our original email discussion

        My terrible memory strikes again! Yes, I do recall this now, thank you. :-)

    2. 2

      Most of the time, Network Time Protocol (NTP), would correct our clocks for these errors. However, if NTP silently stops working because of a partial network outage, and if [database] keeps transacting, then we would be running blind

      Maybe an ignorant suggestion, but this excerpt suggests that, as long as NTP is working correctly, their database doesn’t need to worry. If that’s so, perhaps they could write a bulletproof NTP daemon and solve this issue once?

      1. 2

        The problem comes in when the network fails (or partitions itself) in different ways for the NTP daemon versus for TigerBeetle, especially because NTP is hierarchical whereas TigerBeetle is a 3- or 5-node cluster. We need to keep the partition faults aligned, which is why we need a detection mechanism aligned with TigerBeetle’s consensus, so the cluster can stop when NTP (or any clock sync service) effectively stops (as seen by no agreement on cluster time amongst the cluster clocks). It’s purely a fail-safe mechanism, complementary to the clock sync service.

        Partitions are always possible, so we can’t write a bulletproof NTP daemon (and we’re not trying to replace the clock sync service). At the same time, some partitions may affect only a minority of the cluster, whereas others may affect a majority. There’s no need to shut down the cluster if a majority of nodes have good clocks and only a minority of nodes are partitioned.
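
        For illustration, here’s a minimal sketch in Go (invented names and numbers; not TigerBeetle’s actual implementation, which is in Zig) of that fail-safe decision: keep running while a majority of the sampled clock offsets agree within some tolerance, and treat the clocks as broken only when no such majority exists.

          package main

          import "fmt"

          // haveClockQuorum reports whether a majority of the cluster's clocks
          // agree within `tolerance` milliseconds of each other. `offsets` holds
          // each replica's measured clock offset from this replica (including
          // our own offset of zero), as sampled over consensus messages;
          // clusterSize may exceed len(offsets) if some replicas are unreachable.
          // If a majority agrees, a minority of partitioned or badly synced
          // replicas is harmless; if no majority agrees, the cluster shuts down
          // as a fail-safe (the clocks are defense-in-depth for financial
          // timestamps, not something consensus itself needs).
          func haveClockQuorum(offsets []int64, tolerance int64, clusterSize int) bool {
              majority := clusterSize/2 + 1
              // Try each sample as a candidate "cluster time" and count agreement.
              for _, candidate := range offsets {
                  agree := 0
                  for _, offset := range offsets {
                      if d := offset - candidate; d >= -tolerance && d <= tolerance {
                          agree++
                      }
                  }
                  if agree >= majority {
                      return true
                  }
              }
              return false
          }

          func main() {
              // 5-node cluster: one clock is wildly off (its NTP source is
              // partitioned away), but four clocks agree within 50 ms: keep running.
              fmt.Println(haveClockQuorum([]int64{0, 12, -8, 30_000, 5}, 50, 5)) // true
              // Only two clocks agree: no majority on cluster time, so stop.
              fmt.Println(haveClockQuorum([]int64{0, 5, 30_000, -40_000, 90_000}, 50, 5)) // false
          }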