Threads for jorangreef

  1. 2

    In hindsight, data logic should be in the database itself.

    This is the reason we are creating TigerBeetle [1] at Coil, as an open source distributed financial accounting database, with the double entry logic and financial invariants enforced through financial primitives within the database itself.

    This is all the more critical for financial data, because raw data consistency is not enough for financial transactions; you also need financial consistency, not to mention immutability.

    The performance you get from doing it this way is also much better: for example, around a million financial transactions per second on commodity hardware, with p100 latency around 10-20ms.


    1. 5

      Quoting p100 latency instead of p99 is a nice touch. :)

      1. 5

        :) Thanks. p100 is nice and easy to measure and there’s nowhere to hide. I learned it through firefighting an incident on a JS system with a 32 GiB heap — the p99 was a few ms but the p100 was a few minutes. After that I became much more interested in taking the max!

    1. 7

      Here’s Dominic Giampaolo from Apple discussing this back in 2005, before Linux fixed fsync() to flush past the disk cache —

      Here’s also a TigerBeetle thread I wrote on this last year, with more of the history and how various projects like LevelDB, MySQL, SQLite and different language std libs handle fsync on macOS —

      1. 3

        This is great info, thanks!

        I’m curious why you published it on Twitter rather than a blog/forum? Twitter is really super-awkward to read because everything’s broken up in little pieces, and after 30 seconds or so the page is blocked by a pop-up telling me to log in if I want to keep reading. 🤬

        1. 2

          It’s a pleasure! The thread was originally part of a Twitter poll:

          Would love to write this up more when I get time, along with some of our other storage fault stuff, over at — if you’re curious to learn more, you can get a glimpse there, and also in our linked GitHub repo, of some of the storage fault challenges we’re solving and how.

        2. 2

          People seem to be calling fcntl first and then falling back to fsync if that returns an error. What I do is call fsync first, then fcntl; any idea why that would be suboptimal? I figure it won’t take any longer because when fcntl runs the kernel cache has already been flushed, so it’s not like anything gets written twice.

          1. 2

            The fsync should only be a fallback in case of fcntl error. The extra syscall costs an extra context switch, and with fast storage devices like NVMe SSDs, which are now only slightly slower than a context switch, that time could be spent doing another I/O instead. It’s also extra bookkeeping for the OS, beyond the context switch.

            This is speculation from here on, but I’m guessing you might hit some rare edge cases if you swap the order. For example, the fsync or fcntl might have some intrinsic delay regardless of whether there is data to sync, or might make the filesystem do something weird. It might even be that macOS has no idea what to tell the device to flush if its own page cache has already been flushed. There was a similar bug like this a decade or two back on Linux relating to O_DIRECT, so I imagine it’s not far out that unconditionally combining both fsync and fcntl might be interesting.

            Unless there’s a reason not to, I would follow the status quo, so that large projects like SQLite can act as a canary for the same course of action, rather than taking a more interesting path that might never receive as many eyes.
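
            To make the ordering concrete, here is a rough C sketch of the status-quo approach (a hypothetical helper, not from any particular project; F_FULLFSYNC is macOS-specific, so on other platforms this degrades to a plain fsync):

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper following the status-quo ordering: on macOS, try
 * F_FULLFSYNC first to flush the drive's write cache; fall back to plain
 * fsync() only if fcntl() fails (e.g. on filesystems that don't support
 * F_FULLFSYNC). Elsewhere, fsync() is the only option.
 * Returns 0 on success, -1 on error. */
static int full_sync(int fd) {
#if defined(F_FULLFSYNC)
    if (fcntl(fd, F_FULLFSYNC) == 0) return 0;
    /* fcntl failed: fsync() at least flushes the kernel page cache,
     * even if the drive's own write cache is not flushed. */
#endif
    return fsync(fd);
}
```

            Note that the fallback runs only on fcntl error, matching the fcntl-first order discussed above.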

        1. 3

          Thanks to the author for an excellent post!

          I have to caveat this post with the fact that we are ignoring large swathes of failure modes.

          I work on a database [1] that has a strict storage fault model, and so I thought I’d add a few of these failure modes to the discussion, for example where I/O operations to and from disk may be: corrupted; misdirected to the wrong sector when writing or reading; rendered impossible because of a latent sector error; no-ops, performing no actual I/O while at the same time reporting success; or significantly delayed, sometimes by several seconds (gray failure).

          Beyond the obvious:

          A corrupt log checksum does not always imply a system crash, fsync() on macOS [2] does not in fact flush to the disk “platter”, and data may be corrupted at any time during its storage lifecycle (before being written, while being written, after being written, and while being read).

          Disk sector writes are also not atomic, for example, an Advanced Format 4096 byte sector write to a disk with an emulated logical sector size of 4096 bytes but a physical sector size of 512 bytes is not atomic, and would be split into 8 physical sector writes, which may or may not be atomic.

          The Linux kernel page cache is not always reliable and may misrepresent the state of data on disk after an EIO or latent sector error, and file system metadata (such as the size of the log file as stored in the inode) as presented by the kernel is not always reliable and may change at any time (the inode can simply be corrupted on disk). There should always be redundant storage of critical metadata such as the log file size, stored out of band rather than trusting inodes or the filesystem. Essentially, you need to treat the disk as a block device.

          However, the solution to most of these problems is “use checksums,” which is not that exciting, which is why I’m omitting it from this discussion.

          A better solution I believe is simply to start with a clearly defined storage fault model and to lean on the storage fault research that’s out there. In fact, the research suggests that systems that reduce the problem space largely to checksums do not tend to get storage faults right, and checksums (while critical) are the least of the solution.

          In fact, a large part of the solution after a clear model for safety is to use O_DIRECT [3], and to design protocols (and the whole system) so that they are storage fault-aware [4]. This means even as far out as the distributed consensus layer. Local storage faults propagate across distributed systems and break formal proofs of correctness for protocols such as Paxos and Raft [4], because these assume that stable storage is non-byzantine, so that replicated redundancy does not in fact imply fault-tolerance [5]. A single disk sector failure on one machine can result in global data loss or cluster unavailability.

          One also has to be careful to disentangle crashes from corruption when recovering from the log — to know whether to truncate or to repair — the checksum alone can’t make this distinction [4], and care needs to be taken when handling read/write syscall return values that indicate partial sector input/output especially where this interacts with different physical/logical sector sizes [6]. Another thing that’s critical for successful log recovery is to be able to read “around” faulty sectors to preserve durability and surface as much data as possible to higher layers [7].

          Checksums also need to be daisy-chained or hash-chained, like ZFS does, all the way back to the copy-on-write superblock. Otherwise, a read may simply read the wrong block without getting a checksum failure. There also needs to be proactive scrubbing of disks to detect and repair storage faults before they take out all copies. Checksums are of limited use if there’s no scrubbing, no fault detection in the first place.
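
          To illustrate why the parent-stored checksum catches misdirected reads, here’s a toy C sketch (FNV-1a standing in for a real checksum; this is not ZFS’s or TigerBeetle’s actual layout):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy sketch of parent-stored ("hash-chained") checksums. The parent
 * records the checksum it EXPECTS the child block to have, so reading
 * the wrong (but internally valid) block is still detected. */
static uint64_t fnv1a(const uint8_t *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

struct block {
    uint8_t payload[64];
    uint64_t child_checksum; /* checksum of the child block's payload */
};

/* Verify a child block against the checksum stored in its parent. */
static int verify_child(const struct block *parent, const struct block *child) {
    return fnv1a(child->payload, sizeof child->payload) == parent->child_checksum;
}
```

          Because the expected checksum lives in the parent rather than next to the data, a misdirected read that returns some other internally-valid block still fails verification, all the way up the chain to the superblock.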

          Cool, our hypothesis has been validated. This should be easy to fix: if we sync the log when we open it, we should guarantee that no unsynced reads are ever served

          Here’s an example of this in Zig [8] that Alex Miller of FoundationDB pointed out to me.

          If we want to improve our throughput, we need to find a way to turn multiple logical writes (Commands) into a single physical “batch.”

          This is also our own experience, and how TigerBeetle is able to process more than a million financial transactions a second. We reduce the problem to: how fast can we write sequentially to disk, and how fast can we send data across the network, while maintaining stability and working around slowly failing hardware?
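
          A toy sketch of the batching idea in C (illustrative names, not TigerBeetle’s actual API): pack many logical commands into one buffer, then issue a single sequential write for the whole batch:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BATCH_CAPACITY 4096

/* Hypothetical batch: amortize the cost of a physical write (and its
 * fsync) across many logical commands. */
struct batch {
    uint8_t buf[BATCH_CAPACITY];
    size_t len;   /* bytes used */
    size_t count; /* logical commands packed so far */
};

/* Append one logical command; returns 0 if the batch is full. */
static int batch_append(struct batch *b, const void *cmd, size_t size) {
    if (b->len + size > BATCH_CAPACITY) return 0;
    memcpy(b->buf + b->len, cmd, size);
    b->len += size;
    b->count += 1;
    return 1;
}

/* Flush the whole batch as one sequential write, then reset it. */
static ssize_t batch_flush(struct batch *b, int fd) {
    ssize_t written = write(fd, b->buf, b->len);
    if (written == (ssize_t)b->len) {
        b->len = 0;
        b->count = 0;
    }
    return written;
}
```

          The point is that one write (and one fsync barrier) now covers every command in the batch, so throughput scales with batch size rather than syscall count.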

          The easier way to think about this is that the MEMTABLE (or some other indexed structure) is the afterthought. It’s just the cache in front of where the real action is: the log.

          This is where TigerBeetle is at now. We literally have a deterministic MEMTABLE—deterministic so that we can do autonomous testing, à la FoundationDB—in front of the consensus log. With every consensus operation, we ratchet the LSM-Tree forward a little, deterministically, so that compaction is incremental and streaming, without any foreground latency stalls and without the need for write throttling. [9]

          We’re coding this live on Twitch every day (in Zig!) if you’d like to drop by and chat!


          [2] fsync on macOS flushes to disk cache only —

          [3] “Can Applications Recover from fsync Failures?” —

          [4] “Protocol-Aware Recovery for Consensus-Based Storage” —

          [5] “Redundancy Does Not Imply Fault-Tolerance” —

          [6] Partial logical sector reads/writes even when using O_DIRECT —

          [7] Reading around faulty sectors —

          [8] Opening a file safely —

          [9] Introducing TigerBeetle’s LSM-Forest —

          1. 3

            It’s a four-hour video - is there any summary/TL;DR or write-up available?

            1. 2

              Yes, the first hour is the high-level tour, the next hour is a code walk-through, and the last two are pair programming. We’ll have a 10-minute TL;DR talk on the LSM-Forest, with Q&A, at Jamie Brandon’s HYTRADBOI in April.

              In the meantime:

              1. A million financial transactions per second in Zig (45-min talk covering fault models and features)
              2. Deep dive into our consensus protocol (45-min talk with Aleksey Charapko’s DistSys reading group)
            1. 2

              This looks great! I’m all for database-centric architectures. I’m a big fan of Materialize (right in the process of introducing it in a project I’m working on) and followed Eve closely when it was still alive. (I created the little bouncing ball example that became a benchmark of sorts towards the end.) Will definitely try to participate in this!

              1. 2

                What are Materialize and Eve? I’m puzzled by “little bouncing ball” and relationship with databases — that sounds pretty interesting!

                1. 4

                  Materialize is a streaming database based on differential dataflow, from Frank McSherry and team. McSherry is well-known for his COST paper, “Scalability! But at what COST?”, where COST stands for “Configuration that Outperforms a Single Thread” [PDF].

                  1. 3

                    This is one of my favorite papers from Frank McSherry. The other is “A Cool and Practical Alternative to Traditional Hash Tables”, which is a great intro to cuckoo hashing, with a great analysis:

                1. 3

                  Really excited for this! Just submitted a talk about a new deterministic LSM-tree storage engine for TigerBeetle [1], written in Zig and using io_uring, with some new ideas for pipelined/incremental/streaming compaction and how the storage engine integrates with the distributed consensus protocol. @ifreund and I did a live stream [2] giving a high-level overview of some of these ideas today. Would love to talk more at hytradboi. Also awesome to see Frank McSherry on the list!



                  1. 2

                    That sounds really interesting. I’ll check out the video.

                  1. 2

                    I’ve experimented a little with the Eytzinger layout … problem is, the data I’m searching isn’t integers, it’s strings, so each array item is really a pointer to a string. I’m sure the string comparisons take so long they outweigh the differences in layout discussed here.

                    1. 1

                      Could you inline most strings that fit in e.g. 63 bits assuming you’re searching things like words in a dictionary? That way you can use the high bit to indicate a pointer jump to strings that don’t fit?

                      This would be the hash table equivalent of open-addressing with a little bit of chaining for the bigger stuff.
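
                      A rough C sketch of that tagging scheme (hypothetical, and assuming a little-endian 64-bit platform where user pointers fit in the low 63 bits):

```c
#include <stdint.h>
#include <string.h>

/* Toy tagged-word keys: strings of up to 7 bytes are stored inline in
 * the low 56 bits of a 64-bit word (NUL-padded), leaving the top bit
 * clear; longer strings set the top bit and carry a pointer instead. */
#define TAG_POINTER (1ULL << 63)

static uint64_t key_make(const char *s) {
    size_t len = strlen(s);
    if (len <= 7) {
        uint64_t k = 0;
        memcpy(&k, s, len); /* inline bytes, NUL-padded (little-endian) */
        return k;           /* top bit clear: inline */
    }
    return TAG_POINTER | (uint64_t)(uintptr_t)s; /* out-of-line */
}

static int key_is_inline(uint64_t k) { return (k & TAG_POINTER) == 0; }
```

                      Comparisons of inline keys then stay word-sized, with the pointer chase reserved for the rare long strings, much like open addressing with a little chaining for the big stuff.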

                      1. 2

                        I might try something like that. But it turns out the biggest performance constraint of all in my code is the space the table takes up, because it lives in a page in a b-tree, and the more keys I can fit on a page, the less disk I/O there is.

                        I keep trying to make my code fast, but when I look at profiles, they’re all dominated by pread and pwrite. So I really shouldn’t be reading papers like this one!

                    1. 3

                      Here’s the companion talk by Pat Morin:

                      The Eytzinger layout and search has got to be the coolest data structure and most elegant algorithm I’ve learned in a long time. Plus, such a great “by-the-way” dive into all the essential elements of writing high performance code, and how mechanical sympathy trumps complexity analysis, in a single case study that has application to so many computer science problems.
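
                      For anyone curious, the core of the search over an Eytzinger (BFS) layout can be sketched in a few lines of C (1-indexed array with a[0] unused; __builtin_ctzll assumes GCC/Clang):

```c
#include <stddef.h>

/* Branch-light lower bound over an Eytzinger layout: the sorted array
 * is stored in BFS order, so the children of node i live at 2i and
 * 2i+1, and each level of the implicit tree is one multiply-and-add
 * away. Returns the index of the first element >= key, or 0 if none. */
static size_t eytzinger_lower_bound(const int *a, size_t n, int key) {
    size_t i = 1;
    while (i <= n) {
        i = 2 * i + (a[i] < key); /* go right if a[i] < key */
    }
    /* Undo the final descent: strip the trailing 1-bits (right turns)
     * plus one 0-bit to recover the last node where we went left. */
    i >>= __builtin_ctzll(~(unsigned long long)i) + 1;
    return i; /* 0 means every element < key */
}
```

                      The inner loop is one multiply-and-add per tree level with no unpredictable branches, which is what makes the layout so friendly to prefetching in the paper’s benchmarks.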

                      1. 3

                        Coincidentally, Brian Oki, the author of this paper, did an interview last month with Loris Cro and myself on Zig SHOWTIME that was crystal clear:

                        If you’re a fan of the 2012 revision of the protocol, then the interview linked above is also followed by a roundtable with James Cowling, author of the 2012 paper, which was also discussed recently at Aleksey Charapko’s DistSys Reading Group:

                        1. 2

                          Just to share a little backstory on how this came about:

                          Back in 2019, I was doing some security work and had a lot of fun—learning and earning bounties for exploits in e.g. OpenSSL, Fastmail and Google Chrome. This adversarial approach stayed with me, so when it came time to implement TigerBeetle’s consensus protocol, we thought it would be really cool if people could learn consensus by breaking a real implementation, because it’s a different way in to engage with the subject matter. We’re usually taught the defensive “blue team” way, but this would be learning consensus the adversarial “red team” way.

                          So this is like the distributed systems equivalent of a security bug bounty program. If you can find a correctness bug that violates strict serializability or leads to data loss then you could earn a bounty of up to $3,000.

                          Consensus protocols are notoriously difficult to get right, so this is exciting for a bug bounty program, but TigerBeetle also has a challenging storage fault model, where the consensus protocol needs to survive scenarios where even writes to disk might be a no-op, or where writes (or reads) might be misdirected and written to (or read from) the wrong disk sector.

                          But to make it really realistic, our team is also providing some cool “red team” tools: e.g. a deterministic fault injection fuzzing simulator that can do state checking of all state transitions, and then replay anything interesting verbatim, over and over, so you can apply your skill to figure it out.
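
                          To give a flavor of what “deterministic” means here, a toy sketch in C (illustrative, not our actual simulator): a seeded PRNG drives the faults, so the same seed replays the same run bit-for-bit:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Toy deterministic fault-injecting disk: all "randomness" comes from
 * a PRNG seeded per run (seed must be nonzero), so any interesting run
 * can be replayed verbatim from its seed. */
struct sim_disk {
    uint8_t *data;      /* simulated disk contents */
    size_t sector_size;
    size_t sector_count;
    uint64_t rng;       /* PRNG state, seeded per run */
};

static uint64_t sim_rand(struct sim_disk *d) {
    /* xorshift64: fully deterministic given the seed */
    d->rng ^= d->rng << 13;
    d->rng ^= d->rng >> 7;
    d->rng ^= d->rng << 17;
    return d->rng;
}

/* Write one sector, sometimes injecting a storage fault. Always
 * "succeeds" from the caller's point of view, like misbehaving
 * firmware would. */
static void sim_write(struct sim_disk *d, size_t sector, const uint8_t *buf) {
    uint64_t r = sim_rand(d);
    if (r % 100 < 2) return;                       /* no-op write: data lost */
    if (r % 100 < 4) sector = r % d->sector_count; /* misdirected write */
    memcpy(d->data + sector * d->sector_size, buf, d->sector_size);
}
```

                          The replay property falls out of the seeding: two simulators given the same seed and the same operations end up with identical disks, which is what lets you rerun a failing schedule over and over.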

                          No pressure! ;)

                          1. 2

                            Most of the time, Network Time Protocol (NTP), would correct our clocks for these errors. However, if NTP silently stops working because of a partial network outage, and if [database] keeps transacting, then we would be running blind

                            Maybe an ignorant suggestion, but this excerpt suggests that as long as NTP is working correctly then their database doesn’t need to worry. If that’s so, perhaps they could write a bulletproof NTP daemon and solve this issue once?

                            1. 2

                              The problem comes in where the network fails (or partitions itself) in different ways for the NTP daemon versus for TigerBeetle, especially because NTP is hierarchical whereas TigerBeetle is a 3- or 5-node cluster. We need to keep the partition faults aligned, which is why we need a detection mechanism aligned with TigerBeetle’s consensus, so it can stop when NTP (or any clock sync service) effectively stops (as seen by no agreement on cluster time amongst the cluster clocks).

                              It’s purely a fail-safe mechanism, complementary to the clock sync service. Partitions are always possible, so we can’t write a bulletproof NTP daemon (we’re also not trying to replace the clock sync service). At the same time, some partitions may only affect a minority of the cluster whereas other partitions may affect a majority. There’s no need to shut down the cluster if a majority of nodes have good clocks and only a minority of nodes are partitioned.
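
                              As a toy illustration of that fail-safe (not TigerBeetle’s actual algorithm): given each node’s measured clock offset, check whether a majority of the cluster agrees within some tolerance window, and stop if not:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical majority-agreement check: offsets_ns[i] is node i's
 * clock offset relative to the local clock, in nanoseconds. Returns 1
 * if some node has a majority of the cluster within tolerance of it,
 * else 0 (no agreement on cluster time: fail safe and stop). */
static int clocks_agree(const int64_t *offsets_ns, size_t n,
                        int64_t tolerance_ns) {
    size_t majority = n / 2 + 1;
    for (size_t i = 0; i < n; i++) {
        size_t within = 0;
        for (size_t j = 0; j < n; j++) {
            int64_t diff = offsets_ns[j] - offsets_ns[i];
            if (diff < 0) diff = -diff;
            if (diff <= tolerance_ns) within++;
        }
        if (within >= majority) return 1;
    }
    return 0;
}
```

                              A partitioned minority with bad clocks then can’t force a shutdown, while a cluster with no majority agreement on time stops rather than running blind.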

                            1. 9

                              I have to admit, I did raise an eyebrow at the idea that Tigerbeetle’s correctness relies on well-behaved wall clocks during VM migration–in a database which also explicitly claims strict serializability. Viewstamped replication is an established consensus protocol, so that’s not a red flag per se, but we know from Lamport 2002 that consensus requires (in general) at least two network delays. However, Tigerbeetle claims “We also can’t afford to shut down only because of a partial network outage”, which raises some questions! Do they intend to provide total availability? We know it’s impossible to provide both strict serializability and total availability in an asynchronous network, but in a semi-sync model, maybe they can get away with something close, ala CockroachDB, where inconsistency is limited to a narrow window (depending on clock behavior) before nodes shut themselves down. Or maybe they’re only aiming for majority-available (which is what I’d expect from VR–it’s been a decade or so since I’ve read the paper and I only dimly remember the details), in which case… it’s fine, and the clocks are… just an optimization to improve liveness? I dunno, I’ve got questions here!

                              1. 11

                                Thanks for the awesome questions, @aphyr!

                                The clocks are… just an optimization for the financial domain of the state machine, not for the consensus protocol.

                                As per the post, under “Why does TigerBeetle need clock fault-tolerance” we explain that “the timestamps of our financial transactions must be accurate and comparable across different financial systems”. This has to do with financial regulation around auditing in some jurisdictions and not with total order in a distributed systems context.

                                As per the talk linked to in the post, we simply need a mechanism to know when PTP or NTP is broken so we can shut down as an extra safety mechanism, that’s all. Detecting when the clock sync service is broken (specifically, an unaligned partition so that the hierarchical clock sync doesn’t work, but the TigerBeetle cluster is still up and running) is in no way required by TigerBeetle for strict serializability, it’s pure defense-in-depth for the financial domain to avoid running into bad financial timestamps.

                                We also tried to make it very clear in the talk itself that leader leases are “dangerous, something we would never recommend”. And on the home page, right up front (because it’s important to us) we have: “No stale reads”. If you ever do a Jepsen report on TigerBeetle, this is something I guarantee you will never find! :)

                                You might also recall that I touched on this in our original email discussion on the subject back on the 2nd July, when I suggested that CLOCK_BOOTTIME might be better for Jepsen to recommend over CLOCK_MONOTONIC going forward, and where I wrote:

                                “If you’re curious why we were looking into all of this for TigerBeetle: We don’t plan to risk stale reads by doing anything dangerous like leader leases, but we do want a fault-tolerant clock with upper and lower bounds because […] these monotonic transaction timestamps are used […] for timing out financial two-phase commit payments that need to be rolled back if another bank’s payments system fails (we can’t lock people’s liquidity for years on end).”

                                You can imagine TigerBeetle’s state machine (not the consensus protocol itself) as processing two-phase payments that look a lot like a credit card transaction with a two-phase auth/accept flow, so you want the leader to timestamp roughly close enough to true time, so that the financial transaction either goes through or ultimately gets rolled back after roughly N seconds.

                                Hope that clarifies everything (and gets you excited about TigerBeetle’s safety)!

                                P.S. We’re launching a $20K consensus challenge in early September. All the early invite details are over here and I hope to see you at the live event, where we’ll have back-to-back interviews with Brian Oki and James Cowling:

                                1. 3

                                  This has to do with financial regulation around auditing in some jurisdictions

                                  Aha, thank you, that makes it clear. You’re still looking at potentially multi-second clock errors, right? It’s just that they’re limited to the window when the VM is paused, instead of persisting indefinitely?

                                  that leader leases are “dangerous, something we would never recommend”

                                  You know, I’m delighted you say this, because this is something I’ve discussed with other vendors and have encountered strong resistance to. Many of my clients have either felt the hazard didn’t exist or that it’s sufficiently improbable to worry about. I still don’t have any quantitative evidence to point to which suggests how frequently clock desync and/or skew occurs, or what kinds of user-facing impact resulted. I think CockroachDB choosing to up their default clock skew threshold to 250 (500?) ms was suggestive that something was going wrong in customer environments, and I’ve had private discussions to the effect of “yeah, we routinely see 100+ ms offsets in a single datacenter”, but that’s not something I can really cite. If y’all have done any sort of measurement to this effect, I’d love to read about it.

                                  in our original email discussion

                                  My terrible memory strikes again! Yes, I do recall this now, thank you. :-)