
    Thanks to the author for an excellent post!

    > I have to caveat this post with the fact that we are ignoring large swathes of failure modes.

    I work on a database [1] that has a strict storage fault model, so I thought I’d add a few of these failure modes to the discussion. For example, I/O operations to and from disk may be: corrupted; misdirected to the wrong sector when writing or reading; rendered impossible by a latent sector error; turned into no-ops that perform no actual I/O while still reporting success; or significantly delayed, sometimes by several seconds (gray failure).

    Beyond the obvious:

    A corrupt log checksum does not always imply a system crash, fsync() on macOS [2] does not in fact flush to the disk “platter”, and data may be corrupted at any time during its storage lifecycle (before being written, while being written, after being written, and while being read).
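
    To make the fsync() point concrete: on macOS a durable flush also needs fcntl(F_FULLFSYNC), which asks the drive itself to write out its cache. A minimal C sketch (illustrative only, not TigerBeetle’s code; the helper name is made up):

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Flush a file descriptor's data all the way to stable storage.
         * On macOS, fsync() only hands the data to the drive, which may keep it
         * in its volatile cache; F_FULLFSYNC asks the drive to flush it too. */
        static void flush_durably(int fd) {
        #if defined(__APPLE__)
            if (fcntl(fd, F_FULLFSYNC) == 0) return;
            /* F_FULLFSYNC is not supported by every filesystem; fall through. */
        #endif
            if (fsync(fd) != 0) {
                perror("fsync");
                exit(EXIT_FAILURE);
            }
        }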

    Disk sector writes are also not atomic. For example, an Advanced Format 4096-byte sector write to a disk with an emulated logical sector size of 4096 bytes but a physical sector size of 512 bytes would be split into 8 physical sector writes, each of which may or may not be atomic.
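
    On Linux you can see this mismatch directly by asking the block layer for both sector sizes. A small C sketch (it assumes a hypothetical device path /dev/sda and permission to open it):

        #include <fcntl.h>
        #include <linux/fs.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <unistd.h>

        /* Print a block device's logical (addressable) and physical sector sizes.
         * When they differ, one "logical" sector write may be split by the drive
         * into several physical sector writes, each of which may tear. */
        int main(void) {
            int fd = open("/dev/sda", O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            int logical = 0;
            unsigned int physical = 0;
            if (ioctl(fd, BLKSSZGET, &logical) != 0) { perror("BLKSSZGET"); return 1; }
            if (ioctl(fd, BLKPBSZGET, &physical) != 0) { perror("BLKPBSZGET"); return 1; }

            printf("logical sector size:  %d bytes\n", logical);
            printf("physical sector size: %u bytes\n", physical);
            close(fd);
            return 0;
        }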

    The Linux kernel page cache is not always reliable and may misrepresent the state of data on disk after an EIO or latent sector error. Likewise, filesystem metadata as presented by the kernel (such as the size of the log file as stored in the inode) is not always reliable and may change at any time, since the inode itself can be corrupted on disk. Critical metadata such as the log file size should therefore be stored redundantly and out of band, rather than trusting inodes or the filesystem; essentially, you need to treat the disk as a block device.
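
    As a rough illustration of the out-of-band idea (a hypothetical layout with made-up names, not TigerBeetle’s format): keep the log size in several checksummed superblock copies at fixed offsets, and on startup pick the newest copy whose checksum verifies, ignoring whatever the inode claims.

        #include <stddef.h>
        #include <stdint.h>
        #include <zlib.h>   /* crc32(); a real system would use a stronger checksum */

        /* Critical metadata stored out of band in multiple copies on disk,
         * rather than being inferred from the inode or the filesystem. */
        struct superblock {
            uint64_t magic;     /* identifies the format */
            uint64_t version;   /* bumped on every update */
            uint64_t log_size;  /* the authoritative log size */
            uint32_t checksum;  /* covers the fields above; set before writing */
        };

        static uint32_t superblock_crc(const struct superblock *s) {
            /* Checksum everything up to (but not including) the checksum field. */
            return (uint32_t)crc32(0L, (const unsigned char *)s,
                                   (unsigned int)offsetof(struct superblock, checksum));
        }

        /* Pick the newest valid copy; returns 0 if every copy is corrupt. */
        static int superblock_pick(const struct superblock *copies, int count,
                                   struct superblock *out) {
            int found = 0;
            for (int i = 0; i < count; i++) {
                if (superblock_crc(&copies[i]) != copies[i].checksum) continue;
                if (!found || copies[i].version > out->version) {
                    *out = copies[i];
                    found = 1;
                }
            }
            return found;
        }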

    However, the solution to most of these problems is “use checksums,” which is not that exciting, so I’m omitting it from this discussion.

    A better solution I believe is simply to start with a clearly defined storage fault model and to lean on the storage fault research that’s out there. In fact, the research suggests that systems that reduce the problem space largely to checksums do not tend to get storage faults right, and checksums (while critical) are the least of the solution.

    In fact, once you have a clear model for safety, a large part of the solution is to use O_DIRECT [3] and to design protocols (and the whole system) so that they are storage fault-aware [4], even as far out as the distributed consensus layer. Local storage faults propagate across distributed systems and break the formal proofs of correctness for protocols such as Paxos and Raft [4], because those proofs assume that stable storage is non-Byzantine, so replicated redundancy does not in fact imply fault-tolerance [5]. A single disk sector failure on one machine can result in global data loss or cluster unavailability.
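
    Returning to the O_DIRECT point, the mechanics look roughly like this Linux-specific C sketch (not TigerBeetle’s code; the file name and the 4096-byte sector size are assumptions, and a real system would query the device): the buffer, offset, and length must all be sector-aligned.

        #define _GNU_SOURCE     /* for O_DIRECT on Linux */
        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>

        enum { sector_size = 4096 };

        int main(void) {
            /* O_DIRECT bypasses the kernel page cache, so what we read back
             * reflects the disk rather than a possibly-stale cache. */
            int fd = open("log.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
            if (fd < 0) { perror("open"); return 1; }

            /* Direct I/O requires sector-aligned memory, offsets, and lengths. */
            void *buffer = NULL;
            if (posix_memalign(&buffer, sector_size, sector_size) != 0) return 1;
            memset(buffer, 0, sector_size);
            strcpy(buffer, "hello, direct I/O");

            if (pwrite(fd, buffer, sector_size, 0) != sector_size) { perror("pwrite"); return 1; }

            /* O_DIRECT does not flush the drive's own write cache or the file's
             * metadata, so an explicit fsync() is still required for durability. */
            if (fsync(fd) != 0) { perror("fsync"); return 1; }

            free(buffer);
            close(fd);
            return 0;
        }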

    One also has to be careful to disentangle crashes from corruption when recovering from the log, in order to know whether to truncate or to repair; the checksum alone can’t make this distinction [4]. Care also needs to be taken when handling read/write syscall return values that indicate partial sector input/output, especially where this interacts with different physical/logical sector sizes [6]. Finally, successful log recovery requires being able to read “around” faulty sectors, to preserve durability and surface as much data as possible to higher layers [7].
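
    On the partial-I/O point: pread() may legitimately return fewer bytes than requested, so sector reads need a retry loop rather than a single call. A small C sketch (the helper name is made up):

        #include <errno.h>
        #include <stdbool.h>
        #include <stddef.h>
        #include <sys/types.h>
        #include <unistd.h>

        /* Read exactly `size` bytes at `offset`, retrying partial reads and EINTR.
         * Returns false on error or early end-of-file; the recovery path then has
         * to decide whether it is looking at a torn write or at corruption. */
        static bool read_exact(int fd, void *buffer, size_t size, off_t offset) {
            size_t done = 0;
            while (done < size) {
                ssize_t n = pread(fd, (char *)buffer + done, size - done,
                                  offset + (off_t)done);
                if (n < 0) {
                    if (errno == EINTR) continue;
                    return false;           /* e.g. EIO: a latent sector error */
                }
                if (n == 0) return false;   /* unexpected end of file */
                done += (size_t)n;
            }
            return true;
        }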

    Checksums also need to be daisy-chained or hash-chained, as ZFS does, all the way back to the copy-on-write superblock; otherwise a read may simply return the wrong block without any checksum failure. There also needs to be proactive scrubbing of disks to detect and repair storage faults before they take out all copies: checksums are of limited use if there’s no scrubbing, and thus no fault detection, in the first place.
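
    A toy version of the chaining idea (hypothetical structures, not ZFS’s or TigerBeetle’s on-disk format): the parent’s pointer to a block carries the checksum that block must hash to, so a misdirected read that returns some other, internally consistent block is still detected.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        /* A parent block references each child by address *and* expected checksum,
         * so verification happens at the pointer, not only inside the block. */
        struct block_ptr {
            uint64_t address;   /* where the child block lives on disk */
            uint64_t checksum;  /* checksum the child's contents must hash to */
        };

        /* Placeholder 64-bit FNV-1a; a real system would use a strong checksum. */
        static uint64_t checksum64(const void *data, size_t size) {
            const uint8_t *bytes = data;
            uint64_t hash = 0xcbf29ce484222325ULL;
            for (size_t i = 0; i < size; i++) {
                hash ^= bytes[i];
                hash *= 0x100000001b3ULL;
            }
            return hash;
        }

        /* After reading the child block from ptr->address, check it against the
         * checksum stored in the parent: a wrong-but-valid block still fails here. */
        static bool verify_child(const struct block_ptr *ptr,
                                 const void *child, size_t size) {
            return checksum64(child, size) == ptr->checksum;
        }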

    > Cool, our hypothesis has been validated. This should be easy to fix: if we sync the log when we open it, we should guarantee that no unsynced reads are ever served.

    Here’s an example of this in Zig [8] that Alex Miller of FoundationDB pointed out to me.

    > If we want to improve our throughput, we need to find a way to turn multiple logical writes (Commands) into a single physical “batch.”

    This is also our own experience, and it is how TigerBeetle is able to process more than a million financial transactions a second. We reduce the problem to a matter of how fast we can write sequentially to disk and how fast we can send data across the network, while maintaining stability and working around slowly failing hardware.
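
    A toy sketch of the batching idea (hypothetical structures and names, not TigerBeetle’s format): many logical commands are appended to one in-memory batch, which is then made durable with a single sequential write and a single fsync.

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>

        enum { batch_capacity = 1 << 20 };  /* 1 MiB of commands per physical write */

        /* Accumulates many logical writes so that they cost one physical write. */
        struct batch {
            uint8_t buffer[batch_capacity];
            size_t  size;
        };

        /* Append one command; returns 0 if the batch is full and must be flushed. */
        static int batch_append(struct batch *b, const void *command, size_t size) {
            if (b->size + size > batch_capacity) return 0;
            memcpy(b->buffer + b->size, command, size);
            b->size += size;
            return 1;
        }

        /* One write() and one fsync() cover every command in the batch, so the
         * cost of durability is amortized. (A real implementation would retry
         * partial writes and use O_DIRECT with aligned, padded batches.) */
        static int batch_flush(struct batch *b, int log_fd) {
            ssize_t written = write(log_fd, b->buffer, b->size);
            if (written < 0 || (size_t)written != b->size) return 0;
            if (fsync(log_fd) != 0) return 0;
            b->size = 0;
            return 1;
        }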

    The easier way to think about this is that the MEMTABLE (or some other indexed structure) is the afterthought. It’s just the cache in front of where the real action is: the log.

    This is where TigerBeetle is now. We literally have a deterministic MEMTABLE (deterministic so that we can do autonomous testing, à la FoundationDB) in front of the consensus log. With every consensus operation we ratchet the LSM-Tree forward a little, deterministically, so that compaction is incremental and streaming, without any foreground latency stalls and without the need for write throttling [9].
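
    A deliberately tiny sketch of the pacing idea (not TigerBeetle’s implementation; the names and the work-unit budget are made up): each operation advances compaction by a bounded amount of work, so compaction keeps up with ingest without foreground stalls.

        #include <stdint.h>

        enum { compaction_budget_per_op = 4 };  /* work units per operation */

        struct compaction {
            uint64_t done;   /* work units completed so far */
            uint64_t total;  /* work units this compaction needs in total */
        };

        /* Called once per committed operation: do a little compaction work,
         * deterministically, instead of letting it pile up into a stall. */
        static void compaction_tick(struct compaction *c) {
            uint64_t budget = compaction_budget_per_op;
            while (budget > 0 && c->done < c->total) {
                /* merge the next few table blocks here (elided) */
                c->done += 1;
                budget -= 1;
            }
        }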

    We’re coding this live on Twitch every day (in Zig!) if you’d like to drop by and chat!

    [1] https://www.tigerbeetle.com

    [2] fsync on macOS flushes to disk cache only — https://twitter.com/TigerBeetleDB/status/1422491736224436225

    [3] “Can Applications Recover from fsync Failures?” — https://www.usenix.org/system/files/atc20-rebello.pdf

    [4] “Protocol-Aware Recovery for Consensus-Based Storage” — https://www.usenix.org/system/files/conference/fast18/fast18-alagappan.pdf

    [5] “Redundancy Does Not Imply Fault-Tolerance” — https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan

    [6] Partial logical sector reads/writes even when using O_DIRECT — https://github.com/coilhq/tigerbeetle/blob/main/src/storage.zig#L40-L51

    [7] Reading around faulty sectors — https://github.com/coilhq/tigerbeetle/blob/main/src/storage.zig#L141-L147

    [8] Opening a file safely — https://github.com/coilhq/tigerbeetle/blob/main/src/storage.zig#L390-L394

    [9] Introducing TigerBeetle’s LSM-Forest — https://www.youtube.com/channel/UC3TlyQ3h6lC_jSWust2leGg