1. 22
  1. 7

    Not quite through, but I was expecting a quicker read and was pleasantly surprised to find something so thorough! Won’t have time to finish it tonight :p

    Some thoughts; for one, “The Final Word” might be overstating things:

    • Cluster FSes highlight a uniquely different paradigm and approach (perhaps as the EVM does to other languages and runtimes);
    • BTRFS flaunts staggering flexibility over varying disk sizes, configurations, and migrations (re-compress, or convert between RAID paradigms, online, on the fly!);
    • and tbh I don’t know the first thing about bcachefs beyond that it’s the new kid on the block, but I trust that the folks behind it have put their own spin on things (as anyone does when creating!).

    I point these out not to downplay ZFS’s own lovable quirks and the revolutionary impact they’ve had (notably on the lineage of these very systems), but to highlight these and future projects. It’s too soon to underscore “The Final Word”. That said, ZFS is still so worthy of our attention and appreciation! The author has clearly built a fantastic model and understanding of how the system works; I learned much more than I was ready for c:


    One particular section caught my eye:

    If you have a 5TiB pool, nothing will stop you from creating 10x 1TiB sparse zvols. Obviously, once all those zvols are half full, the underlying pool will be totally full. If the clients connected to the zvols try to write more data (which they might very well do because they think their storage is only half full) it will cause critical errors on the TrueNAS side and will most likely lead to extensive pool corruption. Even if you don’t over-commit the pool (i.e., 5TiB pool with 5x 1TiB zvols), snapshots can push you above the 100% full mark.

    I thought to myself, “ZFS wouldn’t hit me with a footgun like that; it knows data loss is code RED”. While “extensive pool corruption” might be overzealous, it is a sticky situation. The clients in this case are the filesystems populating the zvols, which are prepared to run out of space at some point, but not for the disk to fail out from under them. Snapshots do provide protection/recovery paths from this, but they also lower the “over-provisioning threshold”. It was the phrase “pool corruption” that concerned me, and I couldn’t find any evidence of that actually happening. It would obviously still be a disruptive monitoring failure in production, and might best be avoided by precautionary under-provisioning, which is a shame, but then even thick provisioning has a messy relationship with snapshots per the following section.
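
    To make the arithmetic concrete, here’s a rough sketch of the over-commit check I’d want monitoring to run (plain Python, not ZFS tooling; the 20% snapshot headroom is just a number I picked):

        def overcommit_report(pool_tib, zvol_volsizes_tib, snapshot_reserve=0.2):
            # Compare logical zvol capacity (what the clients believe they have)
            # against pool capacity, minus some headroom for snapshots.
            # Illustrative arithmetic only, not real ZFS space accounting.
            logical = sum(zvol_volsizes_tib)
            usable = pool_tib * (1 - snapshot_reserve)
            ratio = logical / usable
            print(f"logical {logical:.1f} TiB vs usable {usable:.1f} TiB ({ratio:.2f}x)")
            if ratio > 1:
                print("over-committed: clients can collectively run the pool out of space")

        # The article's example: 10x 1 TiB sparse zvols on a 5 TiB pool.
        overcommit_report(5, [1.0] * 10)

    Nothing fancy, but it’s the kind of number I’d want an alert on long before anything hits 100%.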

    I’m not sure any known system addresses this short of monitoring/vertical integration. I guess it’s great that when BTRFS fills up you can just plug in a USB drive to fluidly expand the available space, halt further corruption, and degrade gracefully. Not that BTRFS’s own relationship with free space is unblemished, but this does work!
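
    For reference, that rescue is roughly the following (a sketch that just shells out to the btrfs tool; the device and mount point are placeholders, and you’d obviously triple-check the device name before running anything as root):

        import subprocess

        def grow_btrfs(new_device="/dev/sdX", mountpoint="/mnt/data"):
            # Add the new device to the existing filesystem so new allocations
            # can land on it right away...
            subprocess.run(["btrfs", "device", "add", new_device, mountpoint], check=True)
            # ...then rebalance so existing block groups spread across devices.
            subprocess.run(["btrfs", "balance", "start", mountpoint], check=True)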

    Probably a viable approach in ZFS too (a single-device vdev?), but BTRFS really shines in its UI, responsiveness, and polish during exactly these sorts of migrations, which I’d find relieving in an emergency. ZFS has a lot of notorious pool expansion pitfalls, addressed here, which I also wouldn’t have to think about (even if just to dismiss them as inapplicable under the circumstances, because they relate to vdevs). It matters that I only think ZFS can do it but know that BTRFS can; its flexibility is reassuring. (Again, not a dig; I still go to great lengths to use ZFS everywhere, and for all that I don’t run butter right now :p)


    Thinking about it more, this is probably because I’ve recreated BTRFS pools dozens of times, whereas ZFS pools are more static and recreating them is often kinda intense. It’s like BTRFS is declarative, like Nix, allowing me to erase my darlings and become comfortable with a broader range of configurations by being less attached to the specific setup I have at any given time.

    1. 5

      Live replication and sharing are both definitely missing from ZFS, though I can see how they could be added (to the FS, if not to the code). Offline deduplication is the other big omission and that’s hard to add as well.

      For cloud scenarios, I wish ZFS had stronger confidentiality and integrity properties. The encryption, last time I looked, left some fairly big side channels open and leaked a lot of metadata. Given that the core data structure is basically a Merkle tree, it’s unfortunate that ZFS doesn’t provide cryptographic integrity checks on a per-pool and per-dataset basis. For secure boot, I’d love to be able to embed my boot environment’s root hash in the loader and have the kernel just check the head of the Merkle tree.
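
      To illustrate what I mean by checking the head of the Merkle tree, here’s a toy sketch (Python with hashlib; this is the generic construction, not ZFS’s actual on-disk checksum layout, and the pinned root is hypothetical):

          import hashlib

          def merkle_root(leaves):
              # Hash each block, then pairwise-hash up to a single root (SHA-256).
              level = [hashlib.sha256(leaf).digest() for leaf in leaves]
              while len(level) > 1:
                  if len(level) % 2:
                      level.append(level[-1])  # duplicate the odd node out
                  level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                           for i in range(0, len(level), 2)]
              return level[0]

          # Hypothetical: the loader pins this root hash at install time.
          PINNED_ROOT = merkle_root([b"block0", b"block1", b"block2"])

          def verify(blocks):
              # Refuse to proceed if the recomputed root doesn't match the pinned one.
              return merkle_root(blocks) == PINNED_ROOT

          print(verify([b"block0", b"block1", b"block2"]))    # True
          print(verify([b"block0", b"tampered", b"block2"]))  # False

      The appeal is that one 32-byte value at the top transitively covers everything underneath it.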

      1. 4

        Yeah, the hubris of the “last word in filesystems” self-anointment always struck me as fairly staggering. While perhaps not exceeded so quickly and dramatically, it’s a close cousin of “640K ought to be enough for anybody”. Has any broad category of non-trivial software ever been declared finished, with some flawless shining embodiment never to be improved upon again? Hell, even (comparatively speaking) laughably simple things like sorting aren’t solved problems.

        1. 4

          Yeah, the hubris of the “last word in filesystems” self-anointment always struck me as fairly staggering.

          It’s just because the name starts with Z, so it’s always alphabetically last in a list of filesystems.

          1. 3

            Be right back, implementing Öfs.

            1. 1

              Or maybe just Zzzzzzzzzzzfs!

          2. 1

            Yeah, the hubris of the “last word in filesystems” self-anointment always struck me as fairly staggering. While perhaps not exceeded so quickly and dramatically, it’s a close cousin of “640K ought to be enough for anybody”.

            Besides the name starting with Z, which @jaculabilis mentioned, I suspect it was also in reference to ZFS being able to (theoretically) store 2^137 bytes worth of data, which really ought to be enough for anybody.

            Because storing 2^137 bytes worth of data would necessarily require more energy than that needed to boil the oceans, according to one of the ZFS creators [1].

            [1] https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/
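
            If I remember the original reasoning right, the back-of-the-envelope goes something like the sketch below; the ~1e31 bits/kg figure is Lloyd’s bound for an ideal 1 kg computer, the ocean figures are rough, and the whole thing is a thermodynamic lower bound rather than a practical estimate:

                # Very rough order-of-magnitude arithmetic, as I understand the argument.
                bits          = 2 ** 140             # 2^137 bytes in a fully populated 128-bit pool
                min_mass_kg   = bits / 1e31          # ~1e31 bits/kg limit for an ideal computer
                rest_energy_j = min_mass_kg * (3e8) ** 2   # E = m * c^2

                ocean_mass_kg = 1.4e21               # approximate mass of Earth's oceans
                boil_j        = ocean_mass_kg * 2.6e6      # heat to 100 C plus vaporize, ~J/kg

                print(f"pool:   ~{rest_energy_j:.1e} J")   # ~1.3e28 J
                print(f"oceans: ~{boil_j:.1e} J")          # ~3.6e27 J

            On those numbers the pool comes out a few times bigger, so the quip holds, at least at the limits of physics.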

          3. 1

            Hasn’t BTRFS’s RAID support been unstable for ages? Is it better now?

            1. 4

              AFAIK the RAID5/6 write hole still exists, so that’s the notorious no-go; I’ve always preferred RAID10 myself. It does kinda put a damper on that declarative-freedom aspect if mirrors are the only viable stable configuration, but the spirit is still there in the workflow and utilities.

            2. 1

              Article author here, I made an account so I could reply. I appreciate the kind words, I put a lot of effort into getting this information together.

              After talking to some of the ZFS devs, I’m going to clarify that section about overprovisioning your zvols. I misunderstood some technical information and it turns out it’s not as bad as I made it out to be (but it’s still something you want to avoid).

              The claim that ZFS is the last word in file systems comes from the original developers. I added a note to that effect in the first paragraph of the article. I have more info about what (I believe) they were getting at in one of the sections towards the end of the article: https://jro.io/truenas/openzfs/#final_word

              I’m obviously a huge fan of ZFS but I’ll admit that it’s not the best choice for every application. It’s not super lean and high-performance like ext4, it doesn’t (easily) scale out like Ceph, and it doesn’t offer the flexible expansion and pool modification that UNRAID does. Despite all that, it’s a great all-round filesystem for many purposes and is far more mature than something like BTRFS. (Really, I just needed a catchy title for my article :) )

              1. 1

                Which Cluster FS are you referring to? Google returns a lot of results.

                1. 1

                  The whole field! Ceph, Gluster, & co.