1. 16
    1. 62

      Please don’t pay attention to this article, it’s almost completely wrong. I think the author doesn’t know what he’s talking about. I will go point by point:

      • Out-of-tree and will never be mainlined: Please don’t use zfs-fuse, it’s long been abandoned and OpenZFS is better in every respect. The rest of the points I guess are true (or might be, eventually, sure).
      • Slow performance of encryption: This seems to be completely wrong. I believe OpenZFS re-enabled vector instructions by shipping its own implementation of the kernel FPU save/restore code that modules can no longer use. For example, see https://github.com/openzfs/zfs/pull/9749, which was merged many months after the Linux kernel stopped exporting the routines that made vector instructions usable from modules.
      • Rigid: This was done deliberately so people like the author don’t shoot themselves in the foot. It would actually have been easier to make the vdev hierarchy more flexible, but ZFS is more strict on purpose, so users don’t end up with bad pool configurations.
      • Can’t add/remove disks to RAID: I guess this is still true? I’m not entirely sure because I’m not following OpenZFS development closely nor do I use RAID-Z.
      • RAID-Z is slow: As far as I know this is correct (in terms of IOPS), so RAID-Z pools are more appropriate for sequential I/O rather than random I/O.
      • File-based RAID is slow: OpenZFS can now do scrubs and resilvers (mostly) sequentially, so this point is wrong now.
      • Real-world performance is slow: I wouldn’t call it slow, but ZFS can be slower than ext4, sure (but it’s also doing a lot more than ext4, on purpose, such as checksumming, copy-on-write, etc).
      • Performance degrades faster with low free space: The free-space bitmap comment is just weird/wrong, because ZFS actually has more scalable data structures for this than most other filesystems (such as ext4). It might be true that ZFS fragments more around 80% utilization than ext4, but this is probably just a side-effect of copy-on-write. Either way, no filesystem will handle mostly full disks very well in terms of fragmentation, so this is not something specific to ZFS, it’s just how they (have to) work.
      • Layering violation of volume management: This is completely wrong. You can use other filesystems on top of a ZFS pool (using ZVols) and you can use ZFS on top of another volume manager if you want (but I wouldn’t recommend it), or even mix it with other filesystems on the same disk (each on their own partition). Also, you can set a ZFS dataset/filesystem’s mountpoint property to legacy and then use normal mount/umount commands if you don’t like ZFS’s automounting functionality (see the sketch after this list).
      • Doesn’t support reflink: This is correct.
      • High memory requirements for dedupe: The deduplication table is actually not kept in memory (except that a DDT block is cached whenever it’s read from disk, like any other metadata). So, as an example, if you have some data that is read-only (or mostly read-only) you can store it deduped and (apart from the initial copy) it will not be any slower to read than any other data (although modifying or removing this data will be slower if ZFS has to keep reading DDT blocks from disk because they were evicted from cache).
      • Dedupe is synchronous: Sure it’s synchronous, but IOPS amplification will mostly be observed only if the DDT can’t be cached effectively.
      • High memory requirements for ARC: I don’t even know where to begin. First of all, the high memory requirements for the ARC have been debunked numerous times. Second, it’s normal for the ARC to use 17 GiB of memory if that memory isn’t needed for anything else – this is what caches (such as the ARC) are for! The ARC will shrink whenever memory is needed by applications or the rest of the kernel. Third, I use OpenZFS on all my machines, none of them are exclusively ZFS hosts, and there is exactly zero infighting in any of them. Fourth, again, please just ignore zfs-fuse, there is no reason to even consider using it in 2022.
      • Buggy: All filesystems have bugs, that’s just a consequence of how complicated they are. That said, knowing what I know about the ZFS design, code and testing procedures (which is a lot, although my knowledge is surely a bit outdated), I would trust ZFS with my data above any other filesystem, bar none.
      • No disk checking tool: This is actually a design decision. Once filesystems get too large, fsck doesn’t scale anymore (and it implies downtime, almost always), so the decision was made to gracefully handle minor corruption while the machine is running and being used normally. Note that a badly corrupted filesystem will of course panic, as it likely wouldn’t even be possible to recover it anymore, so it’s better to just restore from backups. But you can also mount the ZFS pool read-only to recover any still-accessible data, even going back in time if necessary!
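
      As a small illustration of the automounting remark under the layering point above (a minimal sketch; the pool/dataset names and mountpoint are made up), a dataset set to legacy mode is handled by the ordinary mount tooling:

        # Opt one dataset out of ZFS's automounting
        zfs set mountpoint=legacy tank/data

        # From here on, plain mount/umount (or an /etc/fstab entry) manage it
        mount -t zfs tank/data /mnt/data
        umount /mnt/data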

      In conclusion, IMHO this article is mostly just FUD.

      1. 21

        This is actually a design decision.

        A question on my mind while reading this was whether or not the author knows ZFS well enough to be making some of these criticisms honestly. They seem like they should, or could. I am not attacking their intelligence; however, I would prefer to see a steelman argument that acknowledges the actual reasons for ZFS design choices. Several of the criticisms are valid, but on the topics of fsck, ARC and layering the complaints appear misguided.

        I spent 7 years using the solution they recommend (LUKS+btrfs+LVM) and have been moving to ZFS on all new machines. I’ll make that a separate top-level comment, but I wanted to chime in agreeing with you on the tone of the article.

      2. 7

        I’m not sure the check tool is really not needed. It’s not something I want to run on mount or periodically; I want a “recovery of last resort” offline tool instead, and it doesn’t have to scale because it’s only used when things are down anyway. If there’s enough of a use case to charge for this (https://www.klennet.com/zfs-recovery/default.aspx), there’s enough to provide it by default.

        1. 5

          In general we try to build consistency checking and repair into the main file system code when we can; i.e., when doing so isn’t likely to make things worse under some conditions.

          It sounds like what you’re after is a last ditch data recovery tool, and that somewhat exists in zdb. It requires experience and understanding to hold it correctly but it does let you lift individual bits of data out of the pool. This is laborious, and complicated, and likely not possible to fully automate – which is why I would imagine many folks would prefer to pay someone to try to recover data after a catastrophe.
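
          To give a flavor of what that looks like (a rough sketch only; the pool name, device and offsets are made up, and the exact flags are worth checking against the zdb man page):

            # Dump the on-disk vdev labels to see what state the pool thinks it is in
            zdb -l /dev/sda1

            # Walk datasets and objects read directly from disk, even for a pool that won't import (-e)
            zdb -e -d tank

            # Pull a single block out by vdev:offset:size -- the laborious "last resort" part
            zdb -R tank 0:4000000:20000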

      3. 5

        Dedup does generally have high memory requirements if you want decent performance on writes and deletes; this is a famous dedup limitation that makes it not feasible in many situations. If the DDT can’t all be in memory, you’re doing additional random IO on every write and delete in order to pull in and check the relevant section of the DDT, and there’s no locality in these checks because you’re looking at randomly distributed hash values. This limitation isn’t unique to ZFS, it’s intrinsic in any similar dedup scheme.
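
        If you want a rough sense of whether a pool's DDT would fit in RAM before committing to dedup, the stock tools can estimate it (a sketch; the pool name is made up):

          # Simulate dedup against the existing data and print a would-be DDT histogram
          # (read-only; it does not enable dedup)
          zdb -S tank

          # On a pool that already has dedup enabled, show DDT statistics
          zpool status -D tank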

        A certain amount of ZFS’s nominal performance trouble comes from ZFS doing more random IOs (and from more drives) than other filesystems do. A lot of the stories about these performance issues date from the days when hard drives were dominant, with their very low IOPS figures. I don’t think anyone has done real performance studies in these days of SSDs and especially NVMe drives, but naively I would expect the relative ZFS performance to be much better now, since random IO no longer hurts so much.

        (At work, we have run multiple generations of ZFS fileservers, first with Solaris and Illumos on mostly hard drives and now with ZoL on Linux on SATA SSDs. A number of the performance characteristics that we care about have definitely changed with the move to SSDs, so that some things that weren’t feasible on HDs are now perfectly okay.)

    2. 26

      From 2013 to 2020 I used this article’s recommended solution, in various configurations: mostly LUKS with btrfs, sometimes LVM, sometimes mdadm, sometimes letting btrfs handle it, and all sorts of scripts to handle the rest. Although performance was not always amazing, I never had problems with it. This included years when people would swear btrfs was going to eat my children. It was always a bit of a burden to set up, monitor and maintain, but I got used to it. I use ZFS now, but I still have a few things that I like better about LUKS+btrfs, and some of them are mentioned in this article.

      I can comfortably use either, and I do not feel particularly partisan in the matter.

      What tipped the scales for me was threefold: ZFS’ encrypted send/recv, the ease of tooling around it, and the overall approach to on-line data integrity (as explained by @wizeman). Combined, these now feel like “table stakes” for me to consider a combination storage-and-backup solution.

      To a lesser extent (i.e. not a deciding factor), I have enjoyed things like ZED, which I believe stems from what the article calls a “layering violation”. I hated that at first, honestly. It sounds silly, but after a near-decade of the previous tools, the idea of not lego-ing together block devices in my own bespoke manner was a big mental barrier. In the long run it has not mattered, and seems to be the source of features that I benefit from.

      The only point from the article that I need to individually disagree with is the claim that “checksumming is usually not worthwhile”. I do not agree, not at all. Even with ECC memory, the number of random hardware issues I have seen has taught me that I want every tool imaginable at my disposal, at every link in the chain. This isn’t just about RAM or drives, either: bad SATA cables, overheating HBAs, transient voltage issues on a PCH, oh my! Name any minor component between your CPU and the storage media, and it can go wrong there! For anyone who has not used smartd, I recommend checking that out as well.
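
      In that spirit, here is a minimal smartd sketch (the device, schedule and mail address are placeholders; see smartd.conf(5) for the directive syntax):

        # /etc/smartd.conf
        # Monitor all attributes, enable automatic offline tests and attribute autosave,
        # run a short self-test daily at 02:00 and a long one on Saturdays at 03:00,
        # and send mail when something trips.
        /dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com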

      As a final caveat, my use cases mean I do not use dedup or compression. Whether they are a benefit, a detriment, or are better solved other ways are not part of my calculus. They are not reasons that I use ZFS, but they also are not reasons I would avoid it. It strikes me as flawed reasoning to say “ZFS offers compression, but compression is often not useful” as a reason to avoid ZFS.

    3. 18

      Others have already pointed out the technical problems, but this:

      ZFS is both a filesystem and a volume manager.

      Yes, that was the whole point. Some of its initial designers had specialized in virtual memory subsystems before coming to file systems. You wouldn’t expect to have to configure and mount different sticks of RAM, would you? The idea that the volume manager and the filesystem need to be separate layers is a historical artifact, not anything fundamental. Intel architectures don’t know how to boot off ZFS-managed disks, so you have to have a disk with a partition table to boot off of, but for other disks you don’t even need a partition table. That’s a limitation of the Intel architecture, where ZFS was never a big player. Sun systems knew how to boot off them.

      1. 18

        Since ZFS, precisely one new general-purpose filesystem has made it to the point of being deployed at scale: APFS. APFS has very similar layering to ZFS (though the lower two layers of ZFS are merged into the container layer in APFS). To me, this is very strong evidence that the designers of ZFS had the right ideas.

      2. 10

        I’d like to underline that the layering violation was done in order to close the RAID5 write hole https://en.wikipedia.org/wiki/RAID#Atomicity

    4. 20

      Out-of-tree and will never be mainlined

      Slow performance of encryption

      I think these are problems with Linux, not with ZFS.

      Phoronix benchmarks

      Are not rigorous, and I would not trust them, broadly speaking.

      If you use ZFS’s volume management, you can’t have it manage your other drives using ext4, xfs, UFS, ntfs filesystems

      That’s not true. You can run any FS you like on a zvol.
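
      For example, something along these lines (a sketch; the pool, zvol and mountpoint names are made up):

        # Carve a 100 GiB block device out of the pool and put ext4 on it
        zfs create -V 100G tank/ext4vol
        mkfs.ext4 /dev/zvol/tank/ext4vol
        mount /dev/zvol/tank/ext4vol /mnt/ext4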

      No disk checking tool (fsck)

      Scrub?


      I also find it rather strange to complain about bugs and then recommend btrfs, but maybe that’s just me…

      1. 3

        Out-of-tree and will never be mainlined

        Slow performance of encryption

        I think these are problems with linux, not about zfs.

        I don’t know anything about ZFS encryption speed, but the license is gratuitously not GPL-compatible. Fortunately one may conjoin ZFS and Linux on one’s own machine, but it would be nice were it supportable out of the box. The steps to install with a ZFS root are more complex than is desirable.

        1. 2

          People parrot a lot that it’s (gratuitously, really?) incompatible, but it seems like Canonical, at least, doesn’t agree. As far as I can tell they’re shipping ZFS binaries in their distribution – and more power to them!

          I personally find it hard to imagine how one could conceive of a file system, constrained as it is to a loadable kernel module and developed originally for a wholly unrelated operating system, to be a derived work of Linux. Until someone takes Canonical to court we just don’t know if the two licences are actually incompatible for this particular case.

    5. 8

      Buggy

      No disk checking tool (fsck)

      I was going to do this in order, but let’s skip ahead to these first. I’ve run a couple 10TB+, everyday-use ZoL pools at home for over ten years (since mid-2011), with several disk replacements along the way (some for failed disks, some to increase capacity / move to SSD). It hasn’t lost a single byte of my data in that time. It hasn’t needed a fsck in that time; no crash has ever left it in an inconsistent state that it couldn’t simply reconcile on boot. And because I scrub weekly I have reasonable confidence that it really is all there in one piece. In that regard it’s been better than any other fs I’ve used… particularly btrfs, which took less than one week to make an irrecoverable mess when I put it on a laptop to play around with.
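
      For anyone who wants the same habit, scheduling it can be as simple as a cron entry (a sketch; the pool name and time are placeholders, and newer OpenZFS packages also ship systemd scrub timers you could use instead):

        # /etc/cron.d/zfs-scrub -- kick off a scrub every Sunday at 03:00
        0 3 * * 0  root  /usr/sbin/zpool scrub tank

        # Afterwards, "zpool status -v tank" shows progress and any checksum errors found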

      Out-of-tree and will never be mainlined

      Has not caused me a problem.

      Rigid

      In other words, it only offers sensible options.

      Can’t add/remove disks to a RAID

      Mildly inconvenient, yes.

      RAIDZ is slow

      Has not caused me a problem.

      For operations such as resilvering, rebuilding, or scrubbing, a block-based RAID can do this sequentially, whereas a file-based RAID has to perform a lot of random seeks. Sequential read/write is a far more performant workload for both HDDs and SSDs.

      In recent versions, scrub does mostly sequential work, after doing a bunch of random reads up front to collect necessary metadata. This has made it quite a bit faster.

      It is even worse on SMR drives, and Ars Technica blame the drives when they probably should have blamed ZFS’s RAID implementation.

        SMR is really bad in many use cases, not just ones involving ZFS, and if you’re worried about the kind of performance issues that the earlier points raise, you definitely don’t want to be using SMR at all.

      (Just for comparison, mdadm works perfectly fine with these drives.)

      mdadm is also doing strictly less in terms of validating your data integrity.

      High memory requirements for dedupe

      For all intents and purposes, this online deduplication feature may as well not exist.

        Deduplication generally offers negligible savings unless perhaps you’re storing a lot of VM disk images of the same OS.

      Yup, widely acknowledged. You don’t need it, don’t use it. It’s there because a few people have setups that benefit strongly from it, and there’s no compelling reason to take that away from them. Everyone else… don’t turn it on and it won’t hurt you.

    6. 4

      I agree with wizeman’s comment, but also ZFS is now a reasonable replacement for FAT32. ZFS can run on Windows, macOS, Linux, BSD, etc., and you can format removable drives with ZFS (flash, spinning rust, whatever) and it all just works across all platforms.

      The only real downside is that ZFS is not installed by default on many of these OSes, which makes it a bit harder to use for removable media, but assuming you control the nodes in which you want to use said drives, ZFS is awesome in this use case.
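
      A sketch of that removable-drive workflow (the device, pool name and compatibility value are placeholders, and the compatibility property itself needs a reasonably recent OpenZFS):

        # Create the pool with a conservative feature set so other platforms/versions can import it
        zpool create -o compatibility=openzfs-2.0-linux portable /dev/sdX

        # Always export before unplugging, then import on the other machine
        zpool export portable
        zpool import portable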

      1. 6

        Not yet on OpenBSD, or Net, or I think Dragonfly…

        If it works on Windows and OSX, it’s nothing I’ve heard of beyond simple tests or janky capabilities, which are not, in my mind, worthy of trust just yet.

        1. 7

          I’ve run it on Windows and macOS without issues, though I’m not using ZFS for the root partition either. Certainly those ports are not merged into the main OpenZFS project yet (though both are working on it, as far as I’m aware).

          I agree it’s not on ALL the BSDs. NetBSD has it, to some degree. FreeBSD and HardenedBSD include it and allow its use by default. It is on Solaris and the open versions of that OS.

          OpenBSD will probably never get ZFS, I would guess even if some magic team showed up and delivered amazing quality patches, they still wouldn’t accept it.

          Dragonfly, they might accept patches, I dunno, but I doubt they would do the work themselves, they are focused on HAMMER.

      2. 3

        Remember the days when you could triple boot into Mac/Windows/Linux and symlinks would make your Firefox and Thunderbird profiles show up from a FAT32 shared drive? That was just magic to me. (I only did it with Windows/Linux, but maybe triple boot was possible?)

        Harkening back to that, I always have a pool that I mount in both macOS and Linux. I don’t do the magic symlinks anymore; between my own laziness, the low price of extra storage, and Firefox Sync, it isn’t worth it. Everything else is zsh/vim/tmux dotfiles that translate perfectly.

      3. 2

        Windows? Since when?

        1. 5

          https://openzfsonwindows.org/ and their github

          I haven’t used it on Windows in over a year, but it worked fine for me then. I dunno how long it’s been around… a while?

    7. 4

      Can’t add/remove disks to a RAID

      You can expand a zpool with RAID-Z expansion.
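
      Assuming a release new enough to ship the RAID-Z expansion feature, the sketch is roughly this (pool, vdev and device names are made up):

        # Grow an existing raidz vdev by one disk; existing data is reflowed in the background
        zpool attach tank raidz1-0 /dev/sdX

        # Watch the expansion progress
        zpool status tank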

      By the way, the author sounds like he is in love with LVM and the Linux ecosystem and does not understand ZFS at all … while he also recommends BTRFS on LVM for the same features.

      If you do not know ZFS, do not read it - it will only bring false information into your mind. If you know ZFS then you can go read and laugh to make your mood better :)

    8. 3

      I haven’t decided yet if I agree with the conclusion of this article. I recently set up a new bare-metal server using ZFS on Linux, and the fact that the ARC is separate from the normal Linux page cache was bugging me. I found this article while looking for information on possible implications of that issue. I do like ZFS’s unification of filesystem and volume manager (the “layering violation”), the snapshot support, and of course, the emphasis on data integrity.

      1. 6

        Not sure about Linux but on FreeBSD the buffer cache was extended to allow it to contain externally-owned pages. This means that ARC maintains a pool of pages for disk cache (and grows and shrinks based on memory pressure in the system) and the same pages are exposed to the buffer cache. Prior to this (I think this landed in FreeBSD 10?), pages needed to be copied from the ARC into the buffer cache before they could be used to service I/Os and so you ended up with things being cached twice. Now, ARC just determines which pages remain resident from the storage pool.

        1. 3

          On Linux, the free command counts the ARC under “used”, not “available” like the standard Linux page cache.

          Having read all of the comments here, I think I’ll stick with ZFS. But I still need to find out if there are any unusual high-memory-pressure situations I should watch out for.
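
          If it helps, a quick way to see what the ARC is actually doing on Linux (the paths are ZoL-specific, and arc_summary may or may not be packaged for you):

            # Current ARC size plus its min/max targets, in bytes
            awk '$1 ~ /^(size|c_min|c_max)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats

            # Or, if installed, a friendlier overview
            arc_summary | head -n 40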

          1. 7

            FYI, here’s what I’ve observed regarding memory pressure after many years of running OpenZFS on Linux on multiple machines, each with multiple ZFS pools, some using hard disks and others using SSDs (and some using HDDs with SSDs as caches), and also including ZFS as the root filesystem on all systems:

            • Many years ago I ran into a situation where under heavy memory pressure the ARC would shrink too much. This caused the working set of ZFS metadata and/or commonly accessed data to constantly be evicted from cache, and therefore the system appeared to almost hang because it was too busy re-reading recently-evicted blocks instead of making useful progress. The workaround was simple: set the zfs_arc_min parameter to force arc_c_min to be 1/16th of the system memory (which is 6.25%), rather than the default 1/32nd of system memory (which is 3.125%). As an example, on a Raspberry Pi with 8 GiB of memory I add the zfs.zfs_arc_min=536870912 Linux kernel parameter to my boot/grub config (which sets the minimum ARC size to be 512 MiB rather than the default 256 MiB) and on a server with 128 GiB I add zfs.zfs_arc_min=8589934592 (i.e. 8 GiB rather than the default 4 GiB), etc. (you get the idea). Since I did this, I have never observed this behavior again, even under the same (or different) stressful workloads.
            • On recent OpenZFS versions, on my systems, I’ve observed that the ARC growing/shrinking behavior is flawless compared to a few years ago, even under heavy memory pressure. However, the maximum ARC size is still too conservative by default (50% of memory). This meant that under low memory pressure, the ARC would never grow beyond 64 GiB on my 128 GiB server, leaving almost 64 GiB of memory completely unused at all times (instead of using it as a block cache for more infrequently used data or metadata). So I just add another kernel parameter to set the maximum ARC size to 90% of system memory instead: on my 8 GiB Raspberry Pis I add zfs.zfs_arc_max=7730941132 (i.e. 7.2 GiB) and on my 128 GiB server I add zfs.zfs_arc_max=123695058124 (i.e. 115.2 GiB). Since then, the ARC is free to grow and use otherwise unused memory as a cache (see the sketch after this list).
            • Although strictly not a memory issue, many years ago I also added the zfs.zfs_per_txg_dirty_frees_percent=0 parameter, which disables some throttling that was causing me problems, although I don’t remember the exact details. This might no longer be necessary (not sure).
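
            A minimal sketch of the two equivalent ways to set these (the values are the 128 GiB-server ones mentioned above; adjust them for your own RAM, and rebuild the initramfs if ZFS is loaded from it):

              # Option 1: kernel command line (e.g. GRUB_CMDLINE_LINUX in /etc/default/grub)
              #   zfs.zfs_arc_min=8589934592 zfs.zfs_arc_max=123695058124

              # Option 2: module options in /etc/modprobe.d/zfs.conf
              options zfs zfs_arc_min=8589934592
              options zfs zfs_arc_max=123695058124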

            Apart from that I haven’t observed any other issues. Although please keep in mind that I use zram swap on all my systems and I don’t use dedup, RAID-Z or ZFS encryption (I use LUKS encryption instead), etc. So your mileage may vary depending on your system’s hardware/software config and/or your workloads, especially if they are somewhat extreme in some way.

          2. 1

            If you find out, will you leave a link here or PM me? I too do not have a good mental model here and that slightly worries me.

      2. 2

        and the fact that the ARC is separate from the normal Linux page cache was bugging me

        How separate is it? I know that /proc/sys/vm/drop_caches will drop it in the same manner as the normal page cache.