1. 30

    1. 10

      I recently built a home server and decided to go with ZFS for the first time, having mostly used LVM and traditional RAID.

      ZFS is really fucking awesome. Incredibly simple to get started, easy CLI, Just Works. If you have more than two disks you should be using ZFS.
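      To give a flavour of how simple getting started is, here is a minimal sketch (pool name `tank` and device names are hypothetical; adjust for your disks, and note these commands need root and will destroy data on the named devices):

      ```shell
      # Create a two-disk mirrored pool named "tank"
      zpool create tank mirror /dev/sda /dev/sdb

      # Check pool health
      zpool status tank

      # Create a dataset; it is auto-mounted at /tank/media
      zfs create tank/media
      ```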

      1. 6

        It’s one of the main reasons that I default to FreeBSD. ZFS works out of the box, as root, and other things integrate cleanly with it (e.g. freebsd-update manages ZFS boot environments so you can roll back a failed update). Using systems without ZFS feels very ’90s, and not in a good way.
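        As a sketch of that workflow on FreeBSD (freebsd-update manages boot environments itself; the manual bectl steps here are just to show the moving parts, and the BE name is made up):

        ```shell
        # Snapshot the current root as a new boot environment before updating
        bectl create pre-update

        # Apply the update
        freebsd-update fetch install

        # If it goes wrong: activate the old BE and reboot into it
        bectl activate pre-update && shutdown -r now
        ```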

        1. 4

          If one wants to optimize entirely for the most native ZFS experience, I suppose FreeBSD or Illumos might be the best choice. If one wants Linux (although I’m not saying one should), I can attest that NixOS has very good ZFS support (out of the box, as the root filesystem; at least one major NixOS developer uses it) and that ZFS (as the root filesystem) also (at least c. 2015) worked well on Gentoo (one Gentoo developer also being a ZFS developer), although “out of the box” is not an adjectival phrase to apply to Gentoo. Among more mainstream distros, I hear Ubuntu also has ZFS support out of the box, although I haven’t used it.

          1. 7

            (out of the box, as the root filesystem; at least one major NixOS developer uses it) and that ZFS (as the root filesystem) also (at least c. 2015)

            Can confirm that ZFS is generally quite popular in the NixOS community, as you just need to set a few flags for support and you get kernels with ZFS from the binary caches without having to fiddle with DKMS or the like. It also works well with NixOS rollback mechanisms, as you can snapshot your stateful directories before and/or after an upgrade. Enabling things such as auto-scrubbing and auto-snapshotting is also just a NixOS option away, each.
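            A minimal sketch of what those flags look like in a NixOS configuration (option names as I understand them from recent releases; the hostId value is a placeholder):

            ```nix
            {
              boot.supportedFilesystems = [ "zfs" ];
              networking.hostId = "8425e349";        # required by ZFS; any unique 8 hex digits
              services.zfs.autoScrub.enable = true;  # periodic pool scrubbing
              services.zfs.autoSnapshot.enable = true;
            }
            ```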

          2. 2

            Adding another datapoint: ZFS on NixOS works great for making boring network storage appliances and NASes.

        2. 4

          That is a funny use case for ZFS (not wishing to feel ’90s). ;-)

          1. 4

            Well, the primary use case for ZFS is feeling smug, the secondary use case is preventing data loss. Not living in the ‘90s is just a nice bonus.

      2. 3

        For me it’s the most reasonable choice with even a single disk – comment committed from a ThinkPad T14s running FreeBSD 15-CURRENT

        1. 2

          I agree; since I started using ZFS I’ve used it on all my computers with FOSS OSes, all of which have one disk each (comment submitted from a phone with no disk and no ZFS).

    2. 3

      I ran root ZFS on Ubuntu after having too many data-loss issues with btrfs. It went great. I’ve only had one drive ever fail on me, and the user experience of detecting that failure and replacing the drive was actually really nice.

      My only caveat is that for the current generation of NVMe PCIe 4 drives it’s really easy for the CoW, encryption, or compression to become a bottleneck. Even hardware-accelerated AES instructions have trouble keeping up with the throughput of modern storage. So for that reason I don’t currently run ZFS on my laptop’s root volume.

      1. 3

        I’ve not had that experience with NVMe, tbh; I can saturate a 4 GB/s NVMe with ZFS, you just have to tune it to handle such fast disks correctly, e.g. by turning down the in-RAM ZIL size a bit. For encryption on fast NVMes, if possible I would just use the drive’s own encryption mode, with the TPM thrown in if possible; if the NVMe can do Opal, that should handle it.

        My bigger gripe is that ZFS has its pitfalls: if you don’t tune or handle it right, it becomes very unpleasant very quickly.

        1. 1

          Fair enough. I spent a day or so playing with it but am by no means an expert. IIRC I did try having a ZIL and an L2ARC. I don’t think I tuned the in-RAM ZIL explicitly. I did actually end up using the drive’s encryption mode, though.

          1. 1

            If you don’t have a separate log (SLOG) device, the ZIL lives on the main pool. It is a write-only log in normal operation, used to make sync writes crash-safe (a sync write is confirmed once it hits the ZIL, whether on a dedicated log device or on the pool itself; otherwise it would have to wait for the data to land in its final place on the array), which can lead to write amplification. Disabling or down-sizing the ZIL can help performance by avoiding this double write or limiting how much of it there is.

            ZFS is one of those things where you basically have to read through a billion pages of documentation about the options before you can even begin to use it efficiently (which is bad).

    3. 2

      As a user, why should I care about ZFS? I mean, as long as I can write and read my files I’m happy. Why should I care about the specific implementation details of my filesystem?

      (To be clear, I am not trying to talk down this post or ZFS enthusiasts, this is a genuine question.)

      1. 8

        One reason is that it does end-to-end checksums of your data and periodically scans it all. If anything in the system corrupts data on its way to the disk (or afterward), it will be caught, and — assuming you have set up redundancy of some sort — repaired.

        My photo archives are in ZFS with triple mirrored disks, and even though it may be 10 years before I want to look at a file next, I’m pretty sure it will be there.
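        The periodic scan is a scrub; as a sketch (hypothetical pool name `tank`, and scrubs are usually scheduled automatically rather than run by hand):

        ```shell
        # Walk every block in the pool and verify its checksum,
        # repairing from redundancy where possible
        zpool scrub tank

        # Review scan progress, repaired bytes, and any errors found
        zpool status -v tank
        ```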

      2. 4

        So that you can reliably read and write your files. And detect when you can’t (when the disk starts corrupting data).

      3. 4

        The biggest user-visible feature is constant-time snapshots. If you set the right config option, these are mounted in .zfs in the filesystem’s root. I normally set up FreeBSD machines with each home directory as a separate ZFS filesystem and with delegated administration so that users can snapshot and delete snapshots for their own home directory. This means that you can create arbitrary snapshots and extract files from them, so you can run rm -rf in your home directory without fear: if you did it in the wrong place, you can always undo it. On top of this, there are a few things that may or may not matter, depending on what kind of user you are:

        • Creating filesystems with different properties (including different backup policies) is very easy. I have a builds filesystem mounted in my home directory. This has sync behaviour turned off, which may cause data loss if the machine crashes, but gives better write performance. As a separate filesystem, it’s also trivial to exclude from backups. Everything in there is a build product, so can be recreated from the source. It’s also not part of snapshots of my home directory, so I don’t waste disk space with old versions of .o files.
        • You can do complete and incremental backups of a filesystem with zfs send and zfs receive. These work with snapshots, so you create snapshots and can then send them as deltas to replicate the filesystem elsewhere. You can script this kind of thing yourself or use something off the shelf like zrepl that gradually ages snapshots (deleting intermediate ones) and replicate them on another machine. ZFS also supports bookmarks for the remote replication case. They are metadata-only snapshots, which let a send stream identify which blocks have changed but don’t keep the old versions around and so are not usable for rollback and so take very little disk space. Oh, and you can atomically snapshot a set of filesystems, if you need consistency across mount points.
        • If you use containers, both Podman and containerd have a ZFS snapshotter, which extracts each image layer and snapshots the result. This gives block-level sharing for things that are the same in the parent layer but (unlike layered FS approaches) doesn’t incur overhead when looking up files in a directory. You can also enable deduplication if you want a bit more space saving at the expense of performance.
        • At least on FreeBSD and Illumos, snapshots and rollbacks are integrated with system updates via boot environments. Boot environments snapshot the root filesystem and let you select an alternative one in the loader. If an update goes horribly wrong, you can roll back and try again without losing data.
        • As far as I know, ZFS is still best in class for data preservation. Every block has an on-disk hash that lets the filesystem detect single-block corruption (according to Google’s published data, that happens periodically with modern disks). On a single-disk install, you can set copies=2 or 3 for high-importance filesystems (e.g. your home directory) so single-block failures will be recoverable. For larger installs, the replication layer knows about the on-disk data, so recovering from a failed disk requires copying only the existing data, not all the data that might be there. RAID-Z is like RAID-5 but without the write hole (with RAID-5, a crash after writing only one or two disks in a three-disk set leaves a block that fails checksums with no way to tell which disk is wrong; with ZFS, that write is all transactional).
        • Transparent compression with zstd or lz4 typically gives a double-digit percentage space saving with no performance cost (even with SSDs and a slow CPU, the CPU is much faster than the disk).

        Once you get used to O(1) snapshots and trivial replication, it’s hard to go back. APFS and bcachefs also have many of these advantages, with different emphasis and different levels of maturity.
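        A sketch of the delegated administration and replication pieces described above (pool, dataset, user, host, and snapshot names are all hypothetical):

        ```shell
        # Let user alice snapshot and roll back her own home dataset
        zfs allow alice snapshot,destroy,mount tank/home/alice

        # Full replication to another machine via a snapshot
        zfs snapshot tank/home/alice@mon
        zfs send tank/home/alice@mon | ssh backup zfs receive pool/alice

        # Then incremental: send only the blocks changed since @mon
        zfs snapshot tank/home/alice@tue
        zfs send -i @mon tank/home/alice@tue | ssh backup zfs receive pool/alice
        ```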

        1. 2

          “constant-time” is rather underselling the snapshotting speed: it’s practically instant!

          1. 4

            Constant time is important because it’s not just instant for a small dataset, it’s the same speed for a dataset with 10 TiB of data in it and the same for a filesystem that already has 100 snapshots (this has the downside that it’s easy to forget you have thousands of snapshots that you don’t really need anymore taking up disk space - I realised last week that my NAS has about 3 TiB of snapshots of old things on it).
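            For finding forgotten snapshots like that, something along these lines works (dataset names here are hypothetical):

            ```shell
            # List snapshots with the space each one holds, largest last
            zfs list -t snapshot -o name,used,creation -s used

            # Per-dataset total of space held only by snapshots
            zfs list -o name,usedbysnapshots
            ```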

            A lot of non-CoW filesystems provide snapshotting by adding a layer of indirection: once a filesystem has snapshots, lookups have to go and walk previous snapshots to find missing things. This works in a similar way to union filesystems: you write things on the top layer, and if they’re not there then you walk down the layers until you find them. This is fine for a small number of snapshots but it rapidly becomes painful for larger numbers, and so you have use cases that gradually degrade performance until it’s unusable. With ZFS, snapshots are built on top of the CoW semantics, and so a block referenced by ten snapshots, three clones, and the original filesystem is just a block, and finding it is no harder than finding a block that is unique to a single filesystem.

        2. 1

          Great answer, thank you. O(1) (I guess this refers to time complexity, so this means instantly in practice?) snapshots sound like quite a feat and a great tool.

          so you can run rm -rf in your home directory without fear: if you did it in the wrong place, you can always undo it.

          Well, only if you happened to make a snapshot right before running the command, no?

          1. 1

            Well, only if you happened to make a snapshot right before running the command, no?

            Yup, though if you set up regular snapshots you have bounded loss. I’ve seen some people use a decay pattern that keeps one snapshot per minute for the last hour, one per hour for the last day, and one per day for the last month. If you do this, the most you can lose is a minute’s worth of work. I generally don’t, and just get into the habit of taking a snapshot before I do any cleanup and then deleting the snapshot later. You need to be careful doing that if you’re low on disk space, because deleting files that are in a live snapshot doesn’t free any space (the snapshot keeps those blocks alive).
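            The snapshot-before-cleanup habit, as a sketch (hypothetical dataset name, and assuming snapshots are visible under the .zfs directory):

            ```shell
            # Take a safety snapshot before cleaning up
            zfs snapshot tank/home/me@pre-cleanup

            rm -rf ~/old-project   # ...oops, wrong directory?

            # Recover the tree from the read-only snapshot view
            cp -a /tank/home/me/.zfs/snapshot/pre-cleanup/old-project ~/

            # Once you're sure nothing else is missing, drop the snapshot
            zfs destroy tank/home/me@pre-cleanup
            ```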

            1. 1

              I’ve seen some people use a decay pattern that keeps one snapshot per minute for the last hour, one per hour for the last day and one per day for the last month.

              As a point of comparison, the defaults on NixOS (if one turns on auto-snapshotting) are to keep one snapshot per quarter-hour for the last hour, one per hour for the last day, one per day for the last week, one per week for the last ~month, and one per month for the last year — so sparser coverage (and less keeping the disk awake) in the short term and more coverage (and more space usage) in the long term.
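              In configuration terms that schedule looks roughly like this (a sketch; the counts shown are my understanding of the NixOS defaults, so double-check against the options reference):

              ```nix
              services.zfs.autoSnapshot = {
                enable = true;
                frequent = 4;   # 15-minute snapshots kept for the last hour
                hourly = 24;    # one per hour for the last day
                daily = 7;      # one per day for the last week
                weekly = 4;     # one per week for about a month
                monthly = 12;   # one per month for the last year
              };
              ```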