1. 14
    1. 4

      This post obliquely hints at two of my pet peeves with ZFS:

      • The ZFS kernel interface allows you to atomically create a set of snapshots. This is fantastic for package updates if you want, for example, to keep /usr/local/etc and /usr/local/ in separate datasets so that you can back up the config files easily but not bother backing up the whole thing. This feature is so useful that the authors of the zfs command-line tool decided not to expose it to users (see the sketch after this list).
      • The allow mechanism is incredibly coarse-grained. If I want to say ‘my backup user can create snapshots and delete them, but can’t delete snapshots that it didn’t create’ then I’m out of luck: that’s not something that the permissions model can express. Snapshots don’t come with owners. My backup system should be trusted with preserving the confidentiality of my data (it can, after all, read the data unless I’m backing up encrypted ZFS datasets directly) but it shouldn’t have the ability to destroy the datasets that it’s backing up. Yet that’s not something that zfs allow can express.
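
      To make the first bullet concrete, here is a minimal, untested sketch of what the kernel interface offers through libzfs_core: lzc_snapshot() takes an nvlist of snapshot names, which must live in the same pool but need not have any parent-child relationship, and creates them all in one transaction group. The snapshot names here are invented for illustration.

          /*
           * Minimal sketch (untested): atomically snapshot two datasets
           * that have no parent-child relationship, via libzfs_core.
           * Snapshot names are invented.
           * Build with something like: cc snap.c -lzfs_core -lnvpair
           */
          #include <stdio.h>
          #include <libnvpair.h>
          #include <libzfs_core.h>

          int
          main(void)
          {
              nvlist_t *snaps, *errlist = NULL;
              int err;

              if (libzfs_core_init() != 0) {
                  fprintf(stderr, "libzfs_core_init failed\n");
                  return (1);
              }

              /* One nvlist entry per snapshot to create. */
              snaps = fnvlist_alloc();
              fnvlist_add_boolean(snaps, "zroot/usr/local@pre-upgrade");
              fnvlist_add_boolean(snaps, "zroot/usr/local/etc@pre-upgrade");

              /*
               * The kernel creates every snapshot in the list in a single
               * transaction group: afterwards either all of them exist or
               * none do. They must all be in the same pool.
               */
              err = lzc_snapshot(snaps, NULL, &errlist);
              if (err != 0)
                  fprintf(stderr, "lzc_snapshot: error %d\n", err);

              fnvlist_free(snaps);
              if (errlist != NULL)
                  fnvlist_free(errlist);
              libzfs_core_fini();
              return (err != 0);
          }
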
      1. 3

        Snapshotting /usr/local or /usr/local/etc can sometimes be useful, but ZFS Boot Environments are a whole lot better - they secure the entire system, including both /usr/local and /usr/local/etc. More on that here:

        https://vermaden.files.wordpress.com/2018/11/nluug-zfs-boot-environments-reloaded-2018-11-15.pdf

        1. 1

          Snapshotting /usr/local or /usr/local/etc can sometimes be useful, but ZFS Boot Environments are a whole lot better

          These solve different problems. BEs clone the base system. I want to snapshot /usr/local so that I can roll back if pkg upgrade goes wrong. I want a BE so that I can roll back if an upgrade of the base system goes wrong (and it’s super useful - I’ve occasionally made changes to libc that work fine in some tests, done a make installworld and after a reboot discovered that I broke something that init needs - being able to switch back to the old BE from loader is fantastic).

          You’ll note from that presentation that beadm is written in shell and uses zfs snapshot. This means that, as with the /usr/local/etc case, beadm requires zpool/ROOT to be either a single dataset or a tree of datasets. A BE doesn’t contain /usr/local. bectl is part of the base system and is actually written in C as a thin wrapper around libbe (also in the base system). This could snapshot multiple independent filesystems atomically (it uses libzfs, which wraps the ZFS ioctls, which take an nvlist of filesystems to snapshot); it just doesn’t.

          If you do bectl list -a then it will show exactly which filesystems are snapshotted. If you then compare this against the output of mount then you’ll see that there are a lot that are not in the tree of any BE. This is usually a useful feature: you can use BEs to test the same set of packages against different base system versions, you don’t normally want to roll back home directories if you discover problems in an update, and so on.

          1. 3

            These solve different problems. BEs clone the base system. I want to snapshot /usr/local so that I can roll back if pkg upgrade goes wrong. I want a BE so that I can roll back if an upgrade of the base system goes wrong (and it’s super useful - I’ve occasionally made changes to libc that work fine in some tests, done a make installworld and after a reboot discovered that I broke something that init needs - being able to switch back to the old BE from loader is fantastic).

            The default FreeBSD installation on ZFS also INCLUDES /usr/local in the BE. This is what I am trying to tell you.

            You’ll note from that presentation that beadm is written in shell and uses zfs snapshot. This means that, as with the /usr/local/etc case, beadm requires zpool/ROOT to be either a single dataset or a tree of datasets. A BE doesn’t contain /usr/local. bectl is part of the base system and is actually written in C as a thin wrapper around libbe (also in the base system). This could snapshot multiple independent filesystems atomically (it uses libzfs, which wraps the ZFS ioctls, which take an nvlist of filesystems to snapshot); it just doesn’t.

            I know because I am the author of the beadm(8) command.

            Both bectl(8) and beadm(8) require the zpool/ROOT approach … and yes, a BE by default contains the /usr/local directory. beadm(8) uses the zfs snapshot -r command, which means that it does a RECURSIVE snapshot. The same is done in bectl(8). It does not matter that bectl(8) uses libbe. They work the same.

            If you do bectl list -a then it will show exactly which filesystems are snapshotted.

            Please see the presentation again - especially pages 42/43/44, which tell you exactly the info you need. /usr/local IS INCLUDED in the BE with the default FreeBSD install on ZFS.

            1. 1

              The default FreeBSD installation on ZFS also INCLUDES /usr/local in the BE. This is what I am trying to tell you.

              The default ZFS install doesn’t put /usr/local in a separate ZFS dataset. This is one of the first things that I need to fix on any new FreeBSD install, before I install any packages. In the vast majority of cases, I don’t want /usr/local to be in my BE because if I change something in a package config and then discover I need to roll back to a prior BE then I’d lose that change. In my ideal world, /etc would not contain system-provided rc.d scripts, defaults, or any of the other system-immutable things that have ended up there, and /etc would be omitted from the BE as well so that I wouldn’t lose config changes on BE rollback, but that’s not possible while /etc needs to be mounted before init runs and while it contains a mix of user- and system-provided things.

      2. 1

        Newish ZFS user here.

        What do you mean by:

        The ZFS kernel interface allows you to atomically create a set of snapshots.

        Specifically, what is a “set of snapshots”?

        1. 2

          Not really a ZFS thing, actually - more of an ACID or data integrity concept. Each ZFS dataset is a completely separate entity; it is configured separately, managed separately, and even has its own isolated IO (you have to copy-and-delete the entirety of a file to move it to a different dataset, even if it’s on the same zpool).

          A regular snapshot doesn’t make any atomicity guarantees with regard to a snapshot of a different ZFS dataset: if your app writes to a ZFS dataset at /var/db/foo.db and logs to a separate ZFS dataset at /var/log/foo, and you snapshot both “regularly” and then restore, you might find that the log references data that isn’t found in the db, because the snapshots weren’t synchronized. An atomic set of snapshots would not run into that.

          (But I thought recursive snapshots of / would give you atomic captures of the various child datasets, so it’s exposed in that fashion, albeit in an all-or-nothing approach?)

        2. 1

          I want to do zfs snapshot zroot/usr/local@1 zroot/usr/local/etc@1 zroot/var/log@1 or similar. It turns out I can do this now. Not sure when it was added, but very happy that it’s there now.

      3. 1

        This feature is so useful that the authors of the zfs command-line tool decided not to expose it to users.

        Is that not what zfs snapshot -r does? (note the -r). I think it’s supposed to create a set of snapshots atomically. Granted, they all have to be descendants of some dataset, but that’s not necessarily a big issue because the hierarchy of ZFS datasets need not correspond to the filesystem hierarchy (you can set the mountpoint property to mount any dataset in whatever path you want).

        Also, I think ZFS channel programs allow you to do that atomically, but with a lot more flexibility (e.g. no need for the snapshots to be descendants of the same dataset, and you can also perform other ZFS administration commands in between the snapshots if you want), since it basically allows you to create your own Lua script that runs at the kernel level, atomically, when ZFS is synchronizing the pools. See man zfs-program(8).
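
        As a rough, untested sketch of that (with invented dataset and snapshot names, and the commonly cited default instruction and memory limits), a channel program can be driven from C through libzfs_core’s lzc_channel_program(); everything the embedded Lua does commits in a single txg:

            /*
             * Rough sketch (untested): run a ZFS channel program from C via
             * lzc_channel_program(). The Lua below snapshots two datasets
             * that are not parent and child; the program executes in a
             * single txg, so both snapshots appear atomically. Requires
             * root. Build with something like: cc cp.c -lzfs_core -lnvpair
             */
            #include <stdio.h>
            #include <libnvpair.h>
            #include <libzfs_core.h>

            static const char *prog =
                "zfs.sync.snapshot('zroot/usr/local@cp-demo')\n"
                "zfs.sync.snapshot('zroot/var/log@cp-demo')\n";

            int
            main(void)
            {
                nvlist_t *args, *out = NULL;
                int err;

                if (libzfs_core_init() != 0)
                    return (1);

                args = fnvlist_alloc();    /* no arguments for the script */
                /* 10M Lua instructions, 10MiB of memory: the usual limits. */
                err = lzc_channel_program("zroot", prog,
                    10 * 1000 * 1000, 10 * 1024 * 1024, args, &out);
                if (err != 0)
                    fprintf(stderr, "channel program: error %d\n", err);

                fnvlist_free(args);
                if (out != NULL)
                    fnvlist_free(out);
                libzfs_core_fini();
                return (err != 0);
            }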

        1. 1

          -r creates a set of snapshots of a complete tree. It doesn’t allow you to atomically create a set of snapshots for datasets that don’t have a strict parent-child relationship. For example, with Boot Environments, / is typically zroot/ROOT/${current_be_name} and /usr/local is zroot/usr/local, so you can’t snapshot both together with the command-line tool.

          The ioctl that this uses doesn’t actually do anything special for recursive snapshots. It just takes an nvlist that is a list of dataset names and snapshots them all. When you do zfs snapshot -r, the userspace code collects a set of names of datasets and then passes them to the ioctl. This is actually racy because if a dataset is created in the tree in the middle of the operation then it won’t be captured in the snapshot, so a sequence from another core of ‘create child dataset’ then ‘create symlink in parent to file in child’ can leave the resulting snapshots in an inconsistent state because they’ll capture the symlink but not the target. In practice, this probably doesn’t matter for most uses of -r.
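
          As a rough, untested sketch of that userspace side (dataset and snapshot names invented): the tree walk and the ioctl are two separate steps, and the race sits between them.

            /*
             * Rough sketch (untested) of roughly what 'zfs snapshot -r'
             * does: walk the dataset tree in userspace, collect names,
             * then issue one atomic lzc_snapshot() over whatever the walk
             * saw. A dataset created during the walk is silently missed.
             * Build with: cc rsnap.c -lzfs -lzfs_core -lnvpair
             */
            #include <stdio.h>
            #include <libnvpair.h>
            #include <libzfs.h>
            #include <libzfs_core.h>

            struct walk {
                nvlist_t *snaps;
                const char *snapname;
            };

            static int
            collect(zfs_handle_t *zhp, void *data)
            {
                struct walk *w = data;
                char name[ZFS_MAX_DATASET_NAME_LEN + 64];

                (void)snprintf(name, sizeof(name), "%s@%s",
                    zfs_get_name(zhp), w->snapname);
                fnvlist_add_boolean(w->snaps, name);
                /* Step 1: recurse; a child created after this is missed. */
                zfs_iter_filesystems(zhp, collect, w);
                zfs_close(zhp);
                return (0);
            }

            int
            main(void)
            {
                libzfs_handle_t *hdl = libzfs_init();
                struct walk w = { fnvlist_alloc(), "demo" };
                zfs_handle_t *root;
                nvlist_t *errlist = NULL;

                if (hdl == NULL || libzfs_core_init() != 0)
                    return (1);
                root = zfs_open(hdl, "zroot/usr/local", ZFS_TYPE_FILESYSTEM);
                if (root == NULL)
                    return (1);
                collect(root, &w);
                /* Step 2: one atomic ioctl over the names step 1 saw. */
                if (lzc_snapshot(w.snaps, NULL, &errlist) != 0)
                    fprintf(stderr, "lzc_snapshot failed\n");
                fnvlist_free(w.snaps);
                libzfs_core_fini();
                libzfs_fini(hdl);
                return (0);
            }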

          Channel programs do allow this, but they don’t compose well with other ZFS features. In particular, they’re restricted to root (for good reason: it’s generally a bad idea to allow non-root users to run even fairly restricted code in the kernel because they can use it to mount side-channel attacks or inject handy gadgets for code-reuse attacks. This is why eBPF in Linux is so popular with attackers). This means that you can’t use them with ZFS delegated administration. It would be a lot better if channel programs were objects in some namespace so that root could install them and other users could then be authorised to invoke them.

          On the other hand, the kernel interface already does exactly what I want and the libzfs_core library provides a convenient C wrapper; I’m just complaining that the zfs command-line tool doesn’t expose this.

          1. 3

            Actually, it looks as if I’m wrong. zfs snapshot can do this now. I wonder when that was added…

          2. 1

            It would be a lot better if channel programs were objects in some namespace so that root could install them and other users could then be authorised to invoke them.

            Can’t you create a script as root and then let other users invoke it with doas or sudo?

            1. 1

              Not from a jail, no (unless you want to allow a mechanism for privilege elevation from a jail to the host, which sounds like a terrible idea). In general, maybe, if you wanted to put doas or sudo in your TCB (I’d be more happy with doas, but sudo’s security record isn’t fantastic). But now if a program wants to run with delegated administration and provide a channel script, it also needs to provide all of this privilege elevation machinery, and there are enough fun corner cases that it will probably get it wrong and introduce new security vulnerabilities. Oh, and the channel program code doesn’t know who it’s running as, so you end up needing some complex logic to check the allow properties there to make sure that the channel program is not run by a user who shouldn’t be allowed to use it.

              I’d love to see installing and running channel programs completely separated so that only unjailed root could install them (though they could be exposed to jails) and then any user could enumerate them and invoke them with whatever parameters they wanted, but the channel program then runs with only the rights of that user, so they couldn’t do anything with a channel program that delegated administration didn’t let them do anyway.