Threads for wizeman

  1. 4

    Just a word of caution: I’ve tried replacing the mitigation patch (which I had been using for 18 months) with the author’s minimal fix (the first of the 6 patches, which is supposed to fix the issue) and I immediately ran into a deadlock while building Poly/ML 5.9 (3 times, in fact), whereas I had never encountered this bug before, despite having built Poly/ML many times.

    This was on a 32-core machine with lots of background compilation going on at the same time (i.e. with a high load).

    It could be a coincidence or an unrelated bug, but it’s also possible that the author’s patch might not fix the issue completely.

    1. 2

      Speaking of verification, I have a seemingly simple problem that the systems I’ve tried (TLA+, Coq) seem unable to address (or, more likely, I don’t have the tools).

      So I have these two integers a and b, that are in [0, K] (where K is a positive integer). I would like to prove the following:

      • a + b ≤ 2 K
      • a × b ≤ K × K

      Should be easy, right? Just one little snag: K is often fairly big, typically around 2^30 (my goal here is to prove that a given big-number arithmetic routine never causes limb overflow). I suspect naive SAT solving around Peano arithmetic is not going to cut it.

      1. 4

        This should be pretty easy for any modern SMT solver to prove. I’m not exactly an expert, but this seems to work for Z3:

        ; Declare variables/constants
        
        (declare-const K Int)
        (declare-const a Int)
        (declare-const b Int)
        
        ; Specify the properties of the variables
        
        (assert (> K 0))
        
        (assert (>= a 0))
        (assert (>= b 0))
        
        (assert (<= a K))
        (assert (<= b K))
        
        ; Now let's prove facts
        
        (push) ; Save context
        
        ; Note how I've actually inserted the opposite statement of what you are trying to prove, see below as to why
        
        (assert (> (+ a b) (* 2 K)))
        (check-sat)
        
        ; If you get an `unsat` answer, it means your statement is proved
        ; If instead you get a `sat` answer, you can use the (get-model) command here
        ; to get a set of variable assignments which satisfy all the assertions, including
        ; the assertion stating the opposite of what you are trying to prove
        
        (pop) ; Restore context
        
        (assert (> (* a b) (* K K)))
        (check-sat)
        
        ; See above for the comment about the (get-model) command
        

        Save that to a file and then run z3 <file.smt>.

        Z3 should give you 2 unsat answers in a fraction of a second, which means that your 2 statements were proven to be true. Notably, it proves this for any K > 0 (including 2^30, 2^250, 2^1231823, etc…)

        As far as I understand, the biggest gotcha is that you have to negate the statement that you are trying to prove and then let the SMT solver prove that there is no combination of values for the a, b and K integers that satisfy all the assertions. It’s a bit unintuitive at first, but it’s not hard to get used to it.

        1. 3

          my goal here is to prove that a given big-number arithmetic routine never causes limb overflow

          I’m not exactly sure what you mean here. Is it that you’re using modulo arithmetic? If not, I’ve got a little proof here in Coq:

          Theorem for_lobsters : forall a b k : nat,
            a<=k /\ b<=k -> a+b <= 2*k /\ a*b <= k*k.
          Proof.
            split.
            - lia.
            - now apply PeanoNat.Nat.mul_le_mono.
          Qed.
          

          I think even if you’re doing modulo arithmetic, it shouldn’t be too hard to prove the given lemmas. But you might need to put some tighter restrictions on the bounds of a and b. For example requiring that a and b are both less than sqrt(k) (though this is too strict).

          1. 1

            My, I’m starting to understand why I couldn’t prove that trivial theorem:

            • I’m not sure what “split” means, though I guess it splits the conjunction in the conclusion of the theorem into 2 goals…
            • I have no idea what lia means.
            • I have no idea how PeanoNat.Nat.mul_le_mono applies here. I guess I’ll have to look at the documentation.

            Thanks a lot though, I’ll try this out.

            1. 2

              I’m not sure what “split” means, though I guess it splits the conjunction in the conclusion of the theorem into 2 goals…

              Yep!

              I have no idea what lia means.

              The lia tactic solves goals in linear integer arithmetic. It will magically discharge a lot of goals that are composed of integer arithmetic. The docs on lia can be found here. Note that I omitted an import statement in the snippet above. You need to prepend From Coq Require Import Lia to use it.
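
              In case a tiny self-contained example helps, here is the import plus lia on a toy goal (just an illustration, not the theorem above):

              From Coq Require Import Lia.

              (* lia discharges goals in linear integer arithmetic over nat/Z
                 once the relevant hypotheses are in scope. *)
              Goal forall a b k : nat, a <= k -> b <= k -> a + b <= 2 * k.
              Proof. intros a b k Ha Hb. lia. Qed.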

              I have no idea how PeanoNat.Nat.mul_le_mono applies here. I guess I’ll have to look at the documentation.

              The mul_le_mono lemma has the following statement, which almost exactly matches the goal. I found it using Search "<=".

              PeanoNat.Nat.mul_le_mono
                   : forall n m p q : nat, n <= m -> p <= q -> n * p <= m * q
              

              I used now apply ... which is shorthand for apply ...; easy. The easy tactic will try to automatically solve the proof using a bunch of different tactics. You could do without the automation and solve the goal with apply PeanoNat.Nat.mul_le_mono; destruct H; assumption, if you’re so inclined.
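
              For reference, the fully spelled-out version (same theorem as above, no now/easy automation, with the imports made explicit) might look like this:

              From Coq Require Import PeanoNat Lia.

              Theorem for_lobsters' : forall a b k : nat,
                a <= k /\ b <= k -> a + b <= 2 * k /\ a * b <= k * k.
              Proof.
                intros a b k H.           (* name the conjunction hypothesis H *)
                destruct H as [Ha Hb].    (* split it into the two bounds *)
                split.
                - lia.                                          (* a + b <= 2 * k *)
                - apply PeanoNat.Nat.mul_le_mono; assumption.   (* a * b <= k * k *)
              Qed.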

              I hope this is helpful!

        1. 22

          Perhaps SML (Standard ML)? Unusually, it has a formal specification of its typing rules and operational semantics, as well as a standard library, all of which are somewhat set in stone. By that I mean that, apparently, those specifications haven’t changed in 25 years, even though it is still widely used in certain areas (such as proof assistants). That said, there have been efforts to improve/extend it, under the umbrella “Successor ML”.

          1. 10

            Yeah, for the particular needs that OP is describing, SML is ideal. You can take any SML program written in the last 25 years and compile it with any SML compiler (modulo implementation-specific extensions), and that will remain the case for the next 100 years, because of the specification.

            Now, whether that is a good thing for the language ecosystem and adoption is another question entirely.

            1. 7

              I agree. SML is a dead language, which is what the OP is asking for, yet it is functional and reasonably capable. It does not have a very active community, and indeed OCaml, F# or GHC Haskell are probably more pleasant to use in practice, but they are not stable as languages.

              1. 11

                Precisely: “dead” is a feature here, if we mean dead as in “no longer being extended”.

                All things considered, SML is 100% what I’m going to be looking at. It looks great!

                EDIT: I think everyone should take a minute to read https://learnxinyminutes.com/docs/standard-ml/ … It is beautiful! FFI looks very easy too. Next step is to implement an RSS feed manager with SML + SQLite.

                1. 4

                  It’s also not wasted time. SML is used as the foundation for a lot of PLT research because it is so formally defined and provides a nice basis for further work.

                  1. 3

                    To be fair, Haskell98 is also quite old and dead. It’s hard to get a recent compiler into pure H98 mode (especially with respect to the standard library), but old ones still exist and don’t need to change. Haskell2010 is the same, just less old. Only “GHC Haskell” is a moving target, but it’s also the most popular.

                    1. 2

                      I would care more if there were a Haskell98 implementation that people were actively improving performance-wise.

                      1. 3

                        Why? Do you find GHC’s H98 under-performing?

              2. 7

                OCaml and Haskell are well ahead in terms of features as well as adoption. F# and Scala are not going anywhere. That said, I really like SML and am glad that MLton, MLKit, Poly/ML, SML/NJ and SML# are all alive. If only someone (I don’t think I have the chops for it) could resuscitate SML.NET.

                1. 4

                  OCaml and Haskell are well ahead in terms of features

                    I think OP wants stability over features. Some people consider new features “changes” because what is idiomatic may change.

                2. 2

                  I think this is the right answer. I have had similar impulses to the OP and always came back to Standard ML as the thing that comes closest to fulfilling this purpose. I just wish it were less moribund and a little closer to Haskell in terms of aesthetics.

                1. 3

                  I haven’t decided yet if I agree with the conclusion of this article. I recently set up a new bare-metal server using ZFS on Linux, and the fact that the ARC is separate from the normal Linux page cache was bugging me. I found this article while looking for information on possible implications of that issue. I do like ZFS’s unification of filesystem and volume manager (the “layering violation”), the snapshot support, and of course, the emphasis on data integrity.

                  1. 6

                    Not sure about Linux but on FreeBSD the buffer cache was extended to allow it to contain externally-owned pages. This means that ARC maintains a pool of pages for disk cache (and grows and shrinks based on memory pressure in the system) and the same pages are exposed to the buffer cache. Prior to this (I think this landed in FreeBSD 10?), pages needed to be copied from the ARC into the buffer cache before they could be used to service I/Os and so you ended up with things being cached twice. Now, ARC just determines which pages remain resident from the storage pool.

                    1. 2

                      On Linux, the free command counts the ARC under “used”, not “available” like the standard Linux page cache.

                      Having read all of the comments here, I think I’ll stick with ZFS. But I still need to find out if there are any unusual high-memory-pressure situations I should watch out for.
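
                      For what it’s worth, the ARC’s current size is visible through the ZFS kstats rather than free (this assumes ZFS on Linux; the arc_summary and arcstat tools that ship with OpenZFS present the same counters more readably):

                      # Print the current ARC size in GiB from the OpenZFS kstats
                      awk '$1 == "size" {printf "%.2f GiB\n", $3 / 2^30}' /proc/spl/kstat/zfs/arcstats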

                      1. 6

                        FYI, here’s what I’ve observed regarding memory pressure after many years of running OpenZFS on Linux on multiple machines, each with multiple ZFS pools, some using hard disks and others using SSDs (and some using HDDs with SSDs as caches), and also including ZFS as the root filesystem on all systems:

                        • Many years ago I ran into a situation where, under heavy memory pressure, the ARC would shrink too much. This caused the working set of ZFS metadata and/or commonly accessed data to be constantly evicted from cache, and therefore the system appeared to almost hang because it was too busy doing I/O to re-read recently-evicted blocks instead of making useful progress. The workaround was simple: set the zfs_arc_min parameter to force arc_c_min to be 1/16th of system memory (6.25%), rather than the default 1/32nd (3.125%). As an example, on a Raspberry Pi with 8 GiB of memory I add the zfs.zfs_arc_min=536870912 Linux kernel parameter in my boot/grub config (which sets the minimum ARC size to 512 MiB rather than the default 256 MiB), and on a server with 128 GiB I add zfs.zfs_arc_min=8589934592 (i.e. 8 GiB rather than the default 4 GiB), etc. (you get the idea). Since I did this, I have never observed this behavior again, even under the same (or different) stressful workloads.
                        • On recent OpenZFS versions, on my systems, I’ve observed that the ARC growing/shrinking behavior is flawless compared to a few years ago, even under heavy memory pressure. However, the maximum ARC size is still too conservative by default (50% of memory). This meant that under low memory pressure, the ARC would never grow beyond 64 GiB of memory on my 128 GiB server, leaving almost 64 GiB of memory completely unused at all times (instead of using it as a block cache for more infrequently used data or metadata). So I just add another kernel parameter to set the maximum ARC size to 90% of system memory instead. For example, on my 8 GiB Raspberry Pis I add zfs.zfs_arc_max=7730941132 (i.e. 7.2 GiB) and on my 128 GiB server I add zfs.zfs_arc_max=123695058124 (i.e. 115.2 GiB); a config sketch follows at the end of this comment. Since then, the ARC is free to grow more and use otherwise unused memory as a cache.
                        • Although strictly not a memory issue, many years ago I also added the zfs.zfs_per_txg_dirty_frees_percent=0 parameter, which disables some throttling that was causing me problems, although I don’t remember the exact details. This might no longer be necessary (I’m not sure).

                        Apart from that I haven’t observed any other issues. Please keep in mind, though, that I use zram swap on all my systems and I don’t use dedup, RAID-Z or ZFS encryption (I use LUKS encryption instead), etc. So your mileage might vary depending on your system hardware/software config and/or your workloads, especially if they are somewhat extreme in some way.
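
                        In case someone wants to replicate this, here is roughly what the configuration looks like for one of my 8 GiB machines (a sketch: the exact file and the command to regenerate the boot config vary by distro, and the modprobe.d form needs the initramfs regenerated if the zfs module is loaded from it):

                        # Either add the parameters to the kernel command line, e.g. in /etc/default/grub
                        # (then regenerate the grub config with update-grub or grub2-mkconfig):
                        GRUB_CMDLINE_LINUX_DEFAULT="... zfs.zfs_arc_min=536870912 zfs.zfs_arc_max=7730941132"

                        # ...or set them as module options in /etc/modprobe.d/zfs.conf:
                        options zfs zfs_arc_min=536870912 zfs_arc_max=7730941132

                        # The values currently in effect can be inspected at runtime:
                        cat /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max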

                        1. 1

                          If you find out, will you leave a link here or PM me? I too do not have a good mental model here and that slightly worries me.

                      2. 1

                        and the fact that the ARC is separate from the normal Linux page cache was bugging me

                        How separate is it? I know that /proc/sys/vm/drop_caches will drop it in the same manner as the normal page cache.

                      1. 61

                        Please don’t pay attention to this article; it’s almost completely wrong. I think the author doesn’t know what he’s talking about. I will go point by point:

                        • Out-of-tree and will never be mainlined: Please don’t use zfs-fuse; it has long been abandoned and OpenZFS is better in every respect. The rest of the points I guess are true (or might be, eventually, sure).
                        • Slow performance of encryption: This seems to be completely wrong. I believe OpenZFS re-enabled vector instructions with its own implementation of the kernel code that could no longer be used. For an example, see https://github.com/openzfs/zfs/pull/9749 which was merged many months after the Linux kernel disabled vector instructions.
                        • Rigid: This was done deliberately so people like the author don’t shoot themselves in the foot. It would actually have been easier to make the vdev hierarchy more flexible, but ZFS is more strict on purpose, so users don’t end up with bad pool configurations.
                        • Can’t add/remove disks to RAID: I guess this is still true? I’m not entirely sure because I’m not following OpenZFS development closely nor do I use RAID-Z.
                        • RAID-Z is slow: As far as I know this is correct (in terms of IOPS), so RAID-Z pools are more appropriate for sequential I/O rather than random I/O.
                        • File-based RAID is slow: OpenZFS can now do scrubs and resilvers (mostly) sequentially, so this point is wrong now.
                        • Real-world performance is slow: I wouldn’t call it slow, but ZFS can be slower than ext4, sure (but it’s also doing a lot more than ext4, on purpose, such as checksumming, copy-on-write, etc).
                        • Performance degrades faster with low free space: The free-space bitmap comment is just weird/wrong, because ZFS actually has more scalable data structures for this than most other filesystems (such as ext4). It might be true that ZFS fragments more than ext4 at around 80% utilization, but this is probably just a side-effect of copy-on-write. Either way, no filesystem handles mostly-full disks very well in terms of fragmentation, so this is not something specific to ZFS; it’s just how they (have to) work.
                        • Layering violation of volume management: This is completely wrong. You can use other filesystems on top of a ZFS pool (using ZVols) and you can use ZFS on top of another volume manager if you want (but I wouldn’t recommend it), or even mix it with other filesystems on the same disk (each on their own partition). Also, you can set a ZFS dataset/filesystem’s mountpoint property to legacy and then use normal mount/umount commands if you don’t like ZFS’s automounting functionality (see the short example after this list).
                        • Doesn’t support reflink: This is correct.
                        • High memory requirements for dedupe: The deduplication table is actually not kept in memory (except that a DDT block is cached whenever it’s read from disk, like any other metadata). So, as an example, if you have some data that is read-only (or mostly read-only) you can store it deduped and (apart from the initial copy) reading it will not be any slower than reading any other data (although modification or removal of this data will be slower if ZFS has to keep reading DDT blocks from disk because they were evicted from cache).
                        • Dedupe is synchronous: Sure it’s synchronous, but IOPS amplification will mostly be observed only if the DDT can’t be cached effectively.
                        • High memory requirements for ARC: I don’t even know where to begin. First of all, the high memory requirements for the ARC have been debunked numerous times. Second, it’s normal for the ARC to use 17 GiB of memory if that memory is otherwise unused – this is what caches (such as the ARC) are for! The ARC will shrink whenever memory is needed by applications or the rest of the kernel. Third, I use OpenZFS on all my machines, none of them are exclusively ZFS hosts, and there is exactly zero infighting in any of them. Fourth, again, please just ignore zfs-fuse; there is no reason to even consider using it in 2022.
                        • Buggy: All filesystems have bugs, that’s just a consequence of how complicated they are. That said, knowing what I know about the ZFS design, code and testing procedures (which is a lot, although my knowledge is surely a bit outdated), I would trust ZFS with my data above any other filesystem, bar none.
                        • No disk checking tool: This is actually a design decision. Once filesystems get too large, fsck doesn’t scale anymore (and it implies downtime, almost always), so the decision was made to gracefully handle minor corruption while the machine is running and being used normally. Note that a badly corrupted filesystem will of course panic, as it likely wouldn’t even be possible to recover it anymore, so it’s better to just restore from backups. But you can also mount the ZFS pool read-only to recover any still-accessible data, even going back in time if necessary!
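
                        To illustrate the legacy mountpoint bit from the layering item above (hypothetical pool/dataset and mountpoint; standard zfs set plus mount usage, nothing exotic):

                        # Opt a dataset out of ZFS's automounting and manage it with normal mount/umount:
                        zfs set mountpoint=legacy tank/data
                        mount -t zfs tank/data /mnt/data
                        umount /mnt/data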

                        In conclusion, IMHO this article is mostly just FUD.

                        1. 21

                          This is actually a design decision.

                          A question on my mind while reading this was whether or not the author knows ZFS well enough to be making some of these criticisms honestly. They seem like they should, or could. I am not attacking their intelligence; however, I would prefer to see a steelman argument that acknowledges the actual reasons for ZFS design choices. Several of the criticisms are valid, but on the topics of fsck, ARC and layering the complaints appear misguided.

                          I spent 7 years using the solution they recommend (LUKS+btrfs+LVM) and have been moving to ZFS on all new machines. I’ll make that a separate top-level comment, but I wanted to chime in agreeing with you on the tone of the article.

                          1. 7

                            I’m not sure the check tool is really not needed. It’s not something I want to run on mount or periodically. I want a “recovery of last resort” offline tool instead, and it doesn’t have to scale because it’s only used when things are down anyway. If there’s enough of a use case to charge for this (https://www.klennet.com/zfs-recovery/default.aspx), there’s enough to provide it by default.

                            1. 4

                              In general we try to build consistency checking and repair into the main file system code when we can; i.e., when doing so isn’t likely to make things worse under some conditions.

                              It sounds like what you’re after is a last ditch data recovery tool, and that somewhat exists in zdb. It requires experience and understanding to hold it correctly but it does let you lift individual bits of data out of the pool. This is laborious, and complicated, and likely not possible to fully automate – which is why I would imagine many folks would prefer to pay someone to try to recover data after a catastrophe.

                            2. 5

                              Dedup does generally have high memory requirements if you want decent performance on writes and deletes; this is a famous dedup limitation that makes it not feasible in many situations. If the DDT can’t all be in memory, you’re doing additional random IO on every write and delete in order to pull in and check the relevant section of the DDT, and there’s no locality in these checks because you’re looking at randomly distributed hash values. This limitation isn’t unique to ZFS; it’s intrinsic to any similar dedup scheme.

                              A certain amount of ZFS’s nominal performance issues are because ZFS does more random IOs (and from more drives) than other filesystems do. A lot of the stories about these performance issues date from the days when hard drives were dominant, with their very low IOPS figures. I don’t think anyone has done real performance studies in these days of SSDs and especially NVMe drives, but naively I would expect the relative ZFS performance to be much better these days since random IO no longer hurts so much.

                              (At work, we have run multiple generations of ZFS fileservers, first with Solaris and Illumos on mostly hard drives and now with ZoL on Linux on SATA SSDs. A number of the performance characteristics that we care about have definitely changed with the move to SSDs, so that some things that weren’t feasible on HDs are now perfectly okay.)

                            1. 4

                              This post obliquely hints at two of my pet peeves with ZFS:

                              • The ZFS kernel interface allows you to atomically create a set of snapshots. This is fantastic for package updates if you want, for example, to keep /usr/local/etc and /usr/local/ in separate datasets so that you can back up the config files easily but not bother backing up the whole thing. This feature is so useful that the authors of the zfs command-line tool decided not to expose it to users.
                              • The allow mechanism is incredibly coarse-grained. If I want to say ‘my backup user can create snapshots and delete them, but can’t delete snapshots that it didn’t create’ then I’m out of luck: that’s not something that the permissions model can express. Snapshots don’t come with owners. My backup system should be trusted with preserving the confidentiality of my data (it can, after all, read the data unless I’m backing up encrypted ZFS datasets directly) but it shouldn’t have the ability to destroy the datasets that it’s backing up. Yet that’s not something that zfs allow can express.
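
                              To make the granularity concrete: something like the following (hypothetical backup user and pool) is about as fine-grained as delegation gets today, and destroy then covers every snapshot under the dataset, not just the ones the backup user created:

                              # Delegate snapshot creation/destruction to the backup user
                              # (mount is included because destroy generally needs the mount ability too):
                              zfs allow -u backup snapshot,destroy,mount,send tank/home

                              # Review what has been delegated:
                              zfs allow tank/home
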
                              1. 3

                                Snapshotting /usr/local or /usr/local/etc can sometimes be useful, but ZFS Boot Environments are a whole lot better, securing the entire system and also both /usr/local and /usr/local/etc. More on that here:

                                https://vermaden.files.wordpress.com/2018/11/nluug-zfs-boot-environments-reloaded-2018-11-15.pdf

                                1. 1

                                  Snapshotting /usr/local or /usr/local/etc can sometimes be useful, but ZFS Boot Environments are a whole lot better

                                  These solve different problems. BEs clone the base system. I want to snapshot /usr/local so that I can roll back if pkg upgrade goes wrong. I want a BE so that I can roll back if an upgrade of the base system goes wrong (and it’s super useful - I’ve occasionally made changes to libc that work fine in some tests, done a make installworld and after a reboot discovered that I broke something that init needs - being able to switch back to the old BE from loader is fantastic).

                                  You’ll note from that presentation that beadm is written in shell and uses zfs snapshot. This means that, as with the /usr/local/etc case, beadm requires zpool/ROOT to be either a single dataset or a tree of datasets. A BE doesn’t contain /usr/local. bectl is part of the base system and is actually written in C as a thin wrapper around libbe (also in the base system). This actually could snapshot multiple independent filesystems atomically (it uses libzfs, which wraps the ZFS ioctls, which take an nvlist of filesystems to snapshot), it just doesn’t.

                                  If you do bectl list -a then it will show exactly which filesystems are snapshotted. If you then compare this against the output of mount then you’ll see that there are a lot that are not in the tree of any BE. This is usually a useful feature: you can use BEs to test the same set of packages against different base system versions, you don’t normally want to roll back home directories if you discover problems in an update, and so on.

                                  1. 3

                                    These solve different problems. BEs clone the base system. I want to snapshot /usr/local so that I can roll back if pkg upgrade goes wrong. I want a BE so that I can roll back if an upgrade of the base system goes wrong (and it’s super useful - I’ve occasionally made changes to libc that work fine in some tests, done a make installworld and after a reboot discovered that I broke something that init needs - being able to switch back to the old BE from loader is fantastic).

                                    The default FreeBSD installation on ZFS also INCLUDES /usr/local in the BE. This is what I am trying to tell you.

                                    You’ll note from that presentation that beadm is written in shell and uses zfs snapshot. This means that, as with the /usr/local/etc case, beadm requires zpool/ROOT to be either a single dataset or a tree of datasets. A BE doesn’t contain /usr/local. bectl is part of the base system and is actually written in C as a thin wrapper around libbe (also in the base system). This actually could snapshot multiple independent filesystems atomically (it uses libzfs, which wraps the ZFS ioctls, which take an nvlist of filesystems to snapshot), it just doesn’t.

                                    I know because I am the author of the beadm(8) command.

                                    Both bectl(8) and beadm(8) require the zpool/ROOT approach … and yes, by default the BE contains the /usr/local directory. The beadm(8) command uses zfs snapshot -r, which means it takes a RECURSIVE snapshot. The same is done in bectl(8). It does not matter that bectl(8) uses libbe; they work the same.

                                    If you do bectl list -a then it will show exactly which filesystems are snapshotted.

                                    Please see the presentation again, especially pages 42/43/44, which tell you exactly what you need. /usr/local IS INCLUDED in the BE with the default FreeBSD install on ZFS.

                                    1. 1

                                      The default FreeBSD installation on ZFS also INCLUDES /usr/local in the BE. This is what I am trying to tell you.

                                      The default ZFS install doesn’t put /usr/local in a separate ZFS dataset. This is one of the first things that I need to fix on any new FreeBSD install, before I install any packages. In the vast majority of cases, I don’t want /usr/local to be in my BE because if I change something in a package config and then discover I need to roll back to a prior BE then I’d lose that change. In my ideal world, /etc would not contain system-provided rc.d scripts, defaults, or any of the other system-immutable things that have ended up there and /etc would be omitted from the BE as well so that I didn’t lose config changes on BE rollback, but that’s not possible while /etc needs to be mounted before init runs and while it contains a mix of user- and system-provided things.

                                2. 1

                                  Newish ZFS user here.

                                  What do you mean by:

                                  The ZFS kernel interface allows you to atomically create a set of snapshots.

                                  Specifically, what is a “set of snapshots”?

                                  1. 2

                                    Not really a ZFS thing, actually; more of an ACID or data integrity concept. Each ZFS dataset is a completely separate entity; it is configured separately, managed separately, and even has its own isolated IO (you have to copy and then delete the entirety of a file to move it to a different dataset, even if it’s on the same zpool).

                                    A regular snapshot doesn’t make any atomicity guarantees with regard to a snapshot of a different ZFS dataset: if your app writes to a ZFS dataset at /var/db/foo.db and logs to a separate ZFS dataset at /var/log/foo, and you snapshot both “regularly” and then restore, you might find that the log references data that isn’t found in the db, because the snapshots weren’t synchronized. An atomic set of snapshots would not run into that.

                                    (But I thought recursive snapshots of / would give you atomic captures of the various child datasets, so it’s exposed in that fashion, albeit in an all-or-nothing approach?)

                                    1. 1

                                      I want to do zfs snapshot zroot/usr/local@1 zroot/usr/local/etc@1 zroot/var/log@1 or similar. It turns out I can do this now. Not sure when it was added, but very happy that it’s there now.

                                    2. 1

                                      This feature is so useful that the authors of the zfs command-line tool decided not to expose it to users.

                                      Is that not what zfs snapshot -r does? (note the -r). I think it’s supposed to create a set of snapshots atomically. Granted, they all have to be descendants of some dataset, but that’s not necessarily a big issue because the hierarchy of ZFS datasets need not correspond to the filesystem hierarchy (you can set the mountpoint property to mount any dataset in whatever path you want).

                                      Also, I think ZFS channel programs allow you to do that atomically, but with a lot more flexibility (e.g. no need for the snapshots to be descendants of the same dataset, and you can also perform other ZFS administration commands in between the snapshots if you want), since they basically allow you to create your own Lua script that runs at the kernel level, atomically, when ZFS is synchronizing the pools. See man zfs-program(8).

                                      1. 1

                                        -r creates a set of snapshots of a complete tree. It doesn’t allow you to atomically create a set of snapshots for datasets that don’t have a strict parent-child relationship. For example, with Boot Environments, / is typically zroot/ROOT/${current_be_name} and /usr/local is zroot/usr/local, so you can’t snapshot both together with the command-line tool.

                                        The ioctl that this uses doesn’t actually do anything special for recursive snapshots. It just takes an nvlist that is a list of dataset names and snapshots them all. When you do zfs snapshot -r, the userspace code collects a set of dataset names and then passes them to the ioctl. This is actually racy, because if a dataset is created in the tree in the middle of the operation then it won’t be captured in the snapshot, so a sequence from another core of ‘create child dataset’ then ‘create symlink in parent to file in child’ can leave the resulting snapshots in an inconsistent state because they’ll capture the symlink but not the target. In practice, this probably doesn’t matter for most uses of -r.

                                        Channel programs do allow this, but they don’t compose well with other ZFS features. In particular, they’re restricted to root (for good reason: it’s generally a bad idea to allow non-root users to run even fairly restricted code in the kernel because they can use it to mount side-channel attacks or inject handy gadgets for code-reuse attacks. This is why eBPF in Linux is so popular with attackers). This means that you can’t use them with ZFS delegated administration. It would be a lot better if channel programs were objects in some namespace so that root could install them and other users could then be authorised to invoke them.

                                        On the other hand, the kernel interface already does exactly what I want and the libzfs_core library provides a convenient C wrapper, I’m just complaining that the zfs command-line tool doesn’t expose this.
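
                                        For what it’s worth, going through libzfs_core directly looks roughly like this (a sketch, untested here; headers, error handling and link flags differ a little between FreeBSD and Linux, and the snapshot names are made up):

                                        /* cc atomic_snap.c -lzfs_core -lnvpair */
                                        #include <stdio.h>
                                        #include <libnvpair.h>
                                        #include <libzfs_core.h>

                                        int main(void)
                                        {
                                            nvlist_t *snaps, *errlist = NULL;
                                            int err;

                                            libzfs_core_init();

                                            /* One entry per snapshot; the datasets need not be parent/child,
                                             * they just have to live in the same pool.  The whole set is
                                             * created in a single txg, i.e. atomically. */
                                            nvlist_alloc(&snaps, NV_UNIQUE_NAME, 0);
                                            nvlist_add_boolean(snaps, "zroot/ROOT/default@backup1");
                                            nvlist_add_boolean(snaps, "zroot/usr/local@backup1");

                                            err = lzc_snapshot(snaps, NULL, &errlist);
                                            if (err != 0)
                                                fprintf(stderr, "lzc_snapshot failed: %d\n", err);

                                            nvlist_free(snaps);
                                            if (errlist != NULL)
                                                nvlist_free(errlist);
                                            libzfs_core_fini();
                                            return (err != 0);
                                        }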

                                        1. 3

                                          Actually, it looks as if I’m wrong. zfs snapshot can do this now. I wonder when that was added…

                                          1. 1

                                            It would be a lot better if channel programs were objects in some namespace so that root could install them and other users could then be authorised to invoke them.

                                            Can’t you create a script as root and then let other users invoke it with doas or sudo?

                                            1. 1

                                              Not from a jail, no (unless you want to allow a mechanism for privilege elevation from a jail to the host, which sounds like a terrible idea). In general, maybe if you wanted to put doas or sudo in your TCB (I’d be more happy with doas, but sudo‘s security record isn’t fantastic). But now if a program wants to run with delegated administration and provide a channel script it also needs to provide all of this privilege elevation machinery and there are enough fun corner cases that it will probably get it wrong and introduce new security vulnerabilities. Oh, and the channel program code doesn’t know who it’s running as, so you end up needing some complex logic to check the allow properties there to make sure that the channel program is not run by a user who shouldn’t be allowed to use it.

                                              I’d love to see installing and running channel programs completely separated so that only unjailed root could install them (though they could be exposed to jails) and then any user could enumerate them and invoke them with whatever parameters they wanted, but the channel program then runs with only the rights of that user, so they couldn’t do anything with a channel program that delegated administration didn’t let them do anyway.