1. 20
    1. 4

      Note that bcachefs is merged into kernel 6.7.

      I am really excited about the possibilities of bcachefs. It seems like it would really be a great storage subsystem for virtualization servers.

      1. 1

        I would love to have something like bcachefs. ZFS is far too slow for my needs, but I would like to use raidz2 for one specific backup server. Btrfs can’t actually do something like RAIDz2 without eating my data on a hard shutdown, and btrfs’s RAID1C3 is only a substitute, not the real deal.

        1. 4

          I would love to have something like bcachefs. ZFS is far too slow for my needs

          Do you know where the slow speed comes from? Typically, when people say ‘ZFS is slow’ they have a workload that is problematic for CoW filesystems. Lots of in-place overwrites will be slow: each one requires a write to the log, a write of the new block, then a write of the log entry recycling the now-unused block, and finally writes for the metadata updates. Adding a log device on a fast SSD significantly improves performance here. A workload with lots of random writes followed by sequential reads can also be problematic because of the fragmentation introduced during the writes, though adding an L2ARC device can improve this a lot.
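
          As a hedged sketch (the pool and device names are placeholders), adding those auxiliary devices to an existing pool looks like this:

          ```sh
          # Add a fast SSD as a separate intent log (SLOG) to absorb synchronous writes.
          zpool add tank log /dev/nvme0n1

          # Add another SSD as an L2ARC cache device to help read-heavy, fragmented workloads.
          zpool add tank cache /dev/nvme1n1
          ```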

          This kind of problem is intrinsic to a CoW filesystem, so switching to a different CoW filesystem is unlikely to help more than tuning configuration parameters on ZFS would. If you have a workload that’s slow as the result of something that’s intrinsic to the design of ZFS rather than as a property of CoW filesystems, I’d be very curious to learn what it is.

          1. 1

            We did some benchmarks on that box previously, and for Borg backups it didn’t work out for us.

            I have a new box coming in, and I’ll test again to see whether things have changed - we’ll see.

            I don’t know exactly what kind of stuff Borg does under the hood, but at least it’s not a WAL of some DBMS or a VM disk.

            1. 1

              It is possible that my issue will give you some insight into Borg performance: https://github.com/borgbackup/borg/issues/7674

              As far as I remember, Borg flushes the OS cache during the borg create operation (see the issue for details).

      2. 4

        Some of the static nature of ZFS has changed since the author looked. The block pointer rewriting work has made it possible to do some of the changes that they list as impossible. In particular, you can now add disks to RAID-Z configurations.

        For a hobbyist use case, it’s also worth remembering that zfs send | zfs receive is a useful fallback: if you create a new pool, you can move filesystems across with this mechanism. You can also use this to do things like moving unencrypted datasets into an encrypted state or changing the compression algorithm used.
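
        For example (pool and dataset names here are made up), migrating a dataset and changing its compression algorithm on the way across looks roughly like this:

        ```sh
        # Snapshot the source dataset, then replicate it into the new pool,
        # applying a different compression algorithm on the receiving side.
        zfs snapshot oldpool/data@migrate
        zfs send oldpool/data@migrate | zfs receive -o compression=zstd newpool/data
        ```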

        But unlike the other filesystems you can also configure durability much more granularly. You can set durability parameters for any subtree, including individual files

        I’m curious how this works. I would assume that if a root has lower reliability than a child then you can get into situations where a drive fails and, although none of the data in the child is lost, you can’t find the entry node. When you combine this with encryption, I’m not sure if finding it would even be feasible. Perhaps there’s some pattern in the block that you can scan for? Going the other way (e.g. having an obj directory that has striping and no mirroring or checksums) seems easier but I’m still not sure that I could reason about the reliability properties of this.

        Most of the other things (checksum algorithm, compression algorithm and level, number of copies, encryption key) can be set on ZFS at a dataset level, which practically means at a directory level since datasets are as easy to create as directories. If you turn on delegated administration, you can give each user the ability to create new datasets under their home directory.
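
        A quick sketch of what that looks like in practice (dataset and user names are invented; on Linux, letting a non-root user actually mount datasets needs some extra setup):

        ```sh
        # Per-dataset policy: a dataset is as cheap to create as a directory.
        zfs create tank/home/alice
        zfs set compression=zstd checksum=sha256 copies=2 tank/home/alice

        # Delegated administration: let the user manage datasets under their home.
        zfs allow -u alice create,mount,snapshot,destroy tank/home/alice
        ```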

        I’m not sure the slow speed of recursively deleting snapshots in bcachefs is a fair criticism. I think that’s intrinsic to any CoW filesystem. Deleting a single snapshot is, logically, a reference count modification on everything that’s referenced by that snapshot and a delete (and recycle blocks) operation on everything whose reference count hits zero. That’s intrinsically an operation that requires a tree walk. ZFS makes creating a snapshot O(1) in the size of the dataset and accessing files O(1) in the number of snapshots, but has the same tree-walk overhead when deleting a snapshot.

        I’m not really surprised by the data loss. I’m really impressed with the bcachefs design, but it’s still new. The rule of thumb for new filesystems is that it takes 10 years for them to get to the point where they’re reliable enough to use with important data. ZFS was released in 2005. I started using it in 2010, but made sure I had backups of everything on the ZFS disks. I didn’t suffer any data loss, but I was using it in fairly simple configurations. By around 2015, it was at the point where I felt confident using it as my default filesystem for everything. As I said in another post, APFS is the only filesystem that I can think of that has been the exception to the ten-year rule. That was helped first by Apple doing a trial migration of every iOS device to it and then by rolling out on iOS first, where it has a single disk, disk access patterns are constrained, Apple could get a load of telemetry from problems, and where most things are backed up in iCloud. It also had a more constrained feature set (and they had Kirk McKusick help design their testing infrastructure, which probably helped a lot).

        1. 3

          Some of the static nature of ZFS has changed since the author looked. The block pointer rewriting work has made it possible to do some of the changes that they list as impossible. In particular, you can now add disks to RAID-Z configurations.

          RAID-Z expansion hasn’t been merged yet, has it? It probably will be soon, though.

          I’m curious how this works. I would assume that if a root has lower reliability than a child then you can get into situations where a drive fails and, although none of the data in the child is lost, you can’t find the entry node.

          I suppose you’d want to have filesystem metadata always use an especially high durability setting even when making file data less durable.
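
          If I understand the knobs correctly (treat this as a sketch; the exact option names may differ between versions), both filesystems expose roughly that policy:

          ```sh
          # bcachefs: keep two replicas of metadata even where data is left at one copy.
          bcachefs format --metadata_replicas=2 --data_replicas=1 /dev/sda /dev/sdb

          # ZFS analogue at the dataset level: full metadata redundancy, single data copy.
          zfs set redundant_metadata=all copies=1 tank/scratch
          ```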

          APFS

          APFS definitely became pretty good pretty fast, but I’ve seen a decent number of reports of filesystem corruption on MacBooks, caught by people doing fsck before installing Asahi Linux, so I don’t think it’s as reliable as it ought to be.

        2. 3

          As another hobbyist, I have some comments on your thoughts on ZFS.

          ZFS is the giant in this field. Heralded as an improvement on hardware RAID. However, in my opinion it is only a small step above.

          It’s a huge step up in terms of reliability. Not just over hardware RAID but over other filesystems as well. And not just in protecting against drive failures. For example, ZFS is extremely resilient against unclean shutdowns such as sudden power loss when you’re using drives without PLP, even when you have zero redundancy. ZFS doesn’t even have a fsck facility. Not because ZFS is immune to all problems, but because the specific kinds of problems a fsck can help with essentially can’t happen with ZFS.
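
          What ZFS has instead is online scrubbing: every block carries a checksum, so the pool can verify (and, given redundancy, repair) everything while it stays mounted. A rough sketch, with the pool name as a placeholder:

          ```sh
          # Walk every allocated block, verify checksums, and repair from redundancy where possible.
          zpool scrub tank

          # Report scrub progress and any checksum or I/O errors that were found.
          zpool status -v tank
          ```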

          Even as a home user, the reason I look to filesystems such as the three in this post is because I want to keep my data safe, and I want reliable access to it. And as far as I can tell, ZFS is the only filesystem that can give me that right now. So as nice as more flexibility would be, it’s a secondary concern. I just accept that if I want mirrored data, I have to add 2 drives at a time when expanding.
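
          For what it’s worth, that two-drives-at-a-time expansion is a one-liner; the pool and device names here are placeholders:

          ```sh
          # Grow the pool by adding another two-way mirror vdev alongside the existing ones.
          zpool add tank mirror /dev/sdc /dev/sdd
          ```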