1. 56

    1. 11

      I’m pretty excited for bcachefs, I’m really curious to play with some of the features like fs level compression.

    2. 5

      I wish projects like this would clearly say why they exist and why anyone might care to use/contribute/follow them.

      My best guess from a quick search is that the authors either made it just for fun or they made it because they wanted ZFS features but don’t like the license conflict or the btrfs code.

      The proposed value for Linux users seems to be “btrfs but better”.

      1. 16

        I think https://bcachefs.org/ is ok at summarising the main points.

        If you’re interested in the origin, bcache itself is a stable, transparent way to split hot/cold data between multiple fast/slow drives through caching. The author at some point said effectively - you know, that’s halfway to a decent filesystem with good features, so let’s make it happen.

        1. 1

          I did check their site. It told me what but not why.

          1. 7

            The patreon has a lot more about the motivation / why part and an overview of the current options: https://www.patreon.com/bcachefs/about

            1. 8

              Note that the “Status” section is woefully out-of-date. Everything on that list is complete and stable, except for erasure coding (which is not yet stable). I don’t know about the status of DAX/PMEM.

            2. 1

              Thanks, that’s much more helpful.

      2. 13

        Honestly, after fairly extensive experience with using Btrfs during 4 years of running openSUSE as my daily driver, the headline on the website tells you what you need to know:

        “The COW filesystem for Linux that won’t eat your data”.

        That’s it. That’s the story.

        Btrfs is the most featureful FS in the Linux kernel but it is not reliable. I know lots of people are in deep denial about this, but it is not.

        From the Btrfs docs:


        Do not use --repair unless you are advised to do so by a developer or an experienced user, and then only after having accepted that no fsck can successfully repair all types of filesystem corruption. E.g. some other software or hardware bugs can fatally damage a volume.

        From the SUSE docs:

        WARNING: Using ‘--repair’ can further damage a filesystem instead of helping if it can’t fix your particular issue.

        It is extremely important that you ensure a backup has been created before invoking ‘--repair’. If any doubt open a support request first, before attempting a repair. Use this flag at your own risk.

        You want a deep examination? Try the Ars one.

        Btrfs is not trustworthy. If you fill the volume, it will corrupt. When it is corrupted, it cannot be fixed and you will lose the contents.

        OpenSUSE uses a tool called “Snapper” to automatically take snapshots before software installation. This means progressively filling the root filesystem. Unless you manually manage this and manually prune them, it will fill up, and when it fills up, it is game over.

        I have been using Linux since kernel 1.0 or so in 1995 (I suspect well over one hundred distros now, if you count major versions) and in the 20th century it was routine for Linux boxes to self-destruct. If the power went at a bad moment, boom, dead.

        21st century Linux is different. It is solid now.

        I used openSUSE for 4 years, and in that time, I had to format and reinstall my OS about half a dozen times or more, because Btrfs self-destructed. This was not a one-off experience.

        Fedora doesn’t have any snapshotting functionality, and so it’s not using Btrfs for the thing that makes Btrfs special. It’s like a moped enjoying perfect engine reliability because you pedal everywhere and never start the engine.

        (Why does it use it then? I think this is why: Fedora leans heavily on Flatpak and is increasing that. Flatpak is extremely inefficient and involves a lot of file duplication. Btrfs can compress that. That means it doesn’t look so bad.)

        Btrfs self-corrupts on full volumes and it does not have a working repair tool.

        That is not acceptable. Therefore Btrfs is not acceptable, and that’s without considering the issues around its RAID handling and so on.

        1. 3

          Yeah. BTRFS is kind of shit. I was recently considering adding more storage to my system and found out that I could select between two allocation policies if my disks were mismatched (or if I added one without balancing). I can use the single profile and it will only write to the bigger disk until it is full (BTRFS’ allocation policy is most free space) or I can pick RAID0 and it will stripe across all disks until eventually it is only writing to a single disk. Both cases end in hotspotting.

          bcachefs on the other hand has a simple policy that uses disks evenly and causes them to self-balance over time https://bcachefs-docs.readthedocs.io/en/latest/feat-multipledevices.html
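
          To make the difference concrete, here is a toy simulation of the three behaviours. It is only a sketch: the policy names are mine, `even_fill` is my assumption about how “uses disks evenly” plays out (always write to the disk with the lowest fill fraction), and real allocators work on chunks and buckets rather than single units.

          ```python
          from fractions import Fraction


          def simulate(policy, capacities, writes):
              """Write `writes` one-unit chunks and return the units used per disk."""
              used = [0] * len(capacities)
              for _ in range(writes):
                  # Only disks that still have room can take the next chunk.
                  candidates = [i for i, c in enumerate(capacities) if used[i] < c]
                  if not candidates:
                      break  # every disk is full
                  if policy == "most_free":
                      # Single profile as described above: most free space wins.
                      key = lambda i: -(capacities[i] - used[i])
                  elif policy == "stripe":
                      # RAID0-like: every disk receives the same absolute amount.
                      key = lambda i: used[i]
                  elif policy == "even_fill":
                      # Assumed bcachefs-like goal: keep fill fractions equal.
                      key = lambda i: Fraction(used[i], capacities[i])
                  else:
                      raise ValueError(f"unknown policy: {policy}")
                  used[min(candidates, key=key)] += 1
              return used


          for name in ("most_free", "stripe", "even_fill"):
              print(name, simulate(name, [4, 12], 8))
          ```

          With a 4-unit and a 12-unit disk and 8 writes, `most_free` gives `[0, 8]`, `stripe` gives `[4, 4]` (the small disk is now full, so everything after that lands on one disk), and `even_fill` gives `[2, 6]`, i.e. both disks are half full.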

        2. 3

          I don’t get the complaint about --repair. This is the same situation as fsck in all filesystems where it exists. If you have a corrupted filesystem, the best a tool can do is make clever guesses about the way to recover. Hopefully the guesses will be right, but they may be wrong. They’re being explicit about it.

          1. 2

            No, it’s not. Have you tried it?

            Like I said, I have fairly extensive, repeated, personal experience of this.

            Fsck is simple and it usually works. I have used it on ext2, ext3, ext4, hfs+, and probably on reiserfs, xfs, and others.

            Usually it fixes some damage, maybe small, and afterwards your filesystem works again. It’s as safe or safer than CHKDSK on the DOS/Windows family. It’s part of my routine maintenance.

            I have never ever seen Btrfs repair successfully fix any form of damage. None at all, not even once.

            What I have seen is a healthy working volume, which I only checked out of curiosity, destroyed by it.

            Don’t be misled. What the warning means is “DO NOT RUN THIS”. It will destroy the volume.

            There is no way to repair a damaged Btrfs volume, even if the damage is trivial.

        3. 2

          I installed openSUSE with Btrfs a few years ago. It worked fairly well until I tried Android development; the VM was unusable (stuttering, low fps, low performance) if the VM image was stored on a Btrfs partition. When I moved the VM to an ext4 partition, everything was smooth again. I searched various forums for the issue but couldn’t find a fix, so I gave up on openSUSE and Btrfs altogether and never looked back. I haven’t experienced any data loss, but from what you write, it was just a matter of time. So it’s good that I gave up.

          1. 5

            VMs are usable on btrfs, you just have to make sure the file or directory has copy-on-write disabled (chattr +C).
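
            A minimal sketch of how that could be scripted, assuming a dedicated directory for VM images (the helper names and the path are made up). As far as I know, setting `+C` on a directory only affects files created in it afterwards, so an existing image would need to be copied in again.

            ```python
            import subprocess
            from pathlib import Path


            def nocow_command(directory):
                """Build the chattr invocation that sets the No_COW attribute."""
                return ["chattr", "+C", str(directory)]


            def make_nocow_dir(directory, run=subprocess.run):
                """Create the directory and mark it No_COW before images land in it."""
                path = Path(directory)
                path.mkdir(parents=True, exist_ok=True)
                # `run` is injectable so the helper can be exercised without btrfs.
                run(nocow_command(path), check=True)
                return path


            # Hypothetical location; prints the command instead of running it.
            print(" ".join(nocow_command("/var/lib/vm-images")))
            ```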

          2. 1

            That’s interesting. I am not a developer but I run lots of VMs. I never had a problem with that.

            It is 100% possible to run openSUSE on ext4, and it works perfectly. I have run both Leap and Tumbleweed like this.

            Leap is very stable and a good config is / on ext4 and /home on XFS. That is pretty reliable and performant in my experience.

            With T’weed you are at the mercy of a bad update but it’s still doable.

      3. 4

        One of the benefits over btrfs today is that bcachefs supports native encryption like ZFS does, meaning you don’t need to juggle LUKS + btrfs.

        I think for the next while at least btrfs or ZFS will still be preferred, but I suspect that bcachefs will be good enough that avoiding the (admittedly minor) effort of keeping OpenZFS working on Linux, such as dealing with missing packages or kernel upgrades breaking things (hello ArchLinux dkms), will make it a success, since it’s natively integrated into the kernel.

        A big adoption point would be if container tooling such as containerd starts to natively support bcachefs the way it supports btrfs and ZFS.

      4. 3

        I think bcachefs is significantly better than ZFS or BTRFS due to flexibility. You can pick storage configuration per-directory, change storage configuration at any time, stripe sizes change automatically as you add disks and mismatched disks are not a problem.

        Overall bcachefs feels like you just add your disks, add your files and it more or less does the best possible thing given your placement constraints. ZFS placement is quite rigid and specified upfront and BTRFS placement is way too stupidly simple.

        I wrote a bit more about this in a blag post a while ago: https://kevincox.ca/2023/06/10/bcachefs-attempt/#bcachefs and a bit more words about the features https://kevincox.ca/2023/06/10/bcachefs-attempt/#summary

        1. 3

          I think bcachefs is significantly better than ZFS or BTRFS due to flexibility.

          Any cross platform support? Native encryption? Zfs send/receive type functionality? Encryption with discard support for SSDs?

          Those are some of the major zfs features for me.

          Ed: I see in a sibling comment that bcachefs supports native encryption.

          1. 3

            The only two platforms known to actually support zfs are FreeBSD and Linux. No other filesystem than FAT has a good chance at being writable from every major OS

            1. 6

              The only two platforms known to actually support zfs are FreeBSD and Linux.

              ZFS is also the native filesystem for Illumos and other descendants of Solaris.

            2. 6

              Well, there’s opensolaris of course. NetBSD?

              Looks like macOS is limping along, and the Windows port labels itself as “beta”.



          2. 1

            Can you explain the utility of ZFS send/receive? Specifically, what it offers that other tools do not.

            1. 1

              I can know without further confirmation that the files at the other end are exactly what I was sending, and the transit was minimal. I also can then roll back on the remote server to before that send happened, in case I want to check something that was deleted during that sync.

            2. 1

              other tools

              I’m not sure there are “other tools”. Traditionally filesystems have dump/restore for backup/restore, there’s rsync for incremental sync and restic/borg/duplicity support encryption and authentication.

              ZFS send/receive works with ZFS snapshots and native encryption to form a rich, yet simple, solution for sync and backup. It works well with ssh for network transport, or e.g. USB media for local backup. It supports remote mounting (mount the backup on the backup host) if wanted.

              Like most things zfs, it’s the combination of features that creates a powerful emergent whole.
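
              As a rough mental model (not ZFS’s actual stream format, which works on blocks rather than whole files), an incremental send is “the difference between two snapshots”, and a receive applies that difference on top of a snapshot the other side already has. The `send`/`receive` functions and the dict-based snapshots below are purely illustrative:

              ```python
              def send(old_snapshot, new_snapshot):
                  """Return an incremental stream: what changed and what was deleted."""
                  changed = {path: data for path, data in new_snapshot.items()
                             if old_snapshot.get(path) != data}
                  deleted = [path for path in old_snapshot if path not in new_snapshot]
                  return {"changed": changed, "deleted": deleted}


              def receive(base_snapshot, stream):
                  """Apply a stream to a copy of the base, yielding the new snapshot."""
                  result = dict(base_snapshot)
                  result.update(stream["changed"])
                  for path in stream["deleted"]:
                      del result[path]
                  return result


              monday = {"a.txt": "v1", "b.txt": "v1"}
              tuesday = {"a.txt": "v2", "c.txt": "v1"}

              # The backup host already holds Monday's snapshot.
              remote = {"monday": dict(monday)}
              stream = send(monday, tuesday)
              remote["tuesday"] = receive(remote["monday"], stream)
              ```

              Here `remote["tuesday"]` ends up identical to `tuesday`, only the changes travelled, and `remote["monday"]` is still there to roll back to.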

        2. 2

          From your link:

          ZFS […] If you have 2 disks mirrored you can’t add a 3rd disk and still have all data stored twice.

          That isn’t currently the case anymore, you can add a third disk to a mirror setup and it’ll resilver just fine to have triple copies of everything.

          Though the RAID limitation still applies (2.1 added DRAID, which does let you expand IIRC, at massive complexity cost and it still has issues; RAID Expansion is still in the pipes).

          1. 3

            to have triple copies of everything

            Did you make a typo? The requirement is 3 disks and 2x redundancy. I’ll admit that I didn’t look into the configuration too closely but this seemed difficult to do with ZFS. Especially if you were starting from 2 disks and wanted to live migrate to 3.

            My understanding is that at the end of the day you configure your performance/redundancy into a virtual block device. You then put a more-or-less single-disk filesystem on top of that. (It is smarter than your average single-disk filesystem because it is aware of the topology but IIUC the behaviour is basically the same). You can grow by adding more disks to the virtual block device but you don’t really have much option to change the existing devices in the pool. (As you said there are some limited options such as 2 disks replicated can become N disks replicated).

            bcachefs (and BTRFS for this case) are fairly flexible about this. They see that you want 2 copies and they pick two disks and put the data there. Adding more disks gives them more options to pick from. With bcachefs you can get even more flexible and say “everything in this directory shall have 4 copies” and it will replicate that data more times (well it won’t proactively replicate existing data IIUC, it will replicate new writes or on a balance operation) even if that filesystem previously had no configuration with 4x redundancy.

            1. 1

              No typo, all data can be saved thrice.

              ZFS’ internal structures are messy, but essentially there are two types of vdevs: RAID and Mirror.

              A non-mirror vdev (ie, just a plain disk, no redundancy), is simply a special case of a RAID1 with no partners (though internally it simply presents the device directly as pool member).

              Mirror vdevs can be brought up and down in redundancy by adding or removing disks. If your pool is only mirror disks, you can even split it in half and have two pools with identical data. You can also remove and add mirror vdevs at your pleasure. Even convert single-disks into a mirror configuration.

              RAID vdevs are much more strict. They cannot be expanded once created, they can’t be split and cannot be removed from a pool. This is due to the internal block pointers being in a special format to handle the RAID stripes.

              The RAID Expansion work doesn’t even change that, it instead allows you to rewrite a RAID vdev in a way where it has more members than before. The old data will still only exist at the previous redundancy level (ie, same parity-data ratio) and only new data will benefit from the better parity to data ratio.
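
              As a worked example with assumed numbers (a 4-disk single-parity vdev expanded to 5 disks; the helper name is made up):

              ```python
              from fractions import Fraction


              def data_fraction(disks, parity):
                  """Share of each full stripe that holds data rather than parity."""
                  return Fraction(disks - parity, disks)


              old_ratio = data_fraction(4, 1)  # blocks written before the expansion
              new_ratio = data_fraction(5, 1)  # blocks written after the expansion

              # Raw space consumed by 12 units of data under each layout.
              raw_old = 12 / old_ratio
              raw_new = 12 / new_ratio
              print(old_ratio, new_ratio, raw_old, raw_new)
              ```

              So data written before the expansion keeps costing 16 raw units per 12 units stored (3/4 efficiency), while new writes cost 15 (4/5 efficiency).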

              1. 2

                No typo, all data can be saved thrice.

                A mistake in understanding then?

                The objective stated was “a filesystem that has 3 disks, and all data has 2 replicas”. Number of data replicas < number of disks. This saves 50% of your storage space if your failure budget only requires 2 replicas.
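
                To spell out the arithmetic, here is an idealized calculation assuming three equal disks of 4 TB each (the sizes and the helper name are made up, and metadata overhead is ignored):

                ```python
                from fractions import Fraction


                def usable_capacity(disk_count, disk_size, replicas):
                    """Usable space when every block is stored `replicas` times."""
                    if replicas > disk_count:
                        raise ValueError("need at least as many disks as replicas")
                    return Fraction(disk_count * disk_size, replicas)


                three_way_mirror = usable_capacity(3, 4, replicas=3)
                two_replicas = usable_capacity(3, 4, replicas=2)
                gain = two_replicas / three_way_mirror - 1
                print(three_way_mirror, two_replicas, gain)
                ```

                A 3-way mirror leaves 4 TB usable, while 2 replicas spread over the same 3 disks leave 6 TB, which is 50% more.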

      5. 3

        Why would anyone be worried about the ZFS license? Oracle’s never sued anyone over IP they acquired from Sun have they?

        1. 1

          You would have to ask them, but a bunch of people do complain about it.

          Now that this comment thread has helped me learn more about bcachefs, I don’t think the licensing was a huge deal. The bigger motivations are: bcachefs devs think they can do much better than ZFS or BTRFS; and also ZFS will never be first class on Linux (because Torvalds doesn’t want it in the kernel, etc.)

          1. 3

            Why would anyone be worried about the ZFS license? Oracle’s never sued anyone over IP they acquired from Sun have they?

            You would have to ask them, but a bunch of people do complain about it.

            I think that was a facetious comment; there’s certainly been at least one very high profile instance of Oracle doing exactly that.

            1. 1

              Makes sense, thanks. I had forgotten about that case (generally not very interested in licensing disputes).

          2. 1

            Eh, Torvalds doesn’t ship binaries. It’s companies like RedHat/IBM, Amazon, Google, Samsung and Microsoft that ship kernel binaries and wouldn’t want legal exposure.

            1. 1

              Canonical paid for & published a legal opinion and shipped binaries including ZFS for quite a while.

              OpenZFS not being in the Linux kernel has the social effect that the kernel devs don’t maintain compatibility with it, which has resulted in occasional breakages, which puts a maintenance burden on distributors (which is maybe why Canonical are dropping/mothballing their ZFS interface code?).

              1. 1

                Canonical, a small UK company, is not the same scale of legal target as big US companies like IBM, Google, Microsoft and Amazon. Some of them won’t even ship GPLv3 code, let alone mixed GPLv2 & CDDL. I am not a lawyer, but I’ve been close enough to corporate lawyers’ IP discussions to expect that they prioritize covering their asses from huge lawsuits over enabling a few long tail filesystem features (from their perspective at least)…

      6. 3

        ZFS features

        Some features bcachefs might have that ZFS lacks are defragmentation and “resizing flexibility” (links are to previous Lobsters comments).

        1. 4

          Thanks, those links led me to the bcachefs manual, and that answered my questions.


          Their value proposition is btrfs or ZFS but better, and especially faster. They think they can achieve it by using a simpler architecture inspired by databases.

      7. 2

        The proposed value for Linux users seems to be “btrfs but better”.

        That’s my rough understanding as well. Though I don’t think it’s supposed to be a direct clone of everything btrfs does — more that there’s been lots of promising filesystem ideas since ext3/4, and btrfs and bcachefs are drawing from that same pool.

        I’m not 100% up to speed on why btrfs failed to take off. Is it just that it had issues with RAID not working/being unsafe?

        1. 5

          As a relatively casual observer, my understanding is that:

          • Even in server and workstation applications, most use cases don’t need the advanced features of btrfs or their equivalents in ZFS, making the overheads not worthwhile for some users, and on some devices (embedded, phones, etc.) it was not feasible as an EXT4 replacement.

          • COW filesystems tend to impose IO overheads that make them undesirable

          • Some of btrfs’s flagship features other than COW can be replicated by existing features available to Linux users; XFS permits differential and incremental backups of filesystems, which can approximate snapshotting, and LVM2 gives existing filesystems features like block device spanning and online partition resizing.

          • Red Hat was the big driver of btrfs adoption, but they abruptly decided to abandon support for btrfs (even as an optional OS) in RHEL starting from RHEL 8, and Fedora consequently only uses btrfs on Workstation, not Server. My understanding is they were sponsoring a lot of the btrfs work, too.

          • For those who truly did need all of the options btrfs provided, ZFS provided more, didn’t have the reputation for corruption and instability that btrfs had (probably unfairly after a few years), and had a lot more materials and documentation on management and deployment.

          Red Hat in the end decided to create Stratis, a userland program which replicates some btrfs functionality as a management layer over XFS and LVM2, though I’m not sure how widely this is adopted relative to btrfs.

          1. 4

            didn’t have the reputation for corruption and instability that btrfs had (probably unfairly after a few years)

            In my experience this reputation is not unfair at all when it comes to BTRFS’ advanced features. I’ve been running BTRFS in RAID5/6 from 2016 to beginning of 2023 and have experienced data corruption roughly every year in that period. Can’t complain though, BTRFS tools all warn you about RAID5/6 being completely broken every time you run them.

            1. 6

              I don’t get it. Your setup was unsupported and documented as such with warnings. Why do you think it’s fair to say it’s unstable?

              It’s like randomly moving some files in the system directories after clicking “yes, I know this will cause issues” and saying the system is possible to corrupt and unstable… You’re really on your own in that situation.

              1. 5

                Which is a pretty good demonstration of why one might want to build a new filesystem vs btrfs… IMO RAID5 is a pretty fundamental feature for a modern filesystem, both ZFS and Linux md have had it working fine since forever, and it’s been well-known to be entirely broken in btrfs for… what, seven years or more?

                Coincidentally, the author apparently started work on bcachefs in 2015.

              2. 1

                I don’t get it. Your setup was unsupported and documented as such with warnings. Why do you think it’s fair to say it’s unstable?

                Because it’s unstable regardless of whether the unstableness is documented or not. Why would it be unfair to say something that is factually true? (This is a genuine question, I am wondering if I am lacking a subtlety in my understanding of the English language.)

                1. 1

                  If you’re doing something explicitly unsupported, then it shouldn’t affect how the system is described. It’s like spraying your computer with water or plugging it into 300V and saying it’s unstable because it crashes - the user guide did tell you not to do that. My phone has an option to enable some experimental GPU features which actually crash when starting games - I can’t say my phone is unstable, just because I explicitly enabled that myself. It’s an option we’re not prevented from taking, but doing those things only says something about us, not about the system.

                  1. 1

                    I can’t say my phone is unstable, just because I explicitly enabled that myself.

                    You can’t say the phone is unstable, but you can say it’s unstable when you enable experimental GPU features, the same way I’m not saying BTRFS is unstable, I’m saying BTRFS is unstable when using its advanced features. I really don’t understand where our disconnect comes from.

                    1. 1

                      Because it’s not unstable when using advanced features. Advanced features like compression, snapshots, many raid levels work just fine. It’s unstable when using experimental unstable features. And pointing that out is not meaningful, because it’s the same for an experimental unstable feature of any project by definition.

                      1. 1

                        Advanced features like compression, snapshots, many raid levels work just fine.

                        I’ve had issues with snapshots and other raid levels too unfortunately (although less frequently than with RAID5/6).

            2. 4

              Why don’t you run a less buggy RAID setup?

              1. 4

                I was just having fun playing with BTRFS, I only used the raid for data I did not care about losing :)

                1. 2

                  Heh, fair enough. Hope you are doing well!

      8. 1

        And a decent name. bcachefs sounds like something Facebook would run on their CDN nodes.

        Side note: sweet buddha, phoronix is full of ad cancer.

        1. 1

          In fact, AIUI, FB is a big Btrfs user.

          But then its data is highly volatile anyway.