1. 66
  1. 13

    I wonder why the kernel community seems to have structural issues when it comes to filesystems - btrfs is a bit of a superfund site, ext4 is the best most people have, and ReiserFS’s trajectory was cut short for uh, Reasons. Everything else people would want to use (e.g. ZFS, but also XFS, JFS, AdvFS, etc.) is a hand-me-down from commercial Unix vendors.

    1. 13

      On all of the servers I deploy, I use whatever the OS defaults to for a root filesystem (generally ext4) but if I need a data partition, I reach for XFS and have yet to be disappointed with it.

      Ext4 is pretty darned stable now and no longer has some of the limitations that pushed me toward XFS for large volumes. But XFS is hard to beat. It’s not some cast-away at all; it’s extremely well designed, perhaps as well as or better than the rest. It continues to evolve and is usually one of the first filesystems to support newer features like reflinks.
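
      To illustrate the reflink point, a rough sketch on an XFS mount (device and file paths are just placeholders):

        # reflink support is a mkfs-time option on XFS (and the default on recent xfsprogs)
        mkfs.xfs -m reflink=1 /dev/sdX1

        # clone a large file without duplicating its extents; only blocks that are
        # later modified in either copy get new space
        cp --reflink=always /data/big.img /data/big-clone.img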

      I don’t see why XFS couldn’t replace ext4 as a default filesystem in general-purpose Linux distributions; my best guess as to why it hasn’t is some blend of “not-invented-here” and the fact that ext4 is good enough in 99% of cases.

      1. 3

        It would be great if the recent uplift of xfs also added data+metadata checksums. That would make it perfect for a lot of the situations where people currently reach for zfs/btrfs.

        It’s a great replacement for ext4, but not for those other situations, really.

        1. 1

          Yes, I would love to see some of ZFS’ data integrity features in XFS.

          I’d love to tinker with ZFS more but I work in an environment where buying a big expensive box of SAN is preferable to spending time building our own storage arrays.

          1. 1

            I’m not sure if it’s what you meant, but XFS now has support for checksums for at-rest protection against bitrot. https://www.kernel.org/doc/html/latest/filesystems/xfs-self-describing-metadata.html
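
            A rough way to check whether an existing volume was formatted with the checksumming on-disk format (the mount point is a placeholder):

              # xfs_info prints the geometry of a mounted XFS filesystem;
              # crc=1 means the v5 on-disk format with CRC checksums is in use
              xfs_info /srv/data | grep crc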

            1. 2

              This only applies to the metadata though, not to the actual data stored. (Unless I missed some newer changes?)

              1. 1

                No, you’re right. I can’t find it but I know I read somewhere in the past six months that XFS was getting this. The problem is that XFS doesn’t do block device management, which means at best it can detect bitrot but can’t do anything about it on its own, because (necessarily) the RAIDing would take place in another, independent layer.

          2. 3

            I don’t see why XFS couldn’t replace ext4 as a default filesystem in general-purpose Linux distributions

            It is the default in RHEL 8 for what it’s worth
            https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_file_systems/assembly_getting-started-with-xfs-managing-file-systems

          3. 2

            Yep. I’ve been using xfs for 20 years now when I need a single-drive FS, and I use zfs when I need a multi-drive FS. The ext4 and btrfs issues did not increase my confidence.

          4. 9

            Sounds about right.

            Anyone ever used bcachefs in practice? I’ve been watching it off and on for a while now, but it seems slow to mature. Though it advertises itself as more or less stable, it’s not yet feature-complete and isn’t mainlined into the Linux kernel yet.

            1. 8

              While the work seems slow, Kent Overstreet is a very dedicated man. There are a few commits to bcachefs a couple of times a week at least. He’s not really advertising it, but continues to polish the code.

              I haven’t used it yet (I did use bcache itself a lot), but I’m certainly hoping for the best and chipping in for the development. https://www.patreon.com/bcachefs

              1. 1

                I’m quite excited about it and have been meaning to give it a spin at some point. As soon as it’s in the Linux kernel and easy enough to test-drive for my rootfs, I plan to switch. I can handle a little experimental if it seems like it’s at least heading toward being a rock-solid FS.

              2. 6

                I ran a server with btrfs on it in grad school for several years. Big regret. I chose it because we wanted the transparent compression on a couple of data directories—we weren’t even using the software raid (the server had a hardware raid controller)—but it ended up requiring a lot of attention to keep the system up and running.

                There are two main issues I ran into:

                1. BTRFS metadata isn’t compacted/cleaned automatically. This meant that I’d periodically need to run whatever btrfs command did that in order to keep things under control (a rough sketch follows this list). If most of your data sticks around a while, I don’t think this would be a problem, but we ran simulations that would generate large volumes of output that would be post-processed and then deleted, leaving us with a bunch of metadata for things that no longer exist.

                2. BTRFS performs very badly if the disk ever actually gets full. Like, cleaning up the used disk space isn’t enough to restore system performance—it’s only step 1. On several occasions we had runaway simulations fill the entire raid and had to reboot the system after cleaning up in order to get performance back to acceptable levels, even for concurrently-running simulations that weren’t IO-bound. IO just took forever to complete until a reboot, despite plentiful free space.
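
                (The sketch mentioned above: as far as I recall, the periodic cleanup was a filtered balance, roughly like this, with the mount point as a placeholder.)

                  # show how much space is allocated to metadata chunks vs actually used
                  btrfs filesystem df /scratch

                  # rewrite metadata chunks that are less than 50% used, so mostly-empty ones get freed
                  btrfs balance start -musage=50 /scratch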

                I wanted to reformat the server to ext4 before I graduated but then COVID happened and I couldn’t go back to campus to do that, so as far as I know it’s still running BTRFS.

                EDIT: interesting that /u/Jamietanna actually experienced that IO wait problem too (https://www.jvt.me/posts/2018/12/22/leaving-btrfs/).

                Finally, very recently I’ve been receiving a lot of IO wait, which has been bringing my pretty high end hardware to a halt.

                That’s the first other mention I’ve found of this problem.

                1. 2

                  I also had big IO wait issues on a btrfs volume. Laptop became unusable.

                2. 4

                  It has been the default filesystem in Fedora since 33.

                  1. 4

                    Yes, you read that correctly—you mount the array using the name of any given disk in the array. No, it doesn’t matter which one

                    What the actual f… how was this deemed acceptable by anyone ever?!

                    1. 6

                      Why do you think it’s a problem specifically?

                      1. 4

                        It’s extremely confusing. If a disk device is used in an array, mounting it should return EBUSY. Mounting a disk and mounting an array with the disk are semantically different. A disk != the array containing it.

                        1. 2

                          This idea is not set in stone. It’s just something we’re used to. FUSE started to change that - you can mount a config with credentials to talk to a database, you can mount a key file to publish DNS records, etc. Mount just takes a file descriptor and the selected driver and says “now kiss”.

                          I agree it’s not common, but it can be explained in one sentence, so not that confusing. It saves you from an extra utility otherwise needed to create the array device, which is nice.
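
                          For btrfs it ends up looking roughly like this (device names are placeholders):

                            # make a two-device raid1 filesystem; there is no separate array device to assemble
                            mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

                            # tell the kernel which block devices belong to btrfs filesystems
                            btrfs device scan

                            # mounting either member brings up the whole array
                            mount /dev/sdb /mnt     # or: mount /dev/sdc /mnt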

                      2. 2

                        This is one of the least offensive parts in an article filled with WTF; seems like an odd thing to focus on.

                      3. 3

                        I decided to use btrfs in production on my personal server around 2013 or 2014, because it looked really neat and the wiki said it was used in production by large companies, including Facebook. I got burned two or three times by btrfs (something weird happened with the array and I had to spend hours copying data off the disk to reinitialize the filesystem - at least, whatever data was actually readable) before giving up on it and going to ZFS. I was using RAID1 almost the entire time, and never RAID5/6. Fool me once…

                        Since then I’ve realized the mismatch in expectations: Facebook or other big companies using btrfs in prod doesn’t mean it’s stable or “production ready”. It means Facebook has a team of kernel engineers who can fix it when it breaks, or advise what exactly can be done without it breaking in an unacceptable way. Most of us do not have that luxury.

                        1. 2

                          I liked btrfs, but moved away from it a couple of years ago because I wasn’t getting the benefits I thought I was

                          1. 2

                            I don’t see the point of shitting on btrfs, if you don’t like it don’t use it.

                            1. 20

                              This is not shitting on btrfs. There are a couple of comments which are personal opinions there, but overall it’s a good, detailed description of existing issues with the system. I welcome that as opposed to the usual “btrfs is unstable” comments without any substance behind them.

                              1. 1

                                A fair point; I guess this was a knee-jerk reaction to what you described.

                                I use it on my desktop and have had no problems; the snapshots also worked great for me.

                            2. 2

                              What’s the advantage of “filesystem with built-in RAID” in the first place? Layering things on top such as mdadm in Linux or GEOM in FreeBSD seems to work well enough?

                              1. 4

                                For simple mirroring, what do you do when the two disagree? RAID-1 was designed on the assumption that you didn’t get undetectable errors on disks, you’d either get the data you wrote or a read failure. If you get the read failure then you use the version from the other disk. A paper from Google a few years ago showed that single-bit errors are surprisingly common on modern disks, so now what does your block-level RAID-1 implementation do? It needs to ask the filesystem which one is the real one, which requires block-level checksums. But block-level checksums are much more efficient if you are able to write them atomically in a block size that isn’t the raw size of a disk sector, so now your filesystem can’t use your block device layer as a block device layer.

                                Once you’ve managed to correct it, you now need to tell the filesystem layer that it is okay to read the block but it had a transient failure so the filesystem should not use that block anymore. Or you need to add a remapping layer, but you don’t want to do that because your filesystem is already a remapping layer (or, at least, a mapping layer).

                                With most RAID forms, there’s a write hole, where writes are persisted to one device but not all of them in the redundancy set. This means that your higher-level filesystem needs to be aware of this when doing atomic updates (which are the simplest primitive of any filesystem) and understand that the atomicity of your RAID set is not quite the same as the atomicity of a single disk (or you need to wait for the slowest disk to report back on every write, which is bad for performance).

                                When you want to resilver after, for example, a single disk failure in RAID-5 or RAID-6, then a block-level RAID abstraction needs to read every block, even ones that aren’t storing any data. It can’t take advantage of any filesystem knowledge.

                                ZFS provides a richer block-like abstraction. The lowest level is just a persistent blocks store that knows about checksums and provides an atomic transactional update to ranges of blocks. The next layer up provides an object model for storing objects of different sizes and managing updates to parts of an object using CoW semantics. The layers on top of this provide user-facing semantics, such as a POSIX-compatible filesystem or a block device.
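
                                To make the mirror case concrete, this is roughly how it looks on ZFS, where checksums and redundancy live in the same stack (pool name and devices are placeholders):

                                  # a two-way mirror; every block is checksummed, with the checksum stored
                                  # in the parent block pointer rather than next to the data
                                  zpool create tank mirror /dev/sdb /dev/sdc

                                  # scrub reads every allocated block, verifies checksums, and repairs a bad
                                  # copy from the good one on the other side of the mirror
                                  zpool scrub tank
                                  zpool status -v tank   # per-device READ/WRITE/CKSUM error counts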

                                1. 3

                                  Layering things works really well until you discover that a problem that is mostly visible in one layer needs to be solved at a different layer.

                                  Compression and encryption will fight each other if not done in the right order. Integrity: does that go between the kernel and compression or between compression and RAID? When integrity finds a problem, how does RAID solve it?

                                  With enough layers, performance starts to become an issue.

                                  With enough layers, fixing problems becomes a finger-pointing exercise between dev teams.

                                  I like ext4 on mdadm for boot/root well enough, but if bootloaders worked better with ZFS and there was a little better support in my OS rescue system, I’d probably go to ZFS there as well.
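
                                  The boot/root setup I mean is nothing fancy; a rough sketch, with devices and the mount point as placeholders:

                                    # block-level mirror first, filesystem on top: each layer is ignorant of the other
                                    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
                                    mkfs.ext4 /dev/md0
                                    mount /dev/md0 /mnt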

                                  1. 1

                                    Note though that ZFS is layered, it’s just that the layers are different: https://blogs.oracle.com/bonwick/rampant-layering-violation

                                    btrfs, on the other hand, truly has no layers as far as I can tell. Having the RAID layer know about the filesystem layer lets you do neat things, like set only one directory tree* to be RAID1. E.g. you could set /var to be RAID1 since that’s local data, and leave /usr (maybe minus /usr/local) without redundancy because you can just redownload the packages that populated it. Whether this is a cute trick, legitimately useful, or a dangerously untested potential footgun (or some combination thereof) depends on your perspective.

                                    *: I can’t remember if this can be done to any directory tree or if it must be a subvolume but since subvolumes mostly act like child directories it doesn’t really matter.

                                  2. 1

                                    In the context of ZFS, I remember this paper being cited: “End-to-End Arguments in System Design”.

                                  3. 1

                                    I had really high hopes for btrfs when it first came out. It really seems to have fizzled since. I’m guessing Facebook, and presumably Oracle, still make great use of it, though; I’m curious how exactly.

                                    1. 1

                                      I honestly don’t understand btrfs. Oracle started it because of ZFS licensing issues, but Oracle owns ZFS now. They could just fix the ZFS licensing issue and have a much more mature system right away, but for some reason they don’t. Btrfs feels like mostly sunk cost at this point…

                                      1. 10

                                        From the third paragraph:

                                        Chris Mason is the founding developer of btrfs, which he began working on in 2007 while working at Oracle. This leads many people to believe that btrfs is an Oracle project—it is not. The project belonged to Mason, not to his employer, and it remains a community project unencumbered by corporate ownership to this day. In 2009, btrfs 1.0 was accepted into the mainline Linux kernel 2.6.29.

                                        1. 1

                                          That’s relevant and interesting, but even if Oracle spending no resources contributing to btrfs reduces the benefit of getting ZFS into Linux, I still wonder why they wouldn’t consider it a win for Oracle Linux users to be able to use ZFS…

                                          1. 1

                                            Does Oracle own ZFS with 100% of the copyright assignment? If not, they may not be able to do it, even if they wanted to, since it would require getting agreement from every contributor.

                                            1. 4

                                              They own the version that came to them, but not what the community has done since then. They already got a small piece GPL’d for use in grub2.

                                              1. 4

                                                While Oracle only holds copyright on ZFS, and not on the contributions made to OpenZFS, CDDL section 4 provides a method for them to re-license community contributions.

                                                4. Versions of the License.

                                                  4.1. New Versions. Sun Microsystems, Inc. is the initial license steward and may publish revised and/or new versions of this License from time to time. Each version will be given a distinguishing version number. Except as provided in Section 4.3, no one other than the license steward has the right to modify this License.

                                                  4.2. Effect of New Versions. You may always continue to use, distribute or otherwise make the Covered Software available under the terms of the version of the License under which You originally received the Covered Software. If the Initial Developer includes a notice in the Original Software prohibiting it from being distributed or otherwise made available under any subsequent version of the License, You must distribute and make the Covered Software available under the terms of the version of the License under which You originally received the Covered Software. Otherwise, You may also choose to use, distribute or otherwise make the Covered Software available under the terms of any subsequent version of the License published by the license steward.

                                                  4.3. Modified Versions. When You are an Initial Developer and You want to create a new license for Your Original Software, You may create and use a modified version of this License if You: (a) rename the license and remove any references to the name of the license steward (except to note that the license differs from this License); and (b) otherwise make it clear that the license contains terms which differ from this License.

                                                Via the acquisition of Sun, Oracle is the steward of the license, and therefore could publish a new version which all existing CDDL code would also be covered by (unless the original developer included a notice prohibiting use of future versions). So to make it GPL-compatible, this potential CDDLv2 could include a secondary license clause like the MPLv2 does.

                                                1. 3

                                                  I was under the impression that people were resistant to ZFS due to some GPL vs BSD kinda ideological difference. I wasn’t aware of this, thanks for highlighting it.

                                                  The possibility that the entire OpenZFS effort could, in theory, be appropriated by this section of the CDDL is alarming, and now I have a better understanding of why Linux/GPL folks are opposed to this license.

                                                  Does this mean that the stewards can take OpenZFS and re-license it as a proprietary project, which is essentially a cease-and-desist to all ZFS users everywhere? Of course, in terms of community outrage and developer goodwill it would be a PR disaster, but if ever the winds of business change in a manner where this might not be the worst thing to do, all bets are off.

                                                  1. 8

                                                    Does this mean that the stewards can take OpenZFS and re-license it as proprietary project, which is essentially a cease-and-desist to all ZFS users everywhere?

                                                    No, anything out and available as CDDL v1 will continue to be available under v1. Section 4.2 covers continuing to use the original version it was published under.

                                                    4.2. Effect of New Versions. You may always continue to use, distribute or otherwise make the Covered Software available under the terms of the version of the License under which You originally received the Covered Software.

                                                    Note this isn’t anything unusual; the GPL has a similar clause. See GPLv2 section 9 of its terms and conditions.

                                                    9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

                                                    Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and “any later version”, you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation.

                                                    1. 6

                                                      Most of the license conflict isn’t some ideological difference, but a very practical difference. GPL-licensed code can only be used by software licensed under a GPL-compatible license, meaning roughly that it can only be used by software with a license that has the same or fewer restrictions than GPL. That means using GPL code from a MIT- or BSD-licensed project is no problem, but using GPL-licensed code from a project with a more restrictive license (or from proprietary software) is prohibited by the GPL.

                                                      The CDDL has extra restrictions on what you can do which makes it not GPL-compatible. That means ZFS isn’t allowed to use GPL’d code. Linux is under the GPL, and you can’t be a filesystem inside the Linux kernel without calling Linux functions. So ZFS can’t be integrated into Linux until either ZFS moves to a GPL-compatible license or Linux moves to a permissive license, neither of which is ever gonna happen.

                                                      If the challenge was only that some people who work on Linux had some ideological aversion to the CDDL I bet it’d have been integrated into the kernel a long time ago.

                                                      This is my understanding of the situation from reading stuff on the Internet. There may be inaccuracies. I’m not a lawyer. Also, different lawyers have different views on these things; Canonical’s lawyers, for example, think it’s okay to have CDDL-licensed code as a separate kernel module that’s loaded at runtime, and that the only problem would be having the CDDL-licensed code statically linked into the GPL kernel image itself.

                                                      My personal view of these things is that ZFS is a kinda cute proof of concept which is never going to be actually relevant due to its license, and that I hope BTRFS will eat its lunch. I think it’s sad that so many great open-source filesystem engineers are putting so much excellent work into a dead project.

                                                      1. 6

                                                        Except ZFS already works well and btrfs doesn’t give much hope of catching up this decade…

                                                        1. 3

                                                          That’s true. The most likely outcome is that neither ZFS nor Btrfs will be usable any time soon (or ever), Btrfs for technical reasons and ZFS for intentional licensing reasons. But I think working on the technical problems of Btrfs is more fruitful than sitting around and waiting for ZFS to suddenly not be under the CDDL.

                                                          1. 3

                                                            I would love to help fund a clean-room reimplementation of ZFS (you need two implementations to make a standard anyway…), but I may be the only person interested in that.

                                                            1. 4

                                                              You wouldn’t be the only person interested but the amount of engineering effort required would be phenomenal. Sun spent a lot of time and money on it prior to release, and it’s had constant engineering effort on it in the decades since.

                                                            2. 2

                                                              The most likely outcome is that neither ZFS nor Btrfs will be usable any time soon (or ever),

                                                                Only if you restrict yourself to Linux. I’ve been a happy ZFS user on FreeBSD for over a decade now. It works out of the box, the installer can set up a ZFS pool, and then you can boot from a ZFS root filesystem.