1. 79
  1. 37

    I saw this yesterday and I strongly suspect that this is a result of designing the drive firmware for the filesystem. APFS is a copy-on-write filesystem. This means that, as long as you commit the writes in the correct order, the filesystem remains in a consistent state. As long as the drive has sufficient internal battery power to commit all pending writes within a reorder window to persistent storage, you don’t lose data integrity. Most Linux filesystems, by contrast, overwrite blocks in place as part of normal operation, so this property does not hold for them: you need a journal on the side, along with explicit commit points, to guarantee integrity.
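
    The “commit writes in the correct order” idea is the same one applications use when saving a file atomically: make the new data durable somewhere else first, and only then publish it under the real name. A minimal Python sketch of that pattern (the function name and paths are made up for illustration):

```python
import os

def atomic_save(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either the old or the new content."""
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)       # step 1: new data is durable before anything points to it
    finally:
        os.close(fd)
    os.replace(tmp, path)  # step 2: atomically switch the name over to the new data
    # Optionally sync the directory so the rename itself survives a power loss.
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

    If the machine dies between the two steps, the old file is still intact; only the ordering between them matters, which is exactly what a CoW filesystem enforces internally for its own metadata.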

    1. 16

      As per this tweet, using F_BARRIERFSYNC to get writes in the correct order on macOS is also very slow, so this would very much be a problem for APFS too.

      1. 4

        F_BARRIERFSYNC

        Interesting! Does any other operating system have that?

        To implement a transaction, you need a write barrier. Nothing more; nothing less. Be it a CoW filesystem doing its thing, the OS atomically renaming a file, or an application doing the read-modify-update dance when you save, you need a way to ensure that the new data is written before you point to it. That is the fundamental problem with transactions, no matter which layer implements it. Only that way can you guarantee that you either get the old data or the new data – the definition of a transaction.

        Waiting for the first part of the transaction to finish before you even submit the final write that completes it is such a poor man’s write barrier that you would think this would have been solved a long time ago. It obviously kills performance, but it does nothing for “durability” either: Compared to communicating a write barrier all the way down, the added latency just reduces the chances of saving your data before a power loss. If you care about hastily saving data, you can of course also fsync the final write, but that is a separate discussion: You could do that anyway; it’s irrelevant to the transaction. I think a write barrier could and should replace regular fsync in 999‰ of cases, on every OS.
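
        On macOS the barrier primitive is the F_BARRIERFSYNC fcntl. Python’s fcntl module doesn’t expose that constant, so the sketch below hard-codes the value 85 from macOS’s sys/fcntl.h (an assumption worth double-checking); on other platforms it degrades to a plain fsync, which over-serializes but still preserves ordering:

```python
import os
import sys

F_BARRIERFSYNC = 85  # macOS-only fcntl command; value taken from sys/fcntl.h

def write_barrier(fd: int) -> None:
    """Order all writes issued before this call ahead of writes issued after it."""
    if sys.platform == "darwin":
        import fcntl
        try:
            fcntl.fcntl(fd, F_BARRIERFSYNC)
            return
        except OSError:
            pass  # filesystem may not support barriers; fall through
    # No barrier primitive available: a full fsync is a (much slower) superset.
    os.fsync(fd)
```

        The point of the comment above is that the barrier alone is enough for transactional ordering; the expensive full flush is only needed when you additionally want durability right now.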

      2. 6

        As long as the drive has sufficient internal battery power to commit all pending writes within a reorder window

        which is tricky for desktops running this type of hardware because they do not have a battery.

        Then again, given the success of the M1 desktops and the lack of people complaining about file corruption, I have a feeling that at least under macOS this is a theoretical issue, at which point: why not be quick, if there is no practical downside?

        1. 19

          Usually the battery is internal to the drive. It typically needs to be able to supply power to the drive for under a second, so can be very small.

          1. 20

            The replies show that there seems to be no last-ditch attempt to commit the pending writes back to flash: pulling the power on a Mac Mini results in the loss of the last few seconds of written data. So it appears that the drive does not have sufficient internal battery power, which leaves you with a system that does not have data integrity by default.

            1. 8

              Consumer grade SSDs generally don’t have the hardware for last-ditch flushes; I have definitely seen uncommitted ZFS transactions on my WD NVMe drives.

              The only integrity problem here is that macOS does a fake fsync by default, requiring the F_FULLFSYNC fcntl for an actual fsync.

              1. 14

                A thing I learned from the thread is that Linux is actually the outlier in having standard fsync be a full sync. FreeBSD, I guess, has the same behaviour as macOS, as it is permitted by POSIX.

                What marcan42 was most shocked by was the disk performance. He was only able to get 42 IOPS with a full sync, whereas Linux does the full sync all the time and gets much better performance.

                Either (a) all drives except for Apple’s lie about what happens in FULLSYNC and don’t do enough, or (b) something is wrong with Apple’s drives.
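
                You can measure this number on your own machine. A quick sketch of a synced-write IOPS benchmark (on Linux, plain fsync already flushes the device cache, so this is the “full sync” number; the function name is made up):

```python
import os
import time

def synced_write_iops(path: str, n: int = 100) -> float:
    """Measure how many tiny write+fsync round trips per second a path sustains."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.monotonic()
        for _ in range(n):
            os.write(fd, b"x" * 512)
            os.fsync(fd)  # on Linux this includes the device cache flush
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return n / elapsed
```

                A 7200 rpm spinning disk tops out near its rotation rate (roughly 120 flushes per second at best), while typical NVMe SSDs land in the thousands, which is why an SSD reporting double-digit IOPS is the anomaly the thread is about.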

                1. 14

                  FreeBSD did fsync correctly long before Linux fixed theirs; so did ZFS on all platforms. Correctly meaning not just waiting for the flush but also checking that the flush itself actually succeeded (other OSes not checking is what took Postgres devs by surprise there).

                  or (b) something is wrong with Apple’s drives

                  Yeah, marcan’s suspicion is that they didn’t optimize the controller firmware for that case because macOS cheats on fsync so it never came up.

            2. 1

              If that’s true and a battery/large cap is on the board/controller/ssd, then the initial complaint is a bit overblown and full honest-to-god fsync really isn’t necessary?

            3. 5

              lack of people complaining about file corruption

              APFS is CoW so of course those complaints would be very unexpected. What would be expected are complaints about the most recent FS transactions not getting committed due to macOS doing fake fsync by default (requiring a “full sync” flag to really sync). But either nobody runs high-durability production databases on their Macbooks (:D) or all those databases use the flag.

              1. 21

                They do. SQLite uses F_FULLFSYNC and has done so since it was first incorporated into macOS in 2004. LMDB uses it. CouchDB does. I know Couchbase does because I put the call in myself. I would imagine any other database engine with Mac support does so too.
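
                SQLite even exposes this as a pragma, so you can inspect or set the behaviour from any connection; on platforms without F_FULLFSYNC the flag is accepted but has no effect. A sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # any database; the path doesn't matter here
con.execute("PRAGMA fullfsync = ON")  # request F_FULLFSYNC where the OS supports it
flag = con.execute("PRAGMA fullfsync").fetchone()[0]
print(flag)
```

                Note that stock SQLite defaults this flag to off; the claim above is that the databases shipped on macOS turn it on.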

                1. 1

                  Hmm. Then I’d imagine they’re seeing performance dips on M1 as well, right? I wonder how they’re dealing with that—treating it as a bug or just an unavoidable regression.

                  1. 4

                    I work with SQLite a lot in my day job and haven’t noticed a regression on my M1 MBP (quite the opposite, really.)

                    It’s always been important when working with SQLite to batch multiple writes into transactions. By default every statement that changes the db is wrapped in its own transaction, but you don’t want that because it’s terrible for performance, on any filesystem. So for example my code handles streaming updates by batching them in memory briefly until there are enough to commit at once.
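
                    The batching advice looks like this with Python’s built-in sqlite3 module (table name and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path in real code
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

updates = [(f"event-{i}",) for i in range(1000)]

# One explicit transaction: one commit, hence one (expensive) sync, not 1000.
with con:  # sqlite3's context manager wraps the block in BEGIN ... COMMIT
    con.executemany("INSERT INTO events (payload) VALUES (?)", updates)

count = con.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 1000
```

                    With autocommit-per-statement you would pay the transaction cost a thousand times; batched, you pay it once.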

                    1. 4

                      It’s always been important when working with SQLite to batch multiple writes into transactions.

                      Yes. Transaction speed is a very limited resource on many storage devices. The SQLite FAQ says it well: If you only get 60 transactions per second, then yes, your harddisk is indeed spinning at 7200rpm.

                      my code handles streaming updates by batching them in memory briefly until there are enough to commit at once.

                      Nice! I wish all developers were that responsible. I fondly remember having to clone a codebase to /dev/shm to be able to run its SQLite tests in reasonable time before I had an SSD. When you have a mechanical harddisk, it becomes loudly evident when somebody is abusing transactions. That was also before SQLite got its WAL (write-ahead log) journal mode, which supposedly appends commits to a special *.wal journal file and only later merges them back into the main database during a checkpoint. Have you tried it? It sounds like it would do much of the same as you do in terms of fsync load.

                      1. 4

                        WAL is a big improvement in many ways — besides being faster, it also doesn’t block readers during a transaction, which really improves concurrency. But I think it still does an fsync on every commit, since even an append-only log file can be corrupted without it (on some filesystems.)
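
                        Enabling WAL is a one-line pragma; it’s pairing it with synchronous=NORMAL that actually drops the per-commit fsync (the log is then only synced at checkpoints, so a power cut can lose recent commits but, per SQLite’s docs, not corrupt the database). A sketch:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")  # WAL needs a real file
con = sqlite3.connect(path)
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"
# In WAL mode, NORMAL syncs the log only at checkpoints, not on every commit.
con.execute("PRAGMA synchronous=NORMAL")
con.execute("CREATE TABLE t (x)")
with con:
    con.execute("INSERT INTO t VALUES (1)")
```

                        The default synchronous=FULL keeps the per-commit sync even in WAL mode, which matches the observation above.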

            4. 3

              Is APFS actually guaranteed CoW-only? Other CoW filesystems make optimisations where some updates write in place when it’s deemed safe. Only log-structured filesystems guarantee no in-place updates, if I remember correctly.

              1. 7

                apparently no:

                [APFS lead dev] made it clear that APFS does not employ the ZFS mechanism of copying all metadata above changed user data which allows for a single, atomic update of the file system structure

                APFS checksums its own metadata, but not user data […] The APFS engineers I talked to cited strong ECC protection within Apple storage devices

                Since they do not have to ensure that the whole tree from the root down to every byte of data is valid, they have probably done precisely that kind of “optimization”. Welp.

                1. 3

                  Oh… that makes me sad.

                  But thank you for the link, this is a quote I love for a few reasons:

                  For comparison, I don’t believe there’s been an instance where fsck for ZFS would have found a problem that the file system itself didn’t already know how to detect. But Giampaolo was just as confused about why ZFS would forego fsck, so perhaps it’s just a matter of opinion.

              2. 3

                Do you mean people ought to use a copy-on-write filesystem on Linux and all performance gains (and integrity risks) are gone? Or is it not that simple?

                1. 3

                  You can read up on nilfs v2 or f2fs and look at some benchmarks to get an idea of where things stand. They’re both CoW, but don’t bring in the whole device management subsystem of btrfs or ZFS.

                2. 2

                  marcan managed to trigger APFS data loss and data corruption :-/

                  1. 5

                    Data loss but not data corruption. The GarageBand error came from inconsistent state due to data loss. I didn’t see anything indicating any committed file data was corrupted, but navigating Twitter is a nightmare and I may have missed some tweets.

                    1. 3

                      If GarageBand saves enough data to partially overwrite a file and leave it in an inconsistent state, is that not corruption?

                      That said, it seems more like an APFS problem, as they should know that the full-sync call needs to be made. It’s not OK to skip doing it just because the hardware is apparently absurdly slow :-/

                      1. 3

                        That wasn’t my interpretation. It sounded like there were multiple files involved, and one of them was missing, making the overall saved state inconsistent but not necessarily any file corrupt.

                        The tweet:

                        So I guess the unsaved project file got (partially?) deleted, but not the state that tells it to reopen the currently open file on startup.

                        So probably the project file wasn’t synced to disk but some CoreData state was. I’m not saying APFS cannot corrupt files, it probably can, but I don’t see any strong evidence that it does in these tests. This sounds like write loss to me.

                3. 23

                  Every generation has to discover which operating systems cheat on fsync in their own way.

                  1. 7

                    Hello, $GENERATION here, does anyone have historical examples or stories they’d be willing to share of operating systems cheating on fsync?

                    1. 14

                      Linux only started doing a full sync on fsync in 2008. It’s not so much “cheating” (POSIX explicitly allows the behavior) as it is “we’ve been doing it this incomplete way for so long that switching to doing things the correct way will cripple already-shipping software that expects a fast fsync”. Of course, the longer you delay changing the behaviour, the more software exists depending on the performance of an incomplete sync…

                      The real issue marcan found isn’t that the default behaviour on macOS is incomplete. It’s that performing a full sync on Apple’s hardware isn’t just slow compared to other NVMe drives; the speed is at the level of spinning disks.

                      1. 2

                        It’s funny, I have 25+ years of worrying about this problem but I don’t have a great reference on hand. This post has a bit of the flavor of it, including a reference from POSIX (the standard that’s useful because no one follows it) http://xiayubin.com/blog/2014/06/20/does-fsync-ensure-data-persistency-when-disk-cache-is-enabled/

                        The hard part is the OS might have done all it can to flush the data but it’s much harder to make sure every bit of hardware truly committed the bits to permanent storage. And don’t get me started on networked filesystems.

                        1. 2

                          Don’t forget SCSI/EIDE disks cheating on flushing their write buffers, as documented here. So even when your OS thinks it’s done an fsync, the hardware might not have. It’s one of the earliest examples I remember, but I’m sure this problem goes back to the 90s. I also remember reading about SGI equipping their drives with an extra battery so they could finish pending flushes.

                          1. 1

                            I remember the ZFS developers (in the early 2000s, in the Sun Microsystems days maybe?) complaining about this same phenomenon when they loudly claimed “ZFS doesn’t need an fsck program”. Someone managed to break ZFS in a way that made an fsck program necessary for repair, because their drives didn’t guarantee writes on power-off the way they said they did.

                      2. 14

                        on macOS, fsync() only flushes writes to the drive. Instead, they provide an F_FULLFSYNC operation to do what fsync() does on Linux.

                        This has been the case since at least 2004, this operation is documented public API, and it’s well known to database developers.

                        There was a lot of discussion of this, internal to Apple and public, when this API was added in 10.4(?). Apple’s filesystem architect Dominic Giampaolo had a good write up, but I’m not sure how to find it. My recollection is that

                        • doing a full, completely safe flush is very expensive because it requires telling the disk controller to flush its caches
                        • most OSs at the time weren’t doing the flush-disk-controller part
                        • Darwin added that part to fsync to address database corruption after panics that was showing up in internal macOS builds (I remember seeing this myself in BerkeleyDB and SQLite)
                        • This made fsync super slow and hurt system performance badly
                        • They compromised by rolling back that change but adding an fcntl operation to do the full flush, for databases.

                        Of course that was aeons ago in terms of OSs and storage technologies, so I’m sure much has changed.

                        I will note that this complaint is by someone working on an unofficial unsupported OS without access to full knowledge of the hardware, so they can’t really make statements about what the SSD can and can’t do or what its “real” performance is.

                        1. 18

                          working on an unofficial unsupported OS

                          The comparison is not done with Linux on the Mac though, read this tweet carefully — the performance drop from F_FULLFSYNC on macOS with Apple’s SSD is abnormally awful compared to the drop from the equivalent fsync on a Linux PC with a Western Digital drive.

                          Also, you’re a bit unfair in the dismissal of reverse engineering. Sometimes reverse engineers get to know way more than the manufacturer ;)

                          1. 1

                            That’s interesting. I have some code that makes F_FULLFSYNC calls directly (a b-tree manager) and I can try benchmarking it with and without that call.

                            From comments above it sounds like the full-sync may be less crucial on APFS than on HFS. It used to be necessary (in a database) to avoid corrupting the file if there’s a panic/power failure. If in APFS the worst effect without it is that the commit might be missing (but the file is otherwise ok), that makes it optional in many scenarios, for example replicating records from another data source.

                          2. 4

                            I will note that this complaint is by someone working on an unofficial unsupported OS without access to full knowledge of the hardware, so they can’t really make statements about what the SSD can and can’t do or what its “real” performance is.

                            It is not the author’s fault that Apple does not provide hardware manuals to their users.

                            1. 1

                              You’re right, it isn’t, but it’s also not fair of them (or commenters) to draw conclusions about the hardware based on incomplete knowledge of it.

                              1. 1

                                By that reasoning, Apple’s employees would provide the fairest commentary about the hardware, since they have the most complete knowledge of it. But we generally expect vendors to provide the least fair commentary, since they have the greatest potential financial gain.

                            1. 7

                              Here’s Dominic Giampaolo from Apple discussing this back in 2005, before Linux fixed fsync() to flush past the disk cache — https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00087.html

                              Here’s also a TigerBeetle thread I wrote on this last year, with more of the history and how various projects like LevelDB, MySQL, SQLite and different language std libs handle fsync on macOS — https://twitter.com/TigerBeetleDB/status/1422854657962020868

                              1. 3

                                This is great info, thanks!

                                I’m curious why you published it on Twitter rather than a blog/forum? Twitter is really super-awkward to read because everything’s broken up in little pieces, and after 30 seconds or so the page is blocked by a pop-up telling me to log in if I want to keep reading. 🤬

                                1. 2

                                  It’s a pleasure! The thread was originally part of a Twitter poll: https://twitter.com/TigerBeetleDB/status/1422491736224436225

                                  Would love to write this up more when I get time, along with some of our other storage fault stuff over at https://www.tigerbeetle.com — if you’re curious to learn more, you can get a glimpse there and also in our linked GitHub repo of some of the storage fault challenges we’re solving and how.

                                2. 2

                                  People seem to be calling fcntl first and then falling back to fsync if that returns an error. What I do is call fsync first, then fcntl; any idea why that would be suboptimal? I figure it won’t take any longer because when fcntl runs the kernel cache has already been flushed, so it’s not like anything gets written twice.

                                  1. 2

                                    The fsync should only be a fallback in case of an fcntl error. The extra syscall costs an extra context switch, and with fast storage devices like NVMe SSDs, where a flush is only slightly slower than a context switch, that time could be spent doing another I/O instead. It’s also extra bookkeeping for the OS on top of the context switch.

                                    This is speculation from here on, but I’m guessing you might hit some rare edge cases if you swap the order. For example, the fsync or the fcntl might have some intrinsic delay regardless of whether there is data to sync, or might make the filesystem do something weird. It might even be that macOS has no idea what to tell the device to flush if its own page cache has already been flushed. There was a similar bug a decade or two back on Linux relating to O_DIRECT, so I imagine it’s not far-fetched that unconditionally combining both fsync and fcntl might be interesting.

                                    Unless there’s a reason not to, I would follow the status quo, so that large projects like SQLite can act as a canary for the same course of action, rather than taking a more interesting path that might never receive as many eyes.
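
                                    The status-quo pattern (fcntl first, plain fsync only on error) can be sketched in Python; F_FULLFSYNC only exists in the fcntl module on macOS, so elsewhere this degrades to a regular fsync:

```python
import os
import sys

def full_sync(fd: int) -> None:
    """Flush fd through the drive's write cache, or as close as the OS allows."""
    if sys.platform == "darwin":
        import fcntl
        try:
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)  # flush kernel + device caches
            return
        except OSError:
            pass  # e.g. a filesystem that rejects the fcntl; fall back below
    os.fsync(fd)  # on Linux, fsync already issues the device cache flush
```
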

                                3. 6

                                  That’s what you get from vertical integration. If you control the device firmware, the kernel, and the filesystem, you can squeeze optimisations like this out of the stack. It probably means that if you run Linux on your Apple computer you’ll need a similar-ish filesystem that integrates better?

                                  1. 4

                                    If you want to “squeeze” the same performance as macOS, you would need to do the exact same thing as macOS, which is “blatantly faking fsync calls except ones with an F_FULLFSYNC flag”. Well, no such flag exists on other platforms, but you can get the “faking” part that’s important for performance by doing zfs set sync=disabled your/dataset :)

                                    None of this is dependent on the drive, macOS will always cheat like this and cheating is always faster. The thing with Apple’s internal drive is that not-cheating is abnormally slow on it, compared to other NVMe drives.

                                    1. 4

                                      The problem is that you can, and will, lose data from the last few seconds if your power disappears unexpectedly. How much more important is that to you than the speed of your SSD? That’s your choice, but I know which side I’m leaning towards.

                                      1. 3

                                        A cool thing with ZFS is that you can implement that fsync-cheating granularly. Keep the default on the datasets where you store important documents and you want to be very sure that all writes are saved even if you don’t have a UPS and the power company does maintenance at the most inconvenient moment. Set sync=disabled on the ones where you do performance-sensitive things with whatever-crap data.

                                        1. 1

                                          You can apply that to any filesystem, I believe. You can split out your database mount and use write-back mode, but write-through for the system. LVM should allow it too.

                                    2. 2

                                      I think we ran into this when trying to fix weird corruptions of leveldb (working on IndexedDB in Chrome). It was a really annoying bug to fix - I’m sure we full-sync too much now… corruptions are so hard to debug, especially on random client machines :/

                                      1. 2

                                        This seems a bit silly to ask, but if I don’t care about data integrity is there a way to make Windows/NTFS work similarly, waiting to flush caches until they’re full (ideally with a timeout…)?

                                          1. 1

                                            Thanks!

                                        1. 1

                                          if you’re e.g. running a transactional database on Apple hardware

                                          Does anybody do this?

                                          1. 8

                                            I would assume that since it’s possible to do that, there’s a non-zero population of people doing that.

                                            1. 8

                                              CoreData, used across many Mac apps, uses SQLite internally.

                                              2. 3

                                                SQLite is ubiquitous on Apple devices. Almost anything that needs to store structured data uses it, either directly (Mail.app, or Cocoa’s URL cache) or as the backing store of Core Data.

                                                1. 3

                                                  Yes, but if you look into it (even comments here in this post), SQLite doesn’t suffer from the problem, as it does a full sync.

                                                  1. 4

                                                    And on consumer devices, I’d assume that the speed problem isn’t as critical as if you were doing, idk, a prod SaaS db with tens of thousands of concurrent users touching things