1. 18
  1. 8

    As the author admits, they didn’t really have a second backup - they effectively had a (very slow) RAID1 mirror. Fortunately for them, the Time Machine backup was a real backup & saved their data this time!

    1. 4

      I use restic with B2 instead of rclone. Restic takes care of backup versioning, and I keep old backups according to a policy of daily for the last week, weekly for the last month, monthly for the last year, etc.

      You also get the benefit of backups being encrypted (good for ensuring privacy from Backblaze, bad if you wanted to use the B2 web interface to download data), plus restic’s built-in index and data integrity checks on the remote repository.

      And I’m no cryptographer, but @FiloSottile does have a blog post about its cryptography from 2017.
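
      Concretely, the nightly job amounts to something like this (a sketch rather than my exact script; the bucket name and paths are placeholders, and credentials live in the environment):

      ```python
      # Sketch of the restic + B2 routine; bucket name and paths are placeholders.
      # Assumes B2_ACCOUNT_ID, B2_ACCOUNT_KEY and RESTIC_PASSWORD are set in the environment.
      import subprocess

      REPO = "b2:my-backup-bucket:restic"  # hypothetical B2 bucket and prefix

      # Back up; restic encrypts and deduplicates the data as it uploads.
      subprocess.run(["restic", "-r", REPO, "backup", "/home/me/photos"], check=True)

      # Retention: dailies for a week, weeklies for a month, monthlies for a year.
      subprocess.run(
          ["restic", "-r", REPO, "forget",
           "--keep-daily", "7", "--keep-weekly", "4", "--keep-monthly", "12",
           "--prune"],
          check=True,
      )

      # Check the index and structure of the remote repository.
      subprocess.run(["restic", "-r", REPO, "check"], check=True)
      ```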

      1. 2

        I actually use both. I use restic from my machines to a central location, then rclone all those backups out to b2. That gives me local copies if I need to restore, but also the option for restoring from any snapshot out of b2 as well.
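
        In outline it’s just two stages, something like this (a sketch; the repository path and rclone remote name are made up):

        ```python
        # Sketch of the two-stage flow; repository path and rclone remote name are made up.
        import subprocess

        LOCAL_REPO = "/srv/backups/restic"       # central on-site restic repository
        B2_TARGET = "b2remote:my-bucket/restic"  # rclone remote configured for B2

        # Stage 1: each machine backs up into the central repository.
        subprocess.run(["restic", "-r", LOCAL_REPO, "backup", "/home/me"], check=True)

        # Stage 2: mirror the whole repository, snapshots and all, out to B2.
        subprocess.run(["rclone", "sync", LOCAL_REPO, B2_TARGET], check=True)
        ```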

        1. 1

          Do you worry about corruption of the restic repo when using this setup?

          I’ve considered something very similar to this for my personal setup, but worry that the restic repository might break for some reason, after being uploaded with rclone.

          1. 2

            I haven’t worried about that, honestly. I’m using rclone sync, and if that were corrupting files it would be quite a bug.

            I’ve done simple testing of restores by mounting the B2 remote with rclone, then mounting the restic repository from there, and they’ve worked. Though admittedly I haven’t done any large-scale restores from the b2 copy.

      2. 4

        I think this post buries the lede - according to @c–, a security update from Apple corrupted user files. This sounds like a way bigger deal than backup strategies.

        1. 2

          It’s hard to believe, isn’t it? I don’t have another explanation though. When I discovered the corruption, I stepped back in Time Machine to find the first backup with corrupted files. It was a backup that ran right after the update occurred. The previous backup contained uncorrupted files. As far as I know, no-one had used the iMac in between those two backups. It’s far from proof, but it’s quite a coincidence. Part of my motivation for posting this was to see if anyone else here had encountered anything like it.

        2. 3

          I think the key part is:

          Unfortunately, I did not notice the corruption right away. It was only weeks later when I was going through old photos that I discovered that many of them were now junk.

          The old trope is “backups don’t exist unless you’ve checked you can use them”. But this is more insidious: what if the backup system works, but the user’s data was corrupted before it was backed up? The author’s solution of monitoring for too many changes is a good approach.

          But maybe the idea of backup verification should be extended further: try to verify that images are still “valid” images? This seems useless at first; images are mostly data with a magic header and small frame headers. (Are PNG checksums applied to the data or just the frame headers? Maybe that’s a neat way out of the problem.)
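
          (Partly answering my own question: PNG’s CRC-32 is computed per chunk over the chunk type and chunk data, so it does cover the compressed image data. A stdlib-only sanity check might look like this - an untested sketch:)

          ```python
          # Untested sketch: verify the per-chunk CRC-32s inside a PNG file.
          import struct, sys, zlib

          def png_crcs_ok(path):
              """True if every chunk's CRC-32 matches; False on mismatch or truncation."""
              with open(path, "rb") as f:
                  if f.read(8) != b"\x89PNG\r\n\x1a\n":
                      return False              # not a PNG signature
                  while True:
                      header = f.read(8)
                      if len(header) < 8:
                          return False          # truncated before IEND
                      length = struct.unpack(">I", header[:4])[0]
                      ctype = header[4:8]
                      data = f.read(length)
                      crc_bytes = f.read(4)
                      if len(data) < length or len(crc_bytes) < 4:
                          return False          # truncated chunk
                      if zlib.crc32(ctype + data) != struct.unpack(">I", crc_bytes)[0]:
                          return False          # CRC mismatch: likely bit rot
                      if ctype == b"IEND":
                          return True           # reached the final chunk intact

          if __name__ == "__main__":
              for path in sys.argv[1:]:
                  print(path, "ok" if png_crcs_ok(path) else "CORRUPT")
          ```

          Of course that only catches files that are no longer well-formed; buggy software rewriting a valid-but-wrong PNG would still pass.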

          I think perhaps the realistic answer is to have different retention periods depending on the sensitivity of the data. The author admitted they are cost-sensitive, but the photos are precious. Hence any backup of precious data should be append-only and never delete old versions. But in my chosen backup system, Borg, it’s hard to have granular retention periods.

          1. 3

            To catch issues like this sooner, I have automated your manual step. Instead of running a manual dry-run, looking through it for unexpected changes, and then running a manual real-run, I have my backup script email me the list of files that changed since the last backup as it does the backup - no manual step required. This lets me peruse the email when I have a free minute or two, and it doesn’t take long since this happens daily and the files that SHOULD have changed are fresh in my mind. And if I notice anything suspicious I can go check things out and restore from a previous version if anything is amiss. Seems like the right balance to me.
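
            If the backup runs over rsync, the shape of it is roughly this (a sketch with placeholder paths and addresses, not my literal script):

            ```python
            # Sketch: run an rsync backup, capture the list of changed files, mail it to myself.
            # Paths, host names, and addresses are placeholders.
            import smtplib, subprocess
            from email.message import EmailMessage

            SRC = "/home/me/"
            DEST = "backuphost:/backups/me/"

            # --itemize-changes prints one line per created/changed/deleted item.
            result = subprocess.run(
                ["rsync", "-a", "--delete", "--itemize-changes", SRC, DEST],
                capture_output=True, text=True, check=True,
            )
            changed = result.stdout.strip()

            msg = EmailMessage()
            msg["Subject"] = f"Backup report: {len(changed.splitlines())} changed items"
            msg["From"] = "backup@example.com"
            msg["To"] = "me@example.com"
            msg.set_content(changed or "No changes since the last backup.")

            with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is listening
                smtp.send_message(msg)
            ```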

            1. 1

              That’s a nice idea, thank you. Various cron jobs already email me from my servers, but it didn’t occur to me to try to email myself from the iMac. That would be easy enough and, as you say, a better balance between automated and manual.

            2. 1

              This feels like a job for tripwire or aide, both for detection and for repair, with an additional script looking for hash changes without corresponding time stamp changes.

              1. 1

                Tripwire is a nice tool but I don’t think it would give me anything here that I don’t already get from rsync, would it? And rsync leaves me with another off-site backup, as explained at the end of the post.

                Please could you explain how your additional script would work? Why would there be hash changes without timestamp changes?

                1. 1

                  I expect corruption (or buggy OS file code, possibly flaky memory, etc.) to look like some bits on the disk flipping, rather than like a photo app editing the image and updating the timestamp. So, different checksums, same timestamps.

                  To compare the backups right before and after the upgrade, record timestamps and checksums, by relative path - aide builds a “database” that effectively can be a CSV: path, timestamp, checksum. Run it from the “before” and “after” Time Machine backup paths. Sort the lists by path and diff them. Another script could parse the diff and list the items with file changes and no corresponding timestamp change. I’d expect that to list corrupted files.

                  Caveats:

                  • This all relies on command-line access to the full Time Machine backups, which could be, uh, non-trivial; at least on my networked backups, even an ls was blocked. iTerm2 has Full Disk Access enabled, and sudo doesn’t pick it up either. Maybe giving an interpreter FDA would work.
                  • I’m not sure aide does relative paths, but it’s relatively trivial to replace with a script (sketched below).
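
                  For the replacement, something like this untested sketch is what I have in mind (paths are hypothetical; sha256 standing in for whatever hash aide would use):

                  ```python
                  # Untested sketch: snapshot a tree as (relative path, mtime, sha256) rows,
                  # then compare two snapshots and flag files whose contents changed but
                  # whose timestamps did not - the pattern I'd expect from silent corruption.
                  import csv, hashlib, os, sys

                  def snapshot(root, out_csv):
                      """Walk `root` and write one CSV row per file: path, mtime, sha256."""
                      with open(out_csv, "w", newline="") as out:
                          writer = csv.writer(out)
                          for dirpath, _, filenames in os.walk(root):
                              for name in sorted(filenames):
                                  full = os.path.join(dirpath, name)
                                  rel = os.path.relpath(full, root)
                                  mtime = int(os.stat(full).st_mtime)
                                  digest = hashlib.sha256()
                                  with open(full, "rb") as f:
                                      for block in iter(lambda: f.read(1 << 20), b""):
                                          digest.update(block)
                                  writer.writerow([rel, mtime, digest.hexdigest()])

                  def load(csv_path):
                      with open(csv_path, newline="") as f:
                          return {rel: (mtime, digest) for rel, mtime, digest in csv.reader(f)}

                  def suspicious(before_csv, after_csv):
                      """Yield paths whose hash changed while the mtime stayed the same."""
                      before, after = load(before_csv), load(after_csv)
                      for rel, (mtime, digest) in after.items():
                          if rel in before:
                              old_mtime, old_digest = before[rel]
                              if digest != old_digest and mtime == old_mtime:
                                  yield rel

                  if __name__ == "__main__":
                      if sys.argv[1] == "snapshot":    # snapshot <root> <out.csv>
                          snapshot(sys.argv[2], sys.argv[3])
                      elif sys.argv[1] == "compare":   # compare <before.csv> <after.csv>
                          for rel in suspicious(sys.argv[2], sys.argv[3]):
                              print(rel)
                  ```

                  Run snapshot against the “before” and “after” Time Machine paths, then compare the two CSVs; anything it prints is a candidate for silent corruption.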

                  1. 1

                    I expect corruption (or buggy OS file code, possibly flaky memory, etc.) to look like some bits on the disk flipping, rather than like a photo app editing the image and updating the timestamp.

                    Ah, that’s the problem here. Checksums (as used by tripwire, AIDE, ZFS, btrfs, etc.) are great for detecting abnormal bit flips, but can’t detect buggy or malicious software overwriting files. I don’t really know exactly what caused the corruption in this case - it seemed to happen at the same time as an OS upgrade but I don’t know how - but in the general case, I want to protect against buggy/malicious software.

                    1. 1

                      Checksums (as used by tripwire, AIDE, ZFS, btrfs, etc.) are great for detecting abnormal bit flips, but can’t detect buggy or malicious software overwriting files.

                      I’m missing something. Checksums are good at detecting any changes to file contents, big or small. The file contents appear to have changed, right?

                      Edit: I sketched out how to detect corrupted files from Time Machine, working toward copying them back. It’s a detector for corruption, not a preventative. Something similar could be used for your main drive, and automated.

                      1. 1

                        I phrased that poorly. Checksums can of course detect any change to file contents, but they can’t tell the difference between a legitimate change (e.g. me opening a photo in a photo app, making a deliberate edit, and saving the file) and an unwanted change (e.g. buggy or malicious software overwriting a file’s contents).

                        So we can automate the detection of all changes, but we can’t automate the detection of unwanted changes.

                        1. 1

                          I added to the parent comment, btw. My timing was bad, as it landed after your reply, and I’ll try to remember to reload the page.

                          There are at least three categories of changes coming up:

                          • OS or hardware flakiness
                          • malicious software
                          • other buggy software

                          In picking them apart, I’d probably try to note all changes. File changes without a corresponding timestamp update point to the first two (e.g. malicious software being sneaky), so I’d prioritize looking at those, with notifications or mail.

                          1. 2

                            Yes, your method does provide more information than just “something has changed” - I’d agree that a file change without an updated timestamp is likely a sign that something has gone wrong. It would make a good addition to the suggestion by @jjnoakes to email changes from every backup: I could highlight those cases in the email. Thanks for the idea!

                            I guess what surprised me about all this is that I think I’ve only ever seen those first two categories discussed before. For example, people talk about using ZFS or btrfs checksums to detect bit rot; or using something like tripwire to detect malicious changes to configuration files. But I’ve never before seen much discussion of buggy software making unwanted changes to files that are otherwise expected to change.

                  2. 1

                    Separately, aide could offer historical records of file changes. If it ran nightly, or if you ran checksums before and after an upgrade, then if and when you noticed issues you could look back retroactively and nail down the time easily, regardless of whether the corresponding backups had been pruned.

                2. 1

                  Clever file systems such as ZFS wouldn’t really help here, as ZFS wouldn’t know the difference between me deliberately changing a file and a macOS upgrade writing crap all over my data. Only I know which files I meant to change and which I didn’t.

                  That is missing the point of ZFS as part of a backup strategy. ZFS provides O(1) snapshots and a way of easily backing up the difference between two snapshots. I can back up with ZFS and have it keep a gradually expiring set of snapshots for live recovery (for example, every 10 minutes for a day, daily for a month) and I can stream the weekly ones to remote storage. I’m 90% of the way through writing a backup tool that uses Azure append blobs for the off-site bit so that even if my NAS is compromised it isn’t able to tamper with the older backups.
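
                  In outline the loop looks something like this (a sketch: dataset name and snapshot labels are placeholders, and the upload command stands in for the append-blob tool I’m writing):

                  ```python
                  # Sketch of the snapshot-and-send loop; dataset name, snapshot labels, and
                  # the upload command are placeholders for the real off-site tooling.
                  import datetime, subprocess

                  DATASET = "tank/home"                     # hypothetical dataset
                  today = datetime.date.today().isoformat()
                  new_snap = f"{DATASET}@weekly-{today}"
                  prev_snap = f"{DATASET}@weekly-previous"  # placeholder: last snapshot sent off-site

                  # O(1) snapshot of the current state.
                  subprocess.run(["zfs", "snapshot", new_snap], check=True)

                  # Incremental send: only the blocks that changed between the two snapshots.
                  send = subprocess.Popen(["zfs", "send", "-i", prev_snap, new_snap],
                                          stdout=subprocess.PIPE)
                  # Pipe the stream to whatever carries it off-site (stand-in command here).
                  subprocess.run(["upload-to-offsite", f"weekly-{today}.zfs"],
                                 stdin=send.stdout, check=True)
                  send.stdout.close()
                  send.wait()
                  ```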

                  The more interesting thing here is what happens if you get a bit flip in your encrypted incremental backups: data corruption in the encrypted blob would prevent a restore of that snapshot’s send stream, which would then prevent restoring any subsequent send stream. For v1, I’m happy to just trust the cloud provider’s reliability guarantees, but longer term it’s probably important to add some forward error-correcting codes alongside the streams.

                  1. 1

                    I don’t think I’m missing the point (and as noted later on in the post, I am a fan of ZFS snapshots). Here I am just pointing out that checksums don’t help with the problem that I’m talking about, i.e. detecting data corruption caused by malicious or buggy software.