Nice writeup. I’ll note that…
“and given that it is one of three copies (the other two being on the two hard drives I have attached to the Pi)”
…this is actually two copies if you use failure models based on geography and implementation. The locations and device types are the same, which means the same problem in either category can knock out both of them. The second HD has to be implemented differently (i.e. come from a different vendor) to ensure a problem with the first won’t hit it, too. Also, they usually need to be in separate locations. I lost all my stuff once due to a triple failure that came from them all connecting to one device in one location.
So, you have two copies if it’s an isolated fault affecting one drive but just one if it’s an environmental or drive fault affecting both.
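To make the counting concrete, here is a small sketch that tags each copy with a location and a device type and counts what survives a given fault. The copy names, locations, and device labels are invented for illustration; the thread does not say where the first copy lives.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Copy:
    name: str
    location: str  # hypothetical label
    device: str    # hypothetical label

# Assumed layout: one copy elsewhere, two identical drives on the same Pi.
COPIES = [
    Copy("original", "laptop", "ssd"),
    Copy("hd1", "pi-at-home", "usb-hdd-model-x"),
    Copy("hd2", "pi-at-home", "usb-hdd-model-x"),
]

def survivors(copies, **fault):
    """Return the copies that do not match every attribute of the fault."""
    return [c for c in copies
            if not all(getattr(c, key) == value for key, value in fault.items())]

def failure_domains(copies):
    """Copies sharing both location and device type count as one domain."""
    return {(c.location, c.device) for c in copies}

print(len(survivors(COPIES, name="hd1")))             # isolated single-drive fault
print(len(survivors(COPIES, location="pi-at-home")))  # fault shared by both drives
print(len(failure_domains(COPIES)))                   # independent copies
```

Under these assumptions an isolated drive fault leaves two copies, while a fault tied to the shared location or drive model leaves one.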
Sure - but in 10 years of running home backups, I’ve had 5 drive failures and zero disasters.
I’m pretty happy with having only 2 geographical regions (home and s3) for my backups given how infrequent that kind of issue is.
I’d considered switching to backblaze (cheaper) but haven’t had time to move everything over (it takes a few days to test restoring from backups).
I agree that this is important. I very nearly lost all my drives at once a few years ago, in a house fire, and I’ve been telling people ever since that it isn’t as unlikely as they think. For the first month after the fire, it wasn’t clear whether I’d be able to recover any data, although I ultimately didn’t lose anything.
I encourage everybody to have at least one off-site backup.
If you’re using btrfs, I think it makes sense to take a snapshot of your filesystem and then run the backup program against the snapshot, so that every file in your backup is from a consistent point in time.
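A minimal sketch of that sequence, assuming the data lives on a btrfs subvolume. It only builds the commands and prints them; the paths and the backup tool name are placeholders, and the assumption that the backup tool takes its source directory as the last argument is mine.

```python
import shlex

def snapshot_backup_commands(subvolume, snapshot_path, backup_cmd):
    """Return the commands, in order, for a snapshot-based backup.

    backup_cmd is the argv list of whatever backup tool is in use; the
    snapshot path is appended as its source argument (an assumption).
    """
    return [
        # -r makes the snapshot read-only, so nothing changes mid-backup.
        ["btrfs", "subvolume", "snapshot", "-r", subvolume, snapshot_path],
        backup_cmd + [snapshot_path],
        # Remove the snapshot once the backup has finished.
        ["btrfs", "subvolume", "delete", snapshot_path],
    ]

# Placeholder paths and tool name, for illustration only.
for cmd in snapshot_backup_commands("/home", "/home/.backup-snap", ["my-backup-tool"]):
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would execute the steps in order and stop at the first failure.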
My backup strategy is simple. All my non-code files live on my NAS. They get encrypted at rest and synced to S3 (using aws s3 sync) every morning. I have S3 versioning enabled (with a rolling window) in case I screw up and accidentally delete or overwrite a file and don’t realize it until days later.
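A rough sketch of what such a setup could look like. The comment does not say how the rolling window is implemented; an S3 lifecycle rule that expires noncurrent versions is one common way to do it. The bucket name, local path, and 30-day window below are assumptions.

```python
import json

def sync_command(local_dir, bucket, prefix=""):
    """argv for a one-way upload of local_dir to the bucket."""
    return ["aws", "s3", "sync", local_dir, f"s3://{bucket}/{prefix}"]

def rolling_window_lifecycle(days):
    """Lifecycle configuration that expires noncurrent object versions after `days`."""
    return {
        "Rules": [{
            "ID": f"expire-old-versions-{days}d",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: apply to the whole bucket
            "NoncurrentVersionExpiration": {"NoncurrentDays": days},
        }]
    }

# Hypothetical path, bucket, and window length.
print(" ".join(sync_command("/mnt/nas/encrypted", "example-backup-bucket")))
print(json.dumps(rolling_window_lifecycle(30), indent=2))
```

The JSON can be saved to a file and applied with `aws s3api put-bucket-lifecycle-configuration`; versioning itself has to be enabled on the bucket separately.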
I pay a few bucks a month for s3. I dunno the cost of my nas in terms of electricity. Certainly the build itself blew the OP’s budget though. But I would have the nas even if I didn’t backup anything.
I think the only hole in my plan right now is my encryption keys. I have them printed out, but stored in my house. So if my house burned down I would be SOL. I should probably fix that.
Has anybody else seen issues with Duplicity running out of memory? I’ve been backing up ~100GB without issues on a machine with 8GB RAM.
Just a note – your scenario is nearly two orders of magnitude off of what I was observing (100GB vs 600GB data, and 8GB vs 1GB ram), and I didn’t see the error at first (did a full backup, and several incremental additions before it ran into trouble). If it’s linear (for software that’s been around so long, I would be surprised if it were worse than that), then for you to run into the same problem you would need an archive that is about 4.8TB.
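The extrapolation is simple proportional scaling. As a sketch, assuming memory use really does grow linearly with archive size (which the thread only guesses at):

```python
def archive_size_at_failure(observed_archive_gb, observed_ram_gb, target_ram_gb):
    """Scale the archive size that exhausted RAM, assuming linear memory growth."""
    return observed_archive_gb * (target_ram_gb / observed_ram_gb)

# 600GB exhausted 1GB of RAM; how big an archive would exhaust 8GB?
size_gb = archive_size_at_failure(600, 1, 8)
print(f"{size_gb / 1000:.1f} TB")
```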
Do you happen to remember what block size you used? The default of 25MB is way too small, and I believe signatures for all blocks are required to reside in memory.
I just looked at the config file I had (for the duply frontend), and there wasn’t anything for block size, so presumably it was set to the default? That could explain it, though I think there is a little more going on, unless signatures for all individual files are also in memory (if it were just one per 25MB block, 600GB is only about 25,000 blocks, and each signature would have to take up about 36K of memory to use up the 900+MB that I was seeing).
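The back-of-envelope numbers can be checked directly. This sketch uses decimal units and the exact division, so it lands slightly off the rounded figures in the comment (about 25,000 and 36K):

```python
VOLSIZE_MB = 25          # default size per the thread
ARCHIVE_MB = 600 * 1000  # 600GB, decimal units
OBSERVED_MB = 900        # memory use seen before the failure

chunks = ARCHIVE_MB // VOLSIZE_MB
kb_per_chunk = OBSERVED_MB * 1000 / chunks

print(chunks)        # number of 25MB chunks in the archive
print(kb_per_chunk)  # memory each chunk would need to account for 900MB
```

Either way the per-chunk figure is tens of kilobytes, which is the point: a single signature per chunk seems too small to explain the memory use on its own.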
I just downloaded the duply script from http://duply.net, and it does look like the default is 25MB. One correction to my comment: the terminology is “volume” (--volsize) not “block”.
You’re right that it would have to be saving more metadata per volume for my hypothesis to bear out.
Can you elaborate on what operations were running out of memory?
The actual operation was unpacking the signatures file from a previous backup (it retried, I think, 4 times, each time running for a while and gradually using up all available memory before giving up). I was just trying to make a new incremental backup: I had made one full backup and several incremental backups, and had just added a bunch of new files.
Was it on a new machine or anything like that? I’m wondering if I should retry a backup after blowing away my signature cache.
Thanks a lot for answering these questions! A potential issue with my backups is extremely worrying.
Nope, on the same machine, but it had been wiped and reinstalled at least once (so it’s possible that library versions had changed, and perhaps the memory efficiency of some of them got slightly worse). It’s pretty confusing, because previous incremental backups had worked. The upside with duplicity is that in a catastrophe, you pretty much don’t even need duplicity itself to restore (tar and gpg and a bunch of tedium should do it). :)
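As a sketch of what the tar-and-gpg route might involve: the function below only builds the commands for decrypting and unpacking each volume of a full backup. It assumes the encrypted volumes end in `.difftar.gpg`, which is how I understand duplicity names them. It does not handle incremental backups or files split across volumes, which is where the tedium comes in.

```python
from pathlib import Path

def manual_restore_commands(volume_dir, dest):
    """Build gpg and tar commands for each encrypted volume in volume_dir.

    Assumes volume files end in '.difftar.gpg' and that the backup is a
    full (not incremental) one. Nothing is executed here.
    """
    commands = []
    for vol in sorted(Path(volume_dir).glob("*.difftar.gpg")):
        tar_path = Path(dest) / vol.stem  # same name without the '.gpg' suffix
        commands.append(["gpg", "--output", str(tar_path), "--decrypt", str(vol)])
        commands.append(["tar", "-xf", str(tar_path), "-C", str(dest)])
    return commands
```

Calling `manual_restore_commands("volumes", "restored")` (both directory names are placeholders) returns the list of commands to review before running anything.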
I used a raspberry for this until I took a hard look at what I was storing. Became more digitally minimalistic and have been happy using a private gitlab repo with git-secret for encryption.
I’m using SyncThing at home. Just mirror and sync a folder across multiple machines.
One downside I see is the lack of storage somewhere else while all laptops are at home. Geographic risk.
It also requires all machines to store the full state. ~100GB in my case.
One popular distinction between file synchronization and backups is that you can travel back in time with your backups. What happens if you (or, more realistically, software you use) delete or corrupt a file in your SyncThing repository? It would still be gone/corrupted, and the problem would automatically be synced to all your machines, right?
Personally I use borgbackup, a fork of attic, with a RAID 1 in my local NAS and an online repository to which I, honestly, don’t sync too often, because even deltas take ages with the very low bandwidth I have at home. I did the initial upload by taking disks/machines to work, and I hope that the online copies are recent ‘enough’. I also can’t shake the thought that in scenarios where both disks in my NAS and the original machines are gone/broken (fire at home, burglary, etc.) I would probably lose access to my online storage too. I should test my backups more often!
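A toy model of the difference, using dictionaries as stand-ins for folders. It is not how SyncThing works internally and it ignores any versioning features a sync tool may offer; it only illustrates why a plain mirror cannot undo a deletion while point-in-time snapshots can.

```python
import copy

def mirror(source, replicas):
    """Sync: every replica becomes an exact copy of the source, deletions included."""
    for replica in replicas:
        replica.clear()
        replica.update(source)

def snapshot(source, history):
    """Backup: append a point-in-time copy; earlier snapshots stay untouched."""
    history.append(copy.deepcopy(source))

laptop = {"thesis.txt": "draft 3"}
desktop, history = {}, []

mirror(laptop, [desktop])
snapshot(laptop, history)

del laptop["thesis.txt"]  # accidental deletion

mirror(laptop, [desktop])
snapshot(laptop, history)

print(desktop)     # the deletion has propagated to the mirror
print(history[0])  # the earlier snapshot still holds the file
```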
I use Borg too! At home and at work. I also highly recommend rsync.net, who are not the cheapest, but have an excellent system based on firing commands over ssh. They also have a special discount for borg and attic users http://www.rsync.net/products/attic.html
Hmm - that’s really not the cheapest!
3c/gb (on the attic discount) is 30% dearer than s3 (which replicates your data to multiple DCs vs rsync.net which only has RAID).
Eh - I don’t mind paying for a service with an actual UNIX filesystem, and borg installed. Plus they don’t charge for usage so it’s not that far off. Not to shit on S3, it’s a great service, I was just posting an alternative.
Yeah that’s fair, being able to use familiar tools is easily worth the difference (assuming a reasonable dataset size).
True, though S3 has a relatively high outgoing bandwidth fee of 9c/gb (vs. free for rsync.net), so you lose about a year of the accumulated 0.7c/gb/mo savings if you ever do a restore. Possibly also some before then depending on what kind of incremental backup setup you have (is it doing two-way traffic to the remote storage to compute the diffs?).
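The arithmetic behind those figures, using only the prices quoted in the thread (the S3 storage price is implied by the 3c price and the 0.7c saving, not stated directly):

```python
RSYNC_NET_C_PER_GB_MONTH = 3.0  # discounted rsync.net price quoted in the thread
STORAGE_SAVING_C = 0.7          # S3 saving per GB-month quoted in the thread
S3_EGRESS_C_PER_GB = 9.0        # S3 outgoing bandwidth fee quoted in the thread

# Implied S3 storage price, roughly 2.3c per GB-month.
s3_storage = RSYNC_NET_C_PER_GB_MONTH - STORAGE_SAVING_C
premium = RSYNC_NET_C_PER_GB_MONTH / s3_storage - 1
break_even_months = S3_EGRESS_C_PER_GB / STORAGE_SAVING_C

print(f"rsync.net premium over S3 storage: {premium:.0%}")
print(f"months of savings used up by one full restore: {break_even_months:.1f}")
```

One full restore costs a little under 13 months of the storage savings, which matches the "about a year" estimate.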
Ahh, I hadn’t accounted for the outgoing bandwidth.
That said, if I ever need to do a full restore, it means both my local drives have failed at once (or, more likely, my house has burned down / flooded); in any case, an expensive proposition.
AFAIK glacier (at 13% the price of rsync) is the real cheap option (assuming you’re OK with recovery being slow or expensive).
RE traffic for diffs: I’m using perkeep (née camlistore) which is content-addressable, so it can just compare the list of filenames to figure out what to sync.
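A sketch of why content addressing makes this cheap: when each blob is named after a hash of its contents, deciding what to upload is a set difference on names, with no need to read the remote data. The hash function and name format here are assumptions for illustration, not Perkeep’s actual scheme.

```python
import hashlib

def blob_ref(data: bytes) -> str:
    """Name a blob after its content hash (format is an assumption)."""
    return "sha256-" + hashlib.sha256(data).hexdigest()

def blobs_to_upload(local_blobs, remote_refs):
    """Return {ref: data} for local blobs whose names the remote lacks."""
    local = {blob_ref(blob): blob for blob in local_blobs}
    return {ref: data for ref, data in local.items() if ref not in remote_refs}

local = [b"photo bytes", b"notes v1", b"notes v2"]
remote = {blob_ref(b"photo bytes"), blob_ref(b"notes v1")}

missing = blobs_to_upload(local, remote)
print(len(missing))  # only the blob the remote has never seen
```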
That’s one huge advantage of Resilio Sync. You don’t have to store the full state in every linked node. But until RS works on OpenBSD, it’s a no-go for me.
SyncThing is awesome for slow backup stuff. But I wish I could configure it to check for file changes more often. Currently it takes something like 5 minutes before a change is detected, which results in me using Dropbox for working-directory use cases.
You can configure the scan interval for syncthing. You can also run the syncthing-inotify helper to get real-time updates.
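For illustration, the scan interval is a per-folder setting; as far as I know it is stored in Syncthing’s config.xml as the `rescanIntervalS` attribute on each folder element, and is normally changed through the web GUI. The sketch below edits a trimmed, made-up sample config rather than a real one.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample; a real config.xml has many more elements.
SAMPLE_CONFIG = """<configuration>
  <folder id="docs" path="/home/me/docs" rescanIntervalS="3600"/>
</configuration>"""

def set_rescan_interval(config_xml, folder_id, seconds):
    """Return config XML with the given folder's rescan interval changed."""
    root = ET.fromstring(config_xml)
    for folder in root.iter("folder"):
        if folder.get("id") == folder_id:
            folder.set("rescanIntervalS", str(seconds))
    return ET.tostring(root, encoding="unicode")

print(set_rescan_interval(SAMPLE_CONFIG, "docs", 30))
```

A shorter interval means more frequent full scans of the folder, which is the trade-off the inotify helper avoids.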
As one more option, there are some arm NAS units that more or less “just work” with regular Linux once installed (generic arm-openmp kernel, distro/bootloader integrated kernel updates). I have a Netgear RN104 with Debian on it.
Power consumption is apparently 12W idle, though, so higher than the numbers for Pi 3. But for this you get a compact enclosure, native SATA performance, native gigabit Ethernet.
Great post! I’ll definitely pick some ideas here and there.
I currently use tarsnap for backing up my servers’ configs, git repos and web content (~3G only). $5 was enough to hold my data for a year and a half though.
For my personal data, I have two separate disks which I mirror with rsync, one of them being a USB one that I plug in occasionally. I’m definitely not satisfied with it and I’m working on building a NAS out of a cubox-i + external USB RAID enclosure. I must first ensure the cubox-i can run OpenBSD properly though. I’d use a remote server of mine for external backup copies, and I’d like to find some people (once my setup is ready) to build some sort of community backup (storing copies of other people’s backups so they will store yours).
All I need now is to force myself into it :)