1. 24

Wondering if any of you have a similar system:

Recently I’ve been keeping archives of my files, pictures, and videos. The idea came from watching Mr Robot, where Elliot keeps a binder of his hacks in DVD form under his bed. A lot of my files are like that: I want to keep them around forever, probably out of some egotistical idea that my grandchildren will be interested; but I don’t really need to look at them day to day.

So: I have a folder on my computer called ~/archive/0next (0 so it gets sorted first), where I build out the next archive I’m working on. When I have files that I don’t actively need anymore, they get moved into the 0next folder. When it reaches ~10-20 GB I will create a zip or tar file, burn a Blu-ray, push the file up to my cloud storage, and then delete it all and start over.
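
Roughly, sealing an archive comes down to a handful of commands. A sketch of what I mean (the archive name, disc device, and bucket are placeholders, and I’m assuming growisofs plus the aws CLI):

  # seal the current archive: tar it, burn it, push it to cloud storage, then reset
  cd ~/archive
  tar czf archive-2019-06.tar.gz 0next/
  growisofs -Z /dev/sr0 -R -J -V archive-2019-06 0next/          # burn the BD-R
  aws s3 cp archive-2019-06.tar.gz s3://my-archive-bucket/ --storage-class GLACIER
  rm -rf 0next/* archive-2019-06.tar.gz                          # start the next one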

Sometimes I get interested in looking at old files, so I’ll go grab the Blu-ray in question (or pull the archive out of AWS Glacier) and load it into, say, ~/archive/pictures-2015 so I can flip through them.

With this in place, my “working set” of files tends to stay <20GB, which means I don’t need to buy expensive hard drives, and my backup program runs really quickly.

So far the biggest problems are with search and indexing. It’s hard to remember which archive I put things in, so I’m currently writing a script to create a catalog for me.
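
The catalog doesn’t need to be fancy; something like a per-archive file listing that stays on the machine, which I can grep later. A sketch, assuming GNU find (names are just examples):

  # record what went into the archive before it leaves the machine
  mkdir -p catalogs
  find 0next -type f -printf '%s\t%p\n' | sort -k2 > catalogs/archive-2019-06.txt
  # later: which archive did that photo end up in?
  grep -il 'img_0421' catalogs/*.txt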

  1. 6

    I don’t use it, but I’ve watched a lot of tech talks about Perkeep. From the homepage:

    Perkeep lets you permanently keep your stuff, for life.

    Perkeep (née Camlistore) is a set of open source formats, protocols, and software for modeling, storing, searching, sharing and synchronizing data in the post-PC era. Data may be files or objects, tweets or 5TB videos, and you can access it via a phone, browser or FUSE filesystem.

    Perkeep was started by Brad Fitzpatrick of Memcached and Golang fame and is actively maintained. Might be worth a look.

    1. 1

      I use perkeep and occasionally contribute.

      I think it’d be quite awkward to get it working nicely with rotating old files onto blu-ray.

      It does work OK with Glacier, though - you can set a lifecycle policy on an S3 bucket to rotate files into Glacier after some period of time. The search index (separate from your backups) will still be able to find the files, but I think you’ll have to dig them out and make ‘fetch to S3’ requests yourself (should be a straightforward feature to add).
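
      If it helps, the lifecycle rule is a one-off bit of setup with the aws CLI, roughly like this (the bucket name and the 30-day cutoff are just examples):

        # transition objects to Glacier 30 days after upload
        aws s3api put-bucket-lifecycle-configuration \
          --bucket my-perkeep-blobs \
          --lifecycle-configuration '{
            "Rules": [{
              "ID": "to-glacier",
              "Status": "Enabled",
              "Filter": {"Prefix": ""},
              "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}]
            }]
          }'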

    2. 4

      I have ten years worth of files at ~/doc/archive/yyyy/mm/dd, all collected by running a very simple archive command. I use find and grep to search these, which work well enough to keep me from firing up elasticsearch. About 5 years ago, I moved ~/doc/archive from local disk to a ZFS NAS with a quick fstab entry and a symlink.
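
      The setup and the searches really are as plain as they sound; for example (the mount point and export path are guesses for illustration):

        # NAS mount plus symlink
        #   /etc/fstab:  nas:/tank/doc  /mnt/nas-doc  nfs  defaults,_netdev  0  0
        ln -s /mnt/nas-doc/archive ~/doc/archive

        # day-to-day search
        find ~/doc/archive -type f -iname '*contract*'
        grep -ril 'openbsd' ~/doc/archive/2018/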

      I rotate encrypted drives offsite for the entire NAS, because sneakernet still has the best bandwidth. But given that the archive is append-only, it readily syncs nightly to a bucket via a script using duplicacy, with the encryption done client-side (!).
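
      The duplicacy side is basically an encrypted init plus a cron entry; roughly (the snapshot id, storage URL, and paths are placeholders):

        # one-time setup: -e makes duplicacy encrypt on the client before upload
        cd /mnt/nas-doc/archive
        duplicacy init -e doc-archive b2://my-archive-bucket
        # crontab entry for the nightly push
        # 0 3 * * * cd /mnt/nas-doc/archive && duplicacy backup -stats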

      Photos get their own yyyy/mm/dd directories, to avoid having to search/filter them along with the rest of the archive. That turns out to be the organization strategy used by shotwell, so I use it to import and index my photos. I loaded Resilio Sync (for free) on all mobile devices to transfer the latest photos over wifi, and use a script to gather all photos from the last N days whenever I fire up shotwell.
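
      The gathering script is close to a one-liner; something like this, where the sync directory is wherever Resilio drops the photos:

        # pull the last two weeks of synced photos into the import directory
        find ~/sync/camera -type f -mtime -14 \( -iname '*.jpg' -o -iname '*.mp4' \) \
          -exec cp -n {} ~/photos/incoming/ \;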

      I have a separate timestamped file which collects “journal” thoughts and snippets from books/code/chats (I even have my screensaver write to it when my screen blanks/unblanks, which happens to be useful when billing hours as a consultant). Ultimately it is just stdin appended to a file, with a lock to avoid corruption. grep is natural for this, but I most often use less with a keyword search to jump around. If you’re wondering whether that scales, consider the mbox format for email.
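
      The appender itself is tiny; roughly this, assuming flock(1) (the filename is just an example):

        #!/bin/sh
        # append stdin to the journal under an exclusive lock, with a timestamp header
        journal="$HOME/doc/journal.txt"
        {
          flock -x 9
          printf '\n== %s ==\n' "$(date '+%Y-%m-%d %H:%M:%S')" >&9
          cat >&9
        } 9>>"$journal"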

      No need for Blu-ray discs, unless you are looking forward to labeling them with your favorite album titles to obfuscate your archive. Not to mention the manual work needed to transfer these to your next epoch of storage. Even if you don’t go for the NAS, my Backblaze B2 bill is just over a dollar per month ($1.35 for about 300GB and the append traffic) and a few thumb drives at $20-50 each will get you covered with redundancy and nearline offsite backup (keep an encrypted thumb drive in your car or at a friend’s home).

      1. 2

        Neat! That’s very similar.

        re:bluray – I don’t find that’s something for obfuscation, it’s just the cheapest and least-maintenance way to store some data for 10-15 years (any other technology would probably need to be migrated in that time-frame). ~75 cents to burn a 25GB disc that will still be readable in 2030 is sort of attractive to me. I do use AWS Glacier as well.

        1. 2

          +1

          re:bluray – I don’t find that’s something for obfuscation

          I was building on your Mr. Robot reference. :-)

        1. 1

          I think they would laugh at my measly gigabytes.

        2. 2

          http://git-annex.branchable.com/ is also an option; it is specifically designed to manage large data sets, distributed across varied storage repositories, including Glacier, S3, WebDAV, etc.

          And you can set policies for how many copies you want of various things, what is available in your “working set” on specific machines, and so on.
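
          For example, the copy counts and the per-machine working set are each a command or two (the paths are made up, and I am going from memory, so check the docs):

            # require at least two copies of everything, somewhere
            git annex numcopies 2
            # this machine only wants the current year of photos locally
            git annex wanted here "include=photos/2019/*"
            # fetch content when needed; drop refuses if it would violate numcopies
            git annex get photos/2019/
            git annex drop photos/2015/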

          The biggest drawback (for me) is that as it is git based, it doesn’t play nice with archiving directories that have git projects in them. Or didn’t when I last looked.

          1. 2

            git annex is great software, but you are right about not being able to store git repositories in it.

            I use a combination of git annex and borg backups currently. I’m slowly designing/thinking about a tool to replace both of these use cases.

          2. 2

            I am of the opinion that anything of value I create (and even that is debatable) will live in a git repository anyway.

            So my strategy is: keep a home dir full of stuff, and every (5? 10?) years move it somewhere out of reach where I can still look things up. I don’t think I’ve needed this more than 3 times in my digital life (1994-2019). Of course I do have backups, because I access stuff from the recent few years. But overall? I don’t think I have the urge to solve this problem in a well thought-out way.

            1. 2

              I have been concerned with digital archiving for a while. I used to use a private WordPress instance for taking notes and such. I converted everything to text files in a date-based directory structure.

              Recently I’ve been working on north, a tool to help me read and review the directory of text files. Future goals are for it to also create new content, make it easier to link content, and index for searching, all while keeping the text files and directory structure as simple and self-organizing as possible, to aid long-term archiving.

              1. 1

                In terms of search:

                https://github.com/oniony/TMSU is pretty decent.

                I would recommend git annex, to be honest; it is scary at first, but once you are familiar with it, it is very nice. One thing it supports very nicely is offline storage with redundant copies.
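
                For what it’s worth, the TMSU workflow is just tag-then-query; from memory (so double-check the docs), something like:

                  tmsu init                            # create the tag database for this tree
                  tmsu tag taxes-2018.pdf finance 2018
                  tmsu files finance and 2018          # query by tag
                  tmsu mount ~/tags                    # browse tags as a virtual filesystem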