1. 34
  1. 10

    Nice article, but it’s too bad, because I was really looking forward to the “Actually storing data forever” (or at least a very long time) part, and the author dodges the question by saying “oh well, it’s a broad question”.

    It would be really interesting to see a take on “how to store a particular music recording forever”, for example.

    1. 18

      Author here. I reckon that I did answer this, in the last section of the article. The answer is: you need to have a continuous chain of real human beings from here to your destination to look after the data. What use is it if the data is in a language which has been dead and forgotten for 10,000 years? Or if the contemporary stewards of the archive just stop caring about it? It’s not a sexy answer like “etch it in an asteroid” or “encode it in the DNA of a plant and let it spread”, but those solutions are just sexy - not effective. What happens if the asteroid is flung out of solar orbit, or if the plants go extinct? The data needs ongoing, intelligent maintenance.

      1. 2

        There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable. There is another theory mentioned, which states that this has already happened.

      2. 3

        Tape is considered the best. Lasts a long time, generations of product are designed for backward compatibility, and damage to data is more localized. This link mentions advantages and disadvantages. Apparently, there’s been some improvements in accessing data, too.

        High-quality paper that is sealed from the elements works pretty well, too. I thought about going back to punch cards or using barcodes for low amounts of critical data. Turns out, there was a startup that was backing up people’s data on paper using giant rolls and printing press like newspapers use. It was a trip.

        Lastly, you can use a mix of HDDs and optical media. The optical media is there because it’s not susceptible to electromagnetic interference in its storage. Basically, keep multiple HDDs in multiple places. You script something that lists all the files with their hashes. Periodically, you check the hashes on each copy to see if anything got corrupted; a voting algorithm fixes whatever did. The logistics of checking and rotation I’m leaving to others to work out, since they vary by use case. Just make sure they’re different kinds of HDDs to avoid correlated failures.
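        The hash-manifest-plus-voting scheme above can be sketched in a few lines of Python. This is a toy illustration, not a finished tool (the function names and chunk size are my own; the actual repair/rotation step depends on your setup):

```python
import hashlib
from collections import Counter
from pathlib import Path

def file_hash(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> hash for every file under root."""
    root = Path(root)
    return {str(p.relative_to(root)): file_hash(p)
            for p in root.rglob("*") if p.is_file()}

def vote(manifests):
    """For each file, pick the hash the majority of replicas agree on.
    Returns {name: (winning_hash, [indices of replicas needing repair])}."""
    verdicts = {}
    for name in set().union(*manifests):
        counts = Counter(m.get(name) for m in manifests)
        winner = counts.most_common(1)[0][0]
        bad = [i for i, m in enumerate(manifests) if m.get(name) != winner]
        verdicts[name] = (winner, bad)
    return verdicts
```

        Run `build_manifest` on each replica, compare with `vote`, then re-copy the flagged files from a replica holding the winning hash.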

        You can also do the above with cloud vendors if you want to pay the storage fees instead of doing the logistics yourself. If sourcing ethical providers, I was looking at Tarsnap and Backblaze (for their open hardware). Remember that many VM companies sell storage with their VMs. You could use a bunch of $5 VMs scattered around the world on top of your local copies. Then you can use good companies like Prgmr, DreamCompute, SiteGround, and Pair. The component connecting to those and checking them should itself be rock-solid (e.g. OpenBSD or FreeBSD).

        1. 3

          Gold punched tape. I’ve stored my gpg key on aluminum tape and read it back successfully.

            1. 1

              I have an anecdotal report that laser printed paper, run through a hard lamination machine, will reliably remain legible for about 100 years, provided it doesn’t burn. You can of course also laser engrave or plasma cut some sheet metal, which, depending on the depth of engraving and type of metal and storage environment, could probably be no-maintenance reliable for about 10-100x as long.

              This doesn’t help with larger quantities of data, but could be used for something like a QR code. ddevault rightly raises the matter that 100 years from now there might not be QR decoder software handy, but that’s less of an issue, especially if you include some basic informational text about how to reconstruct a decoder.

              1. 1

                I have an anecdotal report that laser printed paper, run through a hard lamination machine, will reliably remain legible for about 100 years, provided it doesn’t burn.

                Laser printers haven’t been around for 100 years.. so what is this anecdote based on?

            2. 4

              I agree with Drew that an intelligent being in the loop is necessary - and dangerous - for all kinds of actually long-term storage.

              Regarding the technical parts:

              For business use, there are rules that can be set in Google Cloud (and probably Azure and AWS as well) to prevent deletion, but I guess if you don’t pay your bills it goes away anyway.

              Personally I’ve been considering getting an M-DISC compatible DVD burner and some M-DISCs for photos, and I’d be happy to know if someone here has looked closer at that option and would care to share their findings.

              1. 4

                it’s a sad commentary on my life, but nothing that I’ve created or obtained needs to be stored forever

                1. 2

                  Really I think it comes down to “what is important to me and why”.

                  I want to store pictures of past events to look back on, as reference to my life now, to see what has improved or regressed.

                  I want to store documents required by law or act as proofs.

                  I want to store media that is appealing for entertainment until I’m dead. This is luckily being offloaded by services like Spotify, Netflix, etc (yes I’m aware of the various downsides they still have).

                  I want to store content I’ve expressed, again as a reference to myself in the past.

                  And that’s about it for me.

                  I gotta get back into understanding QR codes. I think laminated paper digital storage is the only long term solution.

                  1. 1

                    I think more that it’s a sad commentary on society and the stories we tell ourselves that so many people think otherwise.

                  2. 2

                    I’m surprised that tapes weren’t given some serious consideration in this piece. I don’t have direct experience, but they should offer some serious lifespan.

                    1. 2

                      What exactly is the piece of hardware pictured at the end of this article, labeled “In summary: no matter what, definitely don’t do this:”? I don’t recognize it, although if one should never use it, maybe it’s just as well that I don’t recognize it. My wild-ass guess based on the text is that it’s an array of microSD cards, but I expect to be wrong.

                      UPDATE: Hey, I was (mostly) right! A reverse image search reveals it is in fact an SSD constructed using a RAID 0 controller for up to 10 SD cards. https://hackaday.com/2018/02/12/worlds-stupidest-solid-state-disk-drive-hack/

                      1. 1

                        I enjoyed this article. I don’t maintain data commercially, but I have my own personal datasets (photos, code, backups, etc.) that I try to maintain. I hope that the approaches described here (for example, being notified when everything is “OK” so that the lack of notifications may indicate failure) gain more traction, so it’s easier to maintain personal data for >10 years.
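                        The “notified when everything is OK” approach is essentially a dead-man’s switch: silence, not an alert, is the failure signal. A minimal sketch, assuming a hypothetical stamp-file path and a roughly one-day window:

```python
import time
from pathlib import Path

STAMP = Path("/var/backups/last-ok")   # hypothetical location
MAX_AGE = 26 * 3600                    # a day plus slack, in seconds

def record_ok():
    """Called by the backup job only after a fully verified run."""
    STAMP.write_text(str(time.time()))

def is_stale(now=None):
    """True if the last successful run is too old or never happened,
    so a missing 'OK' itself indicates failure."""
    now = time.time() if now is None else now
    try:
        last = float(STAMP.read_text())
    except (FileNotFoundError, ValueError):
        return True
    return now - last > MAX_AGE
```

                        A cron job on a *different* machine would call `is_stale()` and page you when it returns `True`, so the monitor doesn’t die with the backup host.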

                        Borg backup also looks interesting. I should do some more research into that…

                        1. 1

                          I got a LTO4 tape drive on a whim, but I haven’t had any luck figuring out how to find an appropriate SAS controller to actually use it. Makes me sad looking at it sitting on the shelf.

                          PS. For the curious, it seems to be a SFF-8482 connection but I’m lost on what to do with this information.

                          1. 1

                            Oh wow, LTO4 tapes are cheaper than I thought!

                            This IEEE piece on tape storage a few months ago made me consider buying a tape drive for the first time, especially after reading about the surprising spec’d failure rates of both spinning disks and SSDs in @dl’s Deconstruct talk on why filesystems are hard to use correctly.

                            1. 1

                              The tapes may be cheap for the capacity, but the hardware to read and write them is still prohibitively expensive for casual users.

                              1. 1

                                I will look into cheap SAS controllers from China (i.e. eBay).

                          2. 1

                            Forever is too strong a word. The article doesn’t even consider measures against EMP weaponry.

                            1. 4

                              See section titled “Human failures and existential threats”

                            2. 1

                              I have a small(ish) - several hundred megabytes - encrypted and split tar file of my wedding photos up on alt.binaries.boneless. It, and associated par2 files, have been up there for ~5 years with no problems.

                              Based on current Usenet retention I should have at least another year or two of storage before I will want to re-upload it. And on current Usenet prices I’m spending essentially around twenty cents a year for storage (a $2 block account / 10 years). For a globally distributed data store, that’s probably impossible to beat.

                              Usenet backups are absolutely worth a look if you want to store things for a long time. I wouldn’t bet on forever, but.

                              1. 1

                                One fun option for super long term storage is a high density encoding printed to paper: http://ollydbg.de/Paperbak/ Since the data format can be printed in human-readable form as well, any future person who needed to read the data would just need to be able to scan the paper and implement the decoder. With compression you can achieve ~3MB of text per double-sided page, apparently.
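                                Not PaperBak’s actual format, but a toy illustration of the same principle: render bytes as printable base32 lines with a line number and a short CRC per line, so someone typing it back in decades later can spot a damaged line with a decoder they wrote from the printed description:

```python
import base64
import binascii
import textwrap

def encode_for_paper(data, width=60):
    """Render bytes as numbered base32 lines, each ending with a
    16-bit CRC so a mistyped or damaged line can be spotted."""
    text = base64.b32encode(data).decode("ascii")
    lines = []
    for i, chunk in enumerate(textwrap.wrap(text, width)):
        crc = binascii.crc32(chunk.encode()) & 0xFFFF
        lines.append(f"{i:04d} {chunk} {crc:04X}")
    return "\n".join(lines)

def decode_from_paper(printed):
    """Reverse of encode_for_paper; rejects any line whose CRC mismatches."""
    chunks = []
    for line in printed.splitlines():
        idx, chunk, crc = line.split(" ")
        if binascii.crc32(chunk.encode()) & 0xFFFF != int(crc, 16):
            raise ValueError(f"line {idx} is corrupt, re-check it")
        chunks.append(chunk)
    return base64.b32decode("".join(chunks))
```

                                Base32 is a deliberate choice here: its alphabet avoids visually ambiguous characters, which matters if the “scanner” is eventually a human.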

                                1. 1

                                  Thinking about reducing risks for long-term storage with sharded data, there’s another dimension: the more randomly your data is replicated, the higher the chance of loss when any <replication factor> arbitrary nodes die. This can be reduced by restricting the possible replica destinations for groups of shards: https://hackingdistributed.com/2014/02/14/chainsets/
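                                  A toy sketch of that restricted-placement idea (not the exact chainsets construction from the linked post): fix groups of nodes up front and always place a shard’s replicas entirely within one group, so data is lost only when all members of one *specific* group fail, not when any r nodes anywhere in the cluster do:

```python
import hashlib

def make_copysets(nodes, r):
    """Partition nodes into fixed disjoint groups of size r; every
    shard's replicas live entirely inside a single group."""
    return [nodes[i:i + r] for i in range(0, len(nodes) - r + 1, r)]

def place(shard_id, copysets):
    """Deterministically map a shard to one group via a stable hash."""
    h = int(hashlib.sha256(shard_id.encode()).hexdigest(), 16)
    return copysets[h % len(copysets)]
```

                                  With 9 nodes and r = 3, random placement can be killed by many different 3-node failure combinations; with 3 fixed groups, only 3 specific combinations are fatal, at the cost of less flexibility in balancing load.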