1. 15
  1.  

  2. 5

    My experience with my own data is that almost all of it already has its own compressed file format optimised for the kind of data it is: JPEG for photos, MP4 for video, and so on. Adding another layer of compression on top of that is not only a waste of time but often makes the dataset slightly bigger. Text, program binaries, and VM images could be compressed for archival storage, but consider the fact that storage has gotten ridiculously cheap while your time has not. If you really want to archive something, just pick a format that’s been around for decades (gz, bzip2, xz) and call it a day.
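
    If you do go that route, the whole thing fits in a few lines of Python’s standard library. This is only a sketch (the paths are made up); "w:xz" writes the same xz/LZMA stream the command-line tool does, and "w:gz" or "w:bz2" give you the other two formats mentioned above:

    ```python
    import tarfile

    # Pack a directory into a .tar.xz using only the standard library.
    with tarfile.open("projects-2019.tar.xz", "w:xz") as archive:
        archive.add("old-projects/")  # hypothetical directory

    # Listing the contents later is just as short.
    with tarfile.open("projects-2019.tar.xz", "r:xz") as archive:
        archive.list()
    ```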

    1. 4

      “consider the fact that storage has gotten ridiculously cheap while your time has not.”

      This is true for middle-class folks and up, and maybe working-class folks without lots of bills. Anyone below them might be unable to afford extra storage, or need to spend that money on necessities. The 2017 poverty figures put that at 39.7 million Americans. Tricks like the ones in the article might benefit them if they’re stretching out their existing storage assets.

      1. 4

        Consider that a 1TB hard drive costs $40 - $50. That’s about $0.04 per gigabyte. Now say you value your time at $10 an hour, which is roughly $0.17 a minute. Even one minute spent archiving costs more than a gigabyte of extra space, and the space saved is unlikely to be that much. If you don’t have $40 - $50, then of course you can’t buy more space. That doesn’t mean space isn’t cheaper than time; it’s just another example of how it’s expensive to be poor.
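
        The back-of-the-envelope math, spelled out (the $40 drive and $10/hour rate are just the assumptions above):

        ```python
        # Cost of storage vs. cost of time, using the figures above.
        drive_price_usd = 40.0
        drive_capacity_gb = 1000.0
        hourly_rate_usd = 10.0

        cost_per_gb = drive_price_usd / drive_capacity_gb  # $0.04
        cost_per_minute = hourly_rate_usd / 60.0           # ~$0.17

        print(f"1 GB of disk:        ${cost_per_gb:.2f}")
        print(f"1 minute of my time: ${cost_per_minute:.2f}")
        print(f"Minutes of work worth 1 GB: {cost_per_gb / cost_per_minute:.2f}")
        ```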

        1. 1

          One other thing to add to the analysis: one can burn DVDs while doing other things. Each disc only costs the time to put it in, click some buttons, take it out, label it, and store it. That’s under a minute. Just noting this in case anyone is guessing about how much work it is.

          The time vs space cost still supports your point, though. Anyone that can easily drop $40-100 on something is way better off.

      2. 3

        Adding another layer of compression, especially if it’s the same algorithm, often won’t shrink the file size that much. However, it is very convenient to zip up hundreds of files from old projects or freelance work and have a single file to reason about.

        I would not be so cavalier with the archive file format. For me, it is far more important to ensure the reliability and survivability of my data. Zipping up the files is just for convenience.

        1. 4

          That’s why there is tar, which, by itself, doesn’t do any compression.

          1. 1

            I was thinking that tar suffers from the same concatenation issue that affects other solid container formats, but it looks like cpio can be used to extract a tarball while skipping any damaged sections.
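
            Python’s tarfile can do a similar salvage job: `ignore_zeros=True` makes it skip empty and invalid blocks instead of treating them as end-of-archive, so readable members after the damage are still recovered. A rough sketch (the file names are made up):

            ```python
            import tarfile

            # Salvage what we can from a partially damaged, uncompressed tarball.
            # ignore_zeros=True tells tarfile to skip empty/invalid blocks instead
            # of stopping at the first sign of trouble.
            with tarfile.open("damaged-backup.tar", "r:", ignore_zeros=True) as tar:
                for member in tar:
                    try:
                        tar.extract(member, path="recovered")
                    except (tarfile.TarError, OSError) as exc:
                        print(f"skipped {member.name}: {exc}")
            ```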

          2. 1

            A benefit of zipping files together is that it makes transferring the archive between machines or disks much easier and faster. Your computer will crawl when writing out a hundred small files, while one equally sized file goes much faster.

        2. 4

          Great shout on par2. Zip your documents, encrypt them, build some parchives and an NZB and upload them to alt.binaries.boneless.

          You get multiple years of distributed storage, recurring billing, and if you’re already a Usenet user with an unlimited plan it’s a zero-cost solution!
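
          For anyone who hasn’t played with par2 before, the create/verify cycle is short. A sketch shelling out to par2cmdline from Python (the archive name and the 10% redundancy level are just placeholders):

          ```python
          import subprocess

          archive = "documents.zip"  # hypothetical archive, already zipped and encrypted

          # Create PAR2 recovery files with ~10% redundancy, then verify the set.
          # "par2 repair documents.zip.par2" would rebuild damaged data later on.
          subprocess.run(["par2", "create", "-r10", f"{archive}.par2", archive], check=True)
          subprocess.run(["par2", "verify", f"{archive}.par2"], check=True)
          ```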

          1. 1

            NO recurring billing, that should say.

          2. 2

            As I’ve become more active with Z80 homebrew computing, I’ve been going back and looking at old CP/M software. A lot of it is in some custom archival or compression format, so I first have to figure out what software was used, then find it and hope I don’t run into a dead end. It’s always a delight to run across something that was archived with LZMA or ZIP; I can even work with those files on my Linux machine. Kind of anecdotal evidence in support of the article.

            I assume that gzip will be the same way in the future, though you might not want to use it for reasons outlined in the article.

            1. 2

              How many different old archive formats have you run into? Are they easy to find tools for, or to reverse engineer?

              1. 1

                I’ve run into a handful of them. I wasn’t sufficiently interested in most of them to dig much deeper, and a few hours of Googling didn’t turn up the tools themselves.

            2. 1

              I wrote a tool called DACT ( http://dact.rkeene.org/ ) which, while not originally designed for archival compression, may be a good option when combined with tar.

              Some reasons why:

              1. It splits the input file into a bunch of blocks and compresses each one individually, optionally verifying that each compressed block can be decompressed, and possibly using a different compression algorithm per block (though you can pick one). tar uses a fixed output block size, so if you make the tar block size and the DACT block size align, you can trivially recover from many kinds of corruption by ignoring broken DACT blocks (see the sketch after this list).
              2. It has a couple of low-grade checksums (on the compressed and uncompressed data); this could be improved with cryptographic hashes.
              3. Though many of the best compression algorithms use external libraries like zlib and libbz2, there are some okay-ish ones I wrote that are simple enough to do by hand if needed.
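
              To make the block idea concrete, here is a toy block writer in the same spirit. This is not the actual DACT container format (that is documented on the site), just a sketch of per-block compression with a checksum and a verify-on-write pass:

              ```python
              import struct
              import zlib

              BLOCK_SIZE = 64 * 1024  # illustrative; DACT's block size is configurable

              def compress_blocks(src_path, dst_path):
                  """Write independent blocks: [orig_len][comp_len][crc32][compressed data]."""
                  with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
                      while chunk := src.read(BLOCK_SIZE):
                          comp = zlib.compress(chunk)
                          # Verify the block round-trips before committing it to disk,
                          # mirroring the optional verify step described above.
                          assert zlib.decompress(comp) == chunk
                          dst.write(struct.pack(">III", len(chunk), len(comp), zlib.crc32(chunk)))
                          dst.write(comp)
              ```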

              Good luck!

              1. 1

                Thanks for sharing.

                How easy do you think it would be to do recovery on a damaged archive?

                1. 1

                  Overall it shouldn’t be too difficult; the DACT format is described here: http://dact.rkeene.org/fossil/artifact/e942be8628bac375

                  So, it’s a stream of blocks, each of which describes how large it is, and that is error-checkable: if a block doesn’t decompress to the right length, you know something is wrong with it and can start seeking forward in the archive for the next block (which is also error-checkable, so there’s no harm in guessing wrong). In the end, you will know how many blocks you missed in total.
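
                  A rough sketch of that forward scan, using the toy block layout from the sketch further up the thread rather than the real DACT headers linked above:

                  ```python
                  import struct
                  import zlib

                  HEADER = struct.Struct(">III")  # orig_len, comp_len, crc32 -- toy layout, not DACT's

                  def salvage(path):
                      """Scan forward through a damaged block stream, keeping every readable block."""
                      good, damaged_regions, pos = [], 0, 0
                      with open(path, "rb") as f:
                          data = f.read()
                      in_gap = False
                      while pos + HEADER.size <= len(data):
                          orig_len, comp_len, crc = HEADER.unpack_from(data, pos)
                          blob = data[pos + HEADER.size : pos + HEADER.size + comp_len]
                          try:
                              chunk = zlib.decompress(blob)
                          except zlib.error:
                              chunk = None
                          if chunk is not None and len(chunk) == orig_len and zlib.crc32(chunk) == crc:
                              good.append(chunk)
                              pos += HEADER.size + comp_len
                              in_gap = False
                          else:
                              if not in_gap:          # count each damaged region once
                                  damaged_regions += 1
                                  in_gap = True
                              pos += 1                # resynchronise byte by byte
                      return b"".join(good), damaged_regions
                  ```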

              2. 1

                Good write-up. Somewhat corroborating bityard: storage is cheap, and designs that build on that should be considered. For instance, one could have a few hard drives from different manufacturers that store the same data. Software periodically loads, hashes, and compares them, and auto-corrects based on 2 out of 3. Alternatively, it doesn’t auto-correct, and the user manually checks the other copies in the rare event that a file on the main disk is corrupt. I’d say diversify the interfaces, too, based on my past experience with USB drivers corrupting my stuff.
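
                A minimal sketch of that 2-out-of-3 check, assuming the same directory tree is mirrored at three mount points (the paths are made up, and real tooling would also need to handle missing files and do the actual rewrite):

                ```python
                import hashlib
                from collections import Counter
                from pathlib import Path

                # Hypothetical mount points for three drives holding identical copies.
                MIRRORS = [Path("/mnt/disk_a"), Path("/mnt/disk_b"), Path("/mnt/disk_c")]

                def sha256(path):
                    h = hashlib.sha256()
                    with open(path, "rb") as f:
                        for chunk in iter(lambda: f.read(1 << 20), b""):
                            h.update(chunk)
                    return h.hexdigest()

                def check(relative):
                    """Compare one file across the mirrors and flag any 2-of-3 loser."""
                    digests = {m: sha256(m / relative) for m in MIRRORS}
                    majority, votes = Counter(digests.values()).most_common(1)[0]
                    for mirror, digest in digests.items():
                        if votes >= 2 and digest != majority:
                            print(f"{mirror / relative}: disagrees with the other copies")

                check("photos/2017/holiday.jpg")  # example relative path
                ```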

                Also, optical media will likely fail differently from the disks, and DVDs are cheap. So I back up the most critical stuff to both hard drives and DVDs. The DVDs are also good for off-loading non-critical, less-used stuff that takes up lots of hard drive space.

                1. 2

                  When I have a chance to build a new backup server I’m morbidly considering an LTO drive, as bulk older LTO cartridges show up regularly in recycle-PC stores around here.

                  1. 2

                    Are we to the point of writable DVDs being suitable for archival use? I have a handful of old burnt DVDs that have been kept in jewel cases in a dark, low humidity area and some of them have observable pitting.

                    Maybe it was a bad batch, but it made me nervous.

                    1. 2

                      Good point. Mine tend to last a few years. If you’re doing regular backups, you’ll make new ones long before the old ones go bad. If you’re archiving long-term, you’ll need to periodically move things to new media. That’s also a good time to compare the hashes of the HDs against the DVDs.