1. 17
    1. 9

      Balancing simplicity vs. features is quite a challenge - in many ways I like spartan simplicity, but I think pruning is a useful feature. If you want something that supports pruning, access controls, and asymmetric keys, see my similar tool, bupstash:

      https://github.com/andrewchambers/bupstash

      It is very close to 1.0 and has been quite stable for the past year or so.

      1. 2

        this is very cool! i agree that balance is hard. bugs in a backup system are typically discovered when data is lost.

    2. 2

      Similar in concept, but with zero dependencies and designed for confidentiality: https://github.com/richfelker/bakelite

      1. 1

        this is very cool! rss subscribed.

        what source did the inlined crypto dependencies come from?

    3. 1

      Cool! I want to know more about the tarball chunks; as it is, I can’t gauge how hard it is to prune old backups.

      1. 1

        thanks! after a data loss event, i swore: never again.

        backups are meant to be immutable. the hash of the tarball is committed to the index.

        not sure what your pruning use case is, but i’m sure the design could be changed to accommodate it.

        you could edit an existing tarball, deleting content you don’t want to keep, then replace it in object storage and update the index.

        however, it is not a goal of this at present.
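
        for illustration only, a rough sketch of that kind of manual prune in python - it assumes the index is a plain text file referencing tarballs by hash, which may not match the real layout:

        import hashlib
        import tarfile

        def prune_tarball(src_tar, paths_to_drop, index_path):
            # rewrite the tarball without the unwanted paths
            pruned = src_tar + ".pruned"
            with tarfile.open(src_tar) as old, tarfile.open(pruned, "w") as new:
                for member in old.getmembers():
                    if member.name in paths_to_drop:
                        continue
                    data = old.extractfile(member) if member.isfile() else None
                    new.addfile(member, data)
            # the index commits the tarball hash, so swap the old hash for the new one
            old_hash = hashlib.blake2b(open(src_tar, "rb").read()).hexdigest()
            new_hash = hashlib.blake2b(open(pruned, "rb").read()).hexdigest()
            text = open(index_path).read().replace(old_hash, new_hash)
            open(index_path, "w").write(text)
            return pruned

        after that you would upload the pruned tarball, delete the old object, and commit the updated index - again, just a sketch of the idea, not something the tool does today.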

        1. 2

          Think about the space consumed; will it grow without bound? Will restoring a system get slower as more backups are referenced?

          You could follow something like Time Machine defaults, or keep n hourly backups, X daily backups, Y weekly backups, Z monthly backups, and m annual backups. A common name for this is Grandfather-father-son.

          Do you want to hold hourly or daily backups for, say, five years? (I don’t.) Tarsnap is notably faster when working with less data, so I don’t keep daily backups forever.
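
          Just to illustrate the selection logic, a grandfather-father-son rule can be as small as this (hypothetical Python, not tied to any particular tool):

          def gfs_keep(timestamps, daily=7, weekly=4, monthly=12, yearly=5):
              # keep the newest backup per day/week/month/year, up to each limit;
              # anything not returned is a pruning candidate
              keep = set()
              def bucket(key_fn, limit):
                  newest = {}
                  for ts in sorted(timestamps, reverse=True):
                      k = key_fn(ts)
                      if k not in newest and len(newest) < limit:
                          newest[k] = ts
                  keep.update(newest.values())
              bucket(lambda t: t.date(), daily)
              bucket(lambda t: (t.isocalendar()[0], t.isocalendar()[1]), weekly)
              bucket(lambda t: (t.year, t.month), monthly)
              bucket(lambda t: t.year, yearly)
              return sorted(keep)

          # usage: keep = gfs_keep(list_of_backup_datetimes); prune everything not in keep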

          1. 2

            Think about the space consumed; will it grow without bound? Will restoring a system get slower as more backups are referenced?

            yes. storage will grow without bound. one easy strategy would be to periodically start a new backup, and destroy all storage associated with an old backup. this would destroy all history and reset state to the current local state.
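
            destroying an old backup is just deleting its prefix in object storage. a minimal sketch, assuming each backup lives under its own prefix like $bucket/$backup/ (boto3, names hypothetical):

            import boto3

            def destroy_backup(bucket, backup_prefix):
                # delete every object under the backup's prefix, e.g. $backup/git/ and $backup/tar/
                s3 = boto3.client("s3")
                paginator = s3.get_paginator("list_objects_v2")
                for page in paginator.paginate(Bucket=bucket, Prefix=backup_prefix):
                    keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
                    if keys:
                        s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})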

            You could follow something like Time Machine defaults, or keep n hourly backups, X daily backups, Y weekly backups, Z monthly backups, and m annual backups. A common name for this is Grandfather-father-son.

            true, and that’s a good model. what i wanted was something more like git. i don’t expect git to randomly prune old commits or destroy random blobs from history. i expect git to preserve exactly what i committed.

            Do you want to hold hourly or daily backups for, say, five years? (I don’t.) Tarsnap is notably faster when working with less data, so I don’t keep daily backups forever.

            i am not familiar with the internals of tarsnap. here i am creating a single text file containing the metadata of the filesystem. this file is versioned in git. a new backup will do a linear scan over the complete modification history of this file via git log -p index. like the storage used, the history of this file grows without bound, so this linear scan will eventually become annoyingly slow. i assume tarsnap slows down for a similar reason.

            when that happens, i will create a new backup, but not destroy the old one. this new backup will not be able to deduplicate against the old backup, and so it will copy to storage any files in current local state that already exist in the previous backup. some storage is wasted, but the index history is truncated. nothing is lost, since the old backup is still accessible.
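
            for concreteness, that linear scan looks roughly like this in python (a sketch; it assumes the index has one hash-then-path line per file, which may not match the real format):

            import subprocess

            def known_hashes(repo_dir):
                # scan the full history of the index file: git log -p index
                out = subprocess.run(
                    ["git", "log", "-p", "--", "index"],
                    cwd=repo_dir, capture_output=True, text=True, check=True,
                ).stdout
                seen = set()
                for line in out.splitlines():
                    # added lines in each patch look like "+<hash> <path>"
                    if line.startswith("+") and not line.startswith("+++"):
                        fields = line[1:].split(maxsplit=1)
                        if fields:
                            seen.add(fields[0])
                return seen

            # files whose current hash is already known are skipped;
            # only new content goes into the next tarball.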

            my backup index is currently:

            >> wc -l index
            
            101965 index
            

            my backup index revision history is currently:

            >> time git log -p index | wc -l
            
            518572
            
            real    0m12.136s
            user    0m11.745s
            sys     0m0.465s
            

            backup-add currently looks like this:

            >> time backup-add
            
            scanned: 101971 files
            new data size: 0.89 mb
            
            real    1m9.056s
            user    0m57.498s
            sys     0m11.347s
            
            >> df -h
            Filesystem           Size  Used Avail Use% Mounted on
            zroot3/data/home     194G   74G  121G  38% /home
            

            once scanning index history is slower than blake2b scanning the filesystem, i will probably start a new backup.
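
            for reference, blake2b scanning the filesystem is roughly this (a sketch; the real scan presumably also collects the other metadata stored in the index):

            import hashlib
            import os

            def scan(root):
                # walk the filesystem and blake2b-hash every regular file
                for dirpath, _dirs, files in os.walk(root):
                    for name in files:
                        path = os.path.join(dirpath, name)
                        h = hashlib.blake2b()
                        try:
                            with open(path, "rb") as f:
                                for chunk in iter(lambda: f.read(1 << 20), b""):
                                    h.update(chunk)
                        except OSError:
                            continue  # skip unreadable or vanished files
                        yield h.hexdigest(), path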

            remote storage for the 96 backups i’ve made in the current backup looks like this:

            >> aws s3 ls $bucket/$backup/git/ | py 'sum([int(x.split()[2]) for x in i.splitlines()])' | humanize
            68 MB
            
            >> aws s3 ls $bucket/$backup/tar/ | py 'sum([int(x.split()[2]) for x in i.splitlines()])' | humanize
            4.3 GB
            
            >> aws s3 ls $bucket/$backup/tar/ | wc -l
            96
            
            
    4. 1

      This smells similar to bup, but a bit more custom?

      1. 1

        there are a lot of good backup solutions. while doing prior art research, i disliked the broad scope and complex implementations of existing solutions.

        i wanted total confidence in my ability to reason about the data structures and machinery of backup.

        as long as one has confidence that their backups exist and are recoverable, any solution is likely fine.