1. 42

  2. 11

    I wonder how this would compare with zfs send and encrypted snapshots.

    1. 6

      I’d also like to see zstd swapped in for gzip.

      1. 2

        I’m not convinced the performance tradeoffs make ZFS’ zstd implementation a worthy replacement for lz4’s good balance of speed and compression ratio, FWIW. It is definitely a replacement for gzip though.
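
        If you want to try the comparison yourself, something like this works (the pool and dataset names are made up, and zstd needs OpenZFS 2.0 or newer):

            # one dataset per algorithm, then copy the same test data into each
            zfs create -o compression=lz4    tank/bench-lz4
            zfs create -o compression=zstd-3 tank/bench-zstd
            zfs create -o compression=gzip-6 tank/bench-gzip

            # compare the achieved ratios afterwards
            zfs get compressratio tank/bench-lz4 tank/bench-zstd tank/bench-gzip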

      2. 2

        My guess is using raw sends would be better in almost all categories except maybe compression.
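
        For context, a raw send ships the blocks exactly as they are stored on disk (already compressed and encrypted), so nothing is decrypted or recompressed in transit. A rough sketch, with made-up pool and host names:

            # full raw send of an encrypted dataset
            zfs send -w tank/data@monday | ssh backuphost zfs receive backuppool/data

            # later incremental raw send
            zfs send -w -i tank/data@monday tank/data@tuesday | ssh backuphost zfs receive backuppool/data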

        1. 6

          This reminds me of an interview question I heard about a while back.

          Is it better to encrypt then compress or compress then encrypt?

          The incorrect “correct” answer is to compress then encrypt: an encrypted blob should be relatively indistinguishable from random noise, so compressing an encrypted blob will do very little. So you should always compress then encrypt.

          The correct response is that compressing then encrypting can leak information about the data being encrypted. CRIME and BREACH are two obvious examples of this, but in the case of a compressed and encrypted audio stream you can actually use the variation in packet size to approximate what is being said.

          So the answer is, as always, it depends, but generally for security just encrypt it and send it. Compressing it is largely useless, or potentially harmful.
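
          To make the length leak concrete, here is a toy example you can run in a shell (the secret and guess strings are made up, and the exact byte counts vary by gzip version, but the matching guess should come out a few bytes shorter, and that length difference survives encryption):

              # compress a secret together with two attacker-chosen guesses
              printf 'secret=hunter2;guess=hunter2' | gzip -c | wc -c   # guess matches the secret: shorter output
              printf 'secret=hunter2;guess=qwerty9' | gzip -c | wc -c   # guess misses: longer output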

          1. 3

            I think that’s more for client-server protocols, though? In practice, the threat model for ZFS encryption is someone with physical access to your disk or someone who can intercept the send stream. The manpage warns about combining encryption and compression (as well as encryption and deduplication, but that one is obvious to me: dedup weakens the encryption in a really obvious way), but I’m struggling to see how someone could mount a CRIME attack against the same disk. Maybe multitenant environments?

            Since compression is applied before encryption datasets may be vulnerable to a CRIME-like attack if applications accessing the data allow for it. Deduplication with encryption will leak information about which blocks are equivalent in a dataset and will incur an extra CPU cost per block written.

            1. 4

              My understanding is that a lot of these attacks depend on the attacker being able to induce the client to encrypt attacker-controlled data and also being able to see the encrypted result. This seems extremely difficult (though not totally impossible) with tools such as bupstash, because backups are largely one-way, offline transmissions, and any interaction often happens at some point far in the future.

        2. 2

          I also experimented with btrfs send into a bupstash repository, which seems to work quite well, though it is not totally incremental like what you suggest. The bupstash send log at least cuts down on network traffic, though in that case it still requires reading the whole snapshot from disk.

          https://bupstash.io/doc/guides/Filesystem%20Backups.html#Btrfs-send-snapshots.

        3. 11

          One thing I’d like to see evaluated is the performance of restoring a single file from a backup – when your backups are large enough that restoring a full one is a multi-hour (or day) process, doing that just to extract one small file is frustratingly inefficient. Some backup systems offer a fast way of doing that, others (I believe it was duplicity that I’m having painful memories of here) do not.

          1. 10

            I have this feature in bupstash with ‘list-contents’ and ‘get --pick’; I added it since I remember how annoying it was to restore a single ssh key from a 200GB snapshot. I will try to add that as a benchmark.
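
            Roughly, it looks like this (the id and path are placeholders, and I’m writing the query from memory, so check the man pages for the exact syntax):

                # list the files inside a snapshot without restoring it
                bupstash list-contents id=<snapshot-id>

                # pull just one file out of the snapshot, writing it to stdout
                bupstash get --pick home/user/.ssh/id_ed25519 id=<snapshot-id> > id_ed25519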

          2. 3

            I always come back to an encrypted volume and rsync.

            # -a archive, -x stay on one filesystem, -H/-A/-X preserve hardlinks, ACLs and xattrs
            # --link-dest hardlinks files that are unchanged since the previous snapshot $OLD
            /usr/local/bin/rsync -axHAX --partial --info=progress2 \
                --link-dest=../$OLD/ . /Volumes/backupvolA/$NEW/
            

            The nice thing about this setup is that all the files are sitting bare on a volume, with each snapshot standing on its own as a collection of hardlinks, and it works locally or over ssh.

            Because it relies on rsync to do the delta, rsync has to enumerate both the source and the destination file systems and generate a large number of hardlinks for each new destination snapshot.

            Would it be possible to add this to the comparison?

            1. 1

              I don’t really count this as the same thing because the server side needs access to the decryption key for this to work. I should perhaps have clarified that further.

              1. 1

                Not necessarily – something like encfs/ecryptfs or a loop-mounted LUKS volume with its backend storage in a remote filesystem could keep the encryption local (this is actually what I do for my own backups using rsnapshot).
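
                For the LUKS variant it goes roughly like this (paths and size are made up; older cryptsetup versions may need an explicit losetup step first):

                    # create and format a container file on the remotely-mounted filesystem
                    truncate -s 100G /mnt/remote/backup.img
                    cryptsetup luksFormat /mnt/remote/backup.img
                    cryptsetup open /mnt/remote/backup.img backupvol   # key handling stays local
                    mkfs.ext4 /dev/mapper/backupvol
                    mount /dev/mapper/backupvol /mnt/backup
                    # ...then point rsnapshot/rsync at /mnt/backup as usual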

                1. 1

                  That’s a pretty good idea; I think a remote loop file should perform well.

            2. 2

              Could it be used on Windows?

              1. 4

                I haven’t tested bupstash on Windows yet, but it’s something I plan to make work. I suspect it might need some fixes first.

                1. 1

                  It also means you have to implement VSS support in bupstash, because backups on Windows without supporting the VSS features won’t make any sense.

              2. 2

                I’m currently using restic to back up to Backblaze B2. I’m actually pretty keen to switch to bupstash given it seems quite a bit faster. Are there any plans to support object stores like B2 or S3, or is the requirement for a server process likely to remain an obstacle there?

                1. 3

                  Bupstash supports external storage via a plugin interface; this is evolving, but it is definitely something that is in the works.

                  1. 1

                    Excellent. I’ll continue to keep an eye on the progress.

                2. 2

                  Why didn’t you benchmark duplicity? It is probably a lot slower, but it would be interesting to see how much slower.

                  Also, you said you want to push the performance further. I would recommend improving the software where it is most needed. Your software is already best in class performance-wise, and users were probably fine with the performance of the other solutions beforehand. It feels great to improve performance as it is measured so easily and clearly.

                  1. 2

                    Why didn’t you benchmark duplicity?

                    Only because I have never used it before. It might be worth trying for a part 2 post.

                    I would recommend improving the software where it is most needed.

                    Bupstash makes other improvements in areas I found lacking, mainly around access controls and offline decryption keys. I do have plans for other improvements too; it’s just that this post was focused on performance. Part of the motivation of this post was to attract new users for feedback, a chicken-and-egg problem.

                    1. 1

                      Thanks for the fast response. The improvements bupstash brings to the table are great, and because of that I’m just about to make my first backup with it alongside duplicity. I was just saying don’t focus on performance too much :D

                      Thank you for your effort!

                  2. 2

                    Have you considered comparing this against other compression methods that can use processors in parallel?

                    For example, tar can use pigz or other compression tools that can saturate all of the CPUs on your system.
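
                    Something like this (the paths are made up; -I is GNU tar’s shorthand for --use-compress-program):

                        # let tar drive pigz, which compresses on all cores
                        tar -I pigz -cf backup.tar.gz /data

                        # or pipe explicitly and pin the worker count
                        tar -cf - /data | pigz -p "$(nproc)" > backup.tar.gz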

                    1. 1

                      I had not, I will definitely consider adding it if there is a part 2 post.

                    2. 1

                      Is there an option to verify the files’ integrity at the source and in the backup during the process? The troubling thing about backup is that it is automated. But what can guarantee that the video I last accessed 8 years ago is still a valid file?
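
                      Not specific to any of these tools, but one low-tech way to get that guarantee (the paths are made up) is to keep a checksum manifest next to the data and re-verify it against both the source and a test restore:

                          # record checksums of everything under /data
                          find /data -type f -print0 | xargs -0 sha256sum > data.sha256

                          # later: verify the live copy (or a restored copy, run from the matching root)
                          sha256sum -c data.sha256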

                      1. 1

                        The single biggest issue I have with borg is the lack of support for using S3 buckets as a backup target. If bupstash supported this use case, I would happily switch to it regardless of how it performs against borg.

                        1. 2

                          This is something I want to support, though it still requires a server-side ‘gateway’ process and repository; the bulk of the data is in external storage.