1. 25

Hi guys, I’m writing a backup tool with deduplication and public key encryption/decryption. This blog has some progress reports that I’d like to share as things come out.

(I wanted the announce tag, but apparently it implies a release, which isn’t the case.)

  1.  

  2. 4

    I don’t think anything you want implies public key cryptography. And that actually opens up a new vulnerability to quantum computers when they arrive. You can do it all with symmetric encryption and hashing, neither of which is vulnerable to quantum attacks.

    This is the technique we use in Peergos.

    Initial key derivation: https://peergos.github.io/book/security/login.html

    and subsequent access control with cryptree: https://peergos.github.io/book/security/cryptree.html

    cryptree paper: https://github.com/Peergos/Peergos/raw/master/papers/wuala-cryptree.pdf
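For readers who want the flavor of the symmetric-only approach, here is a minimal sketch of deriving a root key from a password and then subordinate per-purpose keys from it. The KDF parameters and labels are illustrative assumptions, not Peergos’s actual ones (see the login link above for those).

```python
import hashlib, hmac

def root_key(username: str, password: str) -> bytes:
    # Password-hardened root key; salting with the username means two
    # users with the same password get different keys. Parameters are
    # illustrative, not the ones Peergos actually uses.
    return hashlib.scrypt(password.encode(), salt=username.encode(),
                          n=2**14, r=8, p=1, dklen=32)

def subkey(root: bytes, purpose: bytes) -> bytes:
    # Derive independent subordinate keys (per device, per purpose)
    # from the root with a keyed hash.
    return hmac.new(root, purpose, hashlib.sha256).digest()

rk = root_key("alice", "correct horse battery staple")
enc_key = subkey(rk, b"encryption")
mac_key = subkey(rk, b"authentication")
assert enc_key != mac_key and len(enc_key) == 32
```

Nothing asymmetric is needed so far, and neither scrypt nor HMAC-SHA256 is weakened by known quantum algorithms beyond a Grover-style square-root speedup.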

    1. 1

      I had thought previously about the idea of a master symmetric key and subordinate device keys. It is worth thinking about, thanks for the links.

    2. 3

      You might want to take a look at https://burp.grke.org/ too, which has worked really well for me. The only issues, as always, are:

      • Synchronizing large batches of backups (so I don’t end up with 100 machines backing up at 1am and 0 at 5am, with every 1am backup fighting the others for bandwidth to the backup server…)
      • Having metrics about my backups (fetching the size, time spent, number of files, … so I can monitor them and spot any huge differences I should be aware of)
      • Cloud-native operation (upload/stream to S3-compatible storage)
      1. 1

        Burp has transport encryption and client-side encryption, but IIRC not asymmetric encryption.

      2. 3

        This sounds quite cool. About a year ago I also investigated existing backup solutions. Are you aware of Ugarit? It’s based on the concept of a content-addressable store, like Git, which means you get deduplication of data blocks for free. It uses optional pluggable encryption and has a pluggable backend (I contributed an S3 backend for it which I’m currently using on three machines as my only backup solution).

        The encryption used by Ugarit is symmetric, so I think it doesn’t qualify as having “secure keys”. Since its encryption is pluggable, there might be a way to implement what you’re looking for, but I haven’t studied the implementation closely enough to tell.

        It implements reference counting, so if you drop a snapshot it will decrease all the block counters by one, and if they hit zero the block will be deleted. Unfortunately there’s currently no easy command to automatically prune backups older than a certain age. I have been meaning to look into that, see if I can contribute to it, but haven’t got around to that yet.
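        For illustration, the reference-counting scheme described above can be sketched as a toy content-addressed store (a sketch of the general idea, not Ugarit’s actual implementation):

        ```python
        class BlockStore:
            """Toy content-addressed store with reference counting."""
            def __init__(self):
                self.blocks = {}   # address -> data
                self.refs = {}     # address -> reference count

            def put(self, addr, data):
                if addr in self.blocks:
                    self.refs[addr] += 1   # deduplicated: just bump the counter
                else:
                    self.blocks[addr] = data
                    self.refs[addr] = 1

            def drop(self, addr):
                # Dropping a snapshot decrements each of its blocks;
                # a block is deleted only when nothing references it.
                self.refs[addr] -= 1
                if self.refs[addr] == 0:
                    del self.blocks[addr]
                    del self.refs[addr]
        ```

        Pruning snapshots older than some age would then just be a loop calling drop on every block of each expired snapshot.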

        I don’t quite understand what you mean by the “write only” requirement. Could you elaborate on that, with examples perhaps?

        So for Ugarit: dedup: yes, write-only: ??, encryption: yes, secure keys: no (for now?), pruning: yes, open source: yes.

        1. 2

          I will look into it, thank you. It seems so many developers just stop at symmetric keys and I have no idea why. All my prototypes so far are garbage collected and not reference counted :) Another point is that I don’t want to use the same content addressing that is typically used, because that leaks a hash of the data, which lets the server see if you might have a specific file. I will need to see how Ugarit chooses a hash function to be sure.

          I mean two things by “write only”, though maybe the terminology should be changed. One is the same way sending an email is “write only”: the server accepts what you send, but doesn’t have a deletion method until you log in as a recipient and specifically delete it. The other is that the key used to encrypt the data is forgotten, so even if you could read the data again, you would not be able to decrypt it.

          1. 1

            Another point is that I don’t want to use the same content addressing that is typically used, because that leaks a hash of the data. I will need to see how Ugarit chooses a hash function to be sure.

            This is a good feature to add to your requirements list. Ugarit allows a user-configurable “salt” which is used to perturb the hash, which means you choose what you want to leak. The thing that annoys me a little bit (but not enough to fix it) is that Ugarit stores the names of the snapshot tags in the clear on the backend. This means you need to be a bit careful in how you identify your backups.
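            The salted-addressing idea can be sketched with a keyed hash. Here the “salt” is modeled as an HMAC key, which is an assumption for illustration rather than Ugarit’s exact construction:

            ```python
            import hashlib, hmac, secrets

            data = b"some well-known file contents"

            # Plain content addressing: anyone who has the file can compute
            # its address and ask the server whether you stored it.
            plain_addr = hashlib.sha256(data).hexdigest()

            # Salted/keyed addressing: without the secret salt, the server
            # cannot link a stored block to a known plaintext, but dedup
            # still works for everyone who shares the salt.
            salt = secrets.token_bytes(32)
            keyed_addr = hmac.new(salt, data, hashlib.sha256).hexdigest()

            assert keyed_addr != plain_addr
            # Deterministic for the same salt, so deduplication is preserved:
            assert keyed_addr == hmac.new(salt, data, hashlib.sha256).hexdigest()
            ```

            The trade-off is exactly “you choose what you want to leak”: a shared salt restores cross-client dedup at the cost of linkability within that group.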

            One is the same way sending an email is ‘write only’, the server accepts what you send, but doesn’t have a deletion method

            Isn’t that in conflict with the “prune” requirement? It seems to me that means you must be able to delete data blocks, by definition. Also, as far as I can tell, every “dumb” backend would allow one to simply delete the underlying blocks using the S3/SFTP/POSIX API directly.

            the other is that the key used to encrypt the data is forgotten, so even if you could read the data again, you would not be able to decrypt it.

            That would be the case for Ugarit too; only the client has the key. I suppose that should be the case for every backup solution which supports encryption, no?

            1. 4

              Isn’t that in conflict with the “prune” requirement? It seems to me that means you must be able to delete data blocks, by definition. Also, as far as I can tell, every “dumb” backend would allow one to simply delete the underlying blocks using the S3/SFTP/POSIX API directly.

              There are three parties involved: the backup writer, the storage server, and the backup admin. The idea is that the backup writer can only add backups. The storage server exposes a very simple API to the backup writer, while the backup admin can do pruning and deletion. The main idea is that in an attack scenario, the backup writer itself can no longer be trusted.

              That would be the case for Ugarit too; only the client has the key. I suppose that should be the case for every backup solution which supports encryption, no?

              No, many use symmetric keys, which means the key used to encrypt the data can also decrypt it. My prototype tool generates an ephemeral key per stream and then forgets it.

              1. 1

                The storage server would have a very simple api for the backup writer, the backup admin can do pruning and deletion. The main idea being that in an attack scenario, the backup writer itself can no longer be trusted.

                I see. This is highly dependent on the backend, though, and AFAICT it requires some form of a “smart” backup host. With this requirement you can’t use a thumb drive or SFTP for your backups, for example. With Ugarit, this might be possible with the S3 backend, if you use Amazon’s bucket versioning mechanism (which is append-only). I haven’t looked into that in detail yet.

                Given that this requires some sort of integration with the backend, I’d say for now that Ugarit does not support this (or at least, I’m not aware of it).

                No, many use symmetric keys, which means the key used to encrypt the data can also decrypt it. My prototype tool generates an ephemeral key per stream and then forgets it.

                That sounds pretty cool. How does one recover the necessary key for a given block (which might’ve been backed up in an earlier session) in order to restore a backup?

                1. 2

                  it requires some form of a “smart” backup host.

                  My main intention is for it to go via ssh: backup-client | ssh backup-server backup-recv, where backup-recv sends data into S3 or whatever you want. If you just run backup-recv on the same machine, it will be effectively the same as sending directly to S3.

                  How does one recover the necessary key for a given block (which might’ve been backed up in an earlier session) in order to restore a backup?

                  Only the private portion of the key is discarded. The recipient’s (backup admin’s) private key can be combined with the discarded key’s public portion to regenerate the decryption key.
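                  The key flow described here is essentially ephemeral Diffie-Hellman. The sketch below uses a deliberately tiny modular group and hypothetical names purely to show how discarding the ephemeral private key still lets the admin recover the stream key; a real tool would use X25519 or another vetted construction.

                  ```python
                  import hashlib, secrets

                  # Toy DH group, far too small for real use.
                  P = 2**127 - 1   # a Mersenne prime
                  G = 3

                  # Long-lived recipient (backup admin) keypair:
                  admin_priv = secrets.randbelow(P - 2) + 1
                  admin_pub = pow(G, admin_priv, P)

                  def backup_stream_key(admin_pub):
                      """Writer side: make an ephemeral keypair, derive a
                      stream key, then forget the ephemeral private half."""
                      eph_priv = secrets.randbelow(P - 2) + 1
                      eph_pub = pow(G, eph_priv, P)
                      shared = pow(admin_pub, eph_priv, P)
                      key = hashlib.sha256(shared.to_bytes(16, "big")).digest()
                      # eph_priv goes out of scope here and is never stored;
                      # only eph_pub is kept alongside the encrypted stream.
                      return key, eph_pub

                  def recover_stream_key(eph_pub, admin_priv):
                      """Admin side: combine the stored public half with the
                      admin private key to regenerate the same stream key."""
                      shared = pow(eph_pub, admin_priv, P)
                      return hashlib.sha256(shared.to_bytes(16, "big")).digest()

                  key, eph_pub = backup_stream_key(admin_pub)
                  assert key == recover_stream_key(eph_pub, admin_priv)
                  ```

                  After backup_stream_key returns, a compromised writer holds only admin_pub and eph_pub, neither of which can reproduce the stream key.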

                  edit: ssh allows forced commands with ssh keys, which is important for access control.

            2. 1

              I will look into it, thank you. It seems so many developers just stop at symmetric keys and I have no idea why.

              I’ll speculate on symmetric algorithms. You can use them for authenticated encryption and integrity checks. They’re many times faster than asymmetric algorithms. They have hardware support more often. Quantum computers can’t crack them with existing algorithms. The symmetric ciphers use operations more familiar to programmers than things like elliptic curves. They favor stuff they’re less likely to screw up. They can even do key exchange if they can share a secret ahead of time or just on a channel unlikely to be intercepted by those they’re really worried about.

              All that reduces down to this: asymmetric ciphers are mainly needed for key exchange when there’s nothing pre-shared or trusted for the initial exchange. There are libraries for that last part where the hard work is already done. A few have probably been updated for the latest, esoteric attacks. They go with them. Sometimes it’s an external tool like GPG moving an initial secret that’s manually fed into the symmetric system. I used to do that.

              Edit: Added quantum immunity. Thanks for reminder, ianopolous.

          2. 3

            No one has mentioned zfs send --raw with encrypted volumes yet? You get snapshots, zero knowledge storage, compression, dedup (with some cryptographic security tradeoffs), data integrity guarantees, …

            1. 1

              Maybe a good option on systems with good zfs support. Does zfs send have append-only or write-only access controls?

              1. 4

                In zfsonlinux 0.8, you can send an encrypted dataset without unlocking it, which means that the person on the other end doesn’t know what they’re receiving and can’t read it. That covers the “write only” bit; “append only” should be covered by snapshots, which are immutable.

            2. 5

              I’ve recently been investigating deduplicating backup software for my personal use, and ended up deciding on borg, but one major annoyance of borg is that it cannot easily use a cloud storage API (like aws s3, wasabi, backblaze, google cloud, etc.) as a backing store. Some other backup tools do allow this (restic for instance), but lack other features I care about.

              I also note that you didn’t mention “compression” as a desideradum. As long as you care about deduplicating data to save backup storage space, however, you probably also care about compressing that same stored data as much as possible.

              Also, looking at your asmcrypt repo, I have to ask: why are you writing new software in C in 2018? Especially cryptography software?

              1. 4

                As another note, I’m currently working on a product that may be close to what you would like: https://backupbox.io/ (And there is no C in that code base; it’s all Go currently.) Supporting borg is a specific goal; unfortunately it isn’t quite ready for public use.

                1. 3

                  Just FYI, there’s already a project called Box Backup.

                  1. 1

                    thanks for letting me know.

                2. 4

                  If you haven’t looked at https://www.tarsnap.com/ you may find it interesting. The backend is closed source, which I realize may be a problem, but the frontend is open source and free for perusal.

                  1. 4

                    “The Tarsnap client source code is available so that you can see exactly how your data is protected; however, it is not distributed under an Open Source license.” - https://www.tarsnap.com/open-source.html

                  2. 2

                    https://www.google.com/search?q=desideradum

                    Please note your typo, otherwise thanks for the new word.

                    1. 1

                      Bah I thought the word didn’t feel right when I typed it!

                    2. 2

                      edit: Added note to post. Thank you.

                      With regard to compression, this will come in a post about deduplication.

                    3. 2

                      As a happy Borg user I’ll be closely watching this; the passphrase-in-the-clear thing has always niggled at me.

                      1. 1

                        Passphrase-in-the-clear?

                        1. 1

                          Yes, the archive passphrase has to be stored somewhere (inside your backup script etc).

                          https://borgbackup.readthedocs.io/en/stable/quickstart.html#automating-backups

                          I don’t use pass.

                        2. 1

                          Thank you. Progress will be slow, as it’s difficult to prioritize OSS, but hopefully it will be steady. I have lots of existing code and prototypes to pull ideas and code from.

                        3. 1

                          I read the “Work begins” post but couldn’t find much info on how it’s going to be implemented. Data deduplication along with client-side encryption is hard and I’m curious how this will be solved.

                          1. 3

                            Sure, I will write up and post on lobste.rs, it might be generally interesting for people even if the project doesn’t complete soon.

                              1. 1

                                Wow none of those are what I had in mind (I think). Interesting…

                            1. 1

                              Hopefully this explains some of it… https://packnback.github.io/blog/dedup_and_encryption/