1. 27

Hi guys, I’m writing a backup tool with deduplication and public key encryption/decryption. This blog has some progress reports that I’d like to share as things come out.

(I wanted the announce tag, but apparently it implies a release, which isn’t the case here.)

  1.  

  2. 4

    I don’t think anything you want implies public key cryptography. And that actually opens up a new vulnerability to quantum computers when they arrive. You can do it all with symmetric encryption and hashing, neither of which is vulnerable to quantum attacks.

    This is the technique we use in Peergos.

    Initial key derivation: https://peergos.github.io/book/security/login.html

    and subsequent access control with cryptree: https://peergos.github.io/book/security/cryptree.html

    cryptree paper: https://github.com/Peergos/Peergos/raw/master/papers/wuala-cryptree.pdf

    1. 1

      I had thought previously about the idea of a master symmetric key and subordinate device keys. It is worth thinking about, thanks for the links.
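
      Something like a keyed one-way derivation is what I had in mind; a rough sketch (the names are hypothetical, not taken from Peergos):

      ```python
      import hmac, hashlib, os

      def derive_device_key(master_key: bytes, device_id: str) -> bytes:
          # One-way derivation: leaking a device key does not reveal the master key,
          # and the master key can re-derive any device key on demand.
          return hmac.new(master_key, b"device:" + device_id.encode(), hashlib.sha256).digest()

      master = os.urandom(32)  # kept offline / with the backup admin
      laptop_key = derive_device_key(master, "laptop")
      ```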

    2. 3

      You might want to take a look at https://burp.grke.org/ too, which has worked really well for me. The only issues, as always, are:

      • Synchronizing large batches of backups (so I don’t back up 100 machines at 1am and 0 at 5am, with every backup fighting for bandwidth on the backup server at 1am…)
      • Having metrics about my backups (fetching the size, time spent, number of files, … so I can monitor them and notice if there are big differences I should be aware of)
      • Cloud native (upload/stream to S3-compatible storage)
      1. 1

        Burp has transport encryption and client-side encryption, but IIRC not asymmetric encryption.

      2. 3

        This sounds quite cool. About a year ago I also investigated existing backup solutions. Are you aware of Ugarit? It’s based on the concept of a content-addressable store, like Git, which means you get deduplication of data blocks for free. It uses optional pluggable encryption and has a pluggable backend (I contributed an S3 backend for it which I’m currently using on three machines as my only backup solution).

        The encryption used by Ugarit is symmetric, so I think it doesn’t qualify as having “secure keys”. Since its encryption is pluggable, there might be a way to implement what you’re looking for, but I haven’t studied the implementation closely enough to tell.

        It implements reference counting, so if you drop a snapshot it will decrease all the block counters by one, and if they hit zero the block will be deleted. Unfortunately there’s currently no easy command to automatically prune backups older than a certain age. I have been meaning to look into that, see if I can contribute to it, but haven’t got around to that yet.
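
        The bookkeeping is roughly like this (a toy sketch of the general idea, not Ugarit’s actual code):

        ```python
        class BlockStore:
            def __init__(self):
                self.blocks = {}     # block_id -> stored block
                self.refcounts = {}  # block_id -> number of snapshots referencing it

            def add(self, block_id, data):
                # Deduplication: a block that already exists just gains a reference.
                if block_id in self.refcounts:
                    self.refcounts[block_id] += 1
                else:
                    self.blocks[block_id] = data
                    self.refcounts[block_id] = 1

            def drop(self, block_id):
                # Dropping a snapshot decrements each of its blocks; at zero the block goes away.
                self.refcounts[block_id] -= 1
                if self.refcounts[block_id] == 0:
                    del self.blocks[block_id]
                    del self.refcounts[block_id]
        ```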

        I don’t quite understand what you mean by the “write only” requirement. Could you elaborate on that, with examples perhaps?

        So for Ugarit: dedup: yes, write-only: ??, encryption: yes, secure keys: no (for now?), pruning: yes, open source: yes.

        1. 2

          I will look into it, thank you. It seems so many developers just stop at symmetric keys and I have no idea why. All my prototypes so far are garbage collected rather than reference counted :) Another point is that I don’t want to use the same content addressing that is typically used, because that leaks a hash of the data, which lets the server see if you might have a specific file. I will need to see how Ugarit chooses a hash function to be sure.

          I mean two things by write only, though maybe the terminology should be changed. One is the same way sending an email is ‘write only’: the server accepts what you send, but doesn’t have a deletion method until you log in as a recipient and specifically delete it. The other is that the key used to encrypt the data is forgotten, so even if you could read the data again, you would not be able to decrypt it.

          1. 1

            Another point is that I don’t want to use the same content addressing that is typically used, because that leaks a hash of the data. I will need to see how Ugarit chooses a hash function to be sure.

            This is a good feature to add to your requirements list. Ugarit allows a user-configurable “salt” which is used to perturb the hash, which means you choose what you want to leak. The thing that annoys me a little bit (but not enough to fix it) is that Ugarit stores the names of the snapshot tags in the clear on the backend. This means you need to be a bit careful in how you identify your backups.
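
            I haven’t checked Ugarit’s exact construction, but the general idea of a salt-perturbed block address is roughly:

            ```python
            import hmac, hashlib

            def block_id(salt: bytes, block: bytes) -> str:
                # Keyed address: identical blocks still deduplicate against each other,
                # but without the salt the server cannot test for the hash of a known file.
                return hmac.new(salt, block, hashlib.sha256).hexdigest()
            ```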

            One is the same way sending an email is ‘write only’, the server accepts what you send, but doesn’t have a deletion method

            Isn’t that in conflict with the “prune” requirement? It seems to me that means you must be able to delete data blocks, by definition. Also, as far as I can tell, every “dumb” backend would allow one to simply delete the underlying blocks using the S3/SFTP/POSIX API directly.

            the other is that the key used to encrypt the data is forgotten, so even if you could read the data again, you would not be able to decrypt it.

            That would be the case for Ugarit too; only the client has the key. I suppose that should be the case for every backup solution which supports encryption, no?

            1. 4

              Isn’t that in conflict with the “prune” requirement? It seems to me that means you must be able to delete data blocks, by definition. Also, as far as I can tell, every “dumb” backend would allow one to simply delete the underlying blocks using the S3/SFTP/POSIX API directly.

              There are three parties involved: the backup writer, the storage server and the backup admin. The idea is that the backup writer can only add backups. The storage server exposes a very simple API to the backup writer, while the backup admin can do pruning and deletion. The main idea is that in an attack scenario, the backup writer itself can no longer be trusted.

              That would be the case for Ugarit too; only the client has the key. I suppose that should be the case for every backup solution which supports encryption, no?

              No, many use symmetric keys which means the key used to encrypt the data can also decrypt the data. My prototype tool generates an ephemeral key per stream then forgets it.

              1. 1

                The storage server exposes a very simple API to the backup writer, while the backup admin can do pruning and deletion. The main idea is that in an attack scenario, the backup writer itself can no longer be trusted.

                I see. This is highly dependent on the backend, though, and AFAICT it requires some form of a “smart” backup host. With this requirement you can’t use a thumb drive or SFTP for your backups, for example. With Ugarit, this might be possible with the S3 backend, if you use Amazon’s bucket versioning mechanism (which is append-only). I haven’t looked into that in detail yet.

                Given that this requires some sort of integration with the backend, I’d say for now that Ugarit does not support this (or at least, I’m not aware of it).

                No, many use symmetric keys which means the key used to encrypt the data can also decrypt the data. My prototype tool generates an ephemeral key per stream then forgets it.

                That sounds pretty cool. How does one recover the necessary key for a given block (which might’ve been backed up in an earlier session) in order to restore a backup?

                1. 2

                  it requires some form of a “smart” backup host.

                  My main intention is for it to go via ssh: backup-client | ssh backup-server backup-recv, where backup-recv sends data into S3 or whatever you want. If you just run backup-recv on the same machine, then it is effectively the same as sending directly to S3.

                  How does one recover the necessary key for a given block (which might’ve been backed up in an earlier session) in order to restore a backup?

                  Only the private portion of the key is discarded. The recipient’s (backup admin’s) private key can be combined with the discarded key’s public portion to regenerate the decryption key.
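
                  In other words, roughly the crypto_box pattern with an ephemeral sender key; a sketch with PyNaCl, not the exact code:

                  ```python
                  from nacl.public import PrivateKey, PublicKey, Box

                  def encrypt_stream(admin_pub: PublicKey, data: bytes):
                      eph = PrivateKey.generate()                 # fresh key pair per stream
                      boxed = Box(eph, admin_pub).encrypt(data)   # random nonce is prepended
                      eph_pub = eph.public_key.encode()
                      del eph                                     # forget the private half
                      return eph_pub, boxed                       # the writer can no longer decrypt this

                  def decrypt_stream(admin_priv: PrivateKey, eph_pub: bytes, boxed: bytes) -> bytes:
                      # The admin's private key plus the stored ephemeral public key
                      # regenerate the same shared key used for encryption.
                      return Box(admin_priv, PublicKey(eph_pub)).decrypt(boxed)
                  ```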

                  edit: ssh allows forced commands with ssh keys, which is important for access control.
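
                  For example, the writer’s key on the storage server would carry a forced command in authorized_keys, something like this (path and options are only illustrative):

                  ```
                  command="/usr/local/bin/backup-recv",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAA... backup-writer
                  ```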

            2. 1

              I will look into it, thank you. It seems so many developers just stop at symmetric keys and I have no idea why.

              I’ll speculate on why developers favor symmetric algorithms. You can use them for authenticated encryption and integrity checks. They’re many times faster than asymmetric algorithms. They have hardware support more often. Quantum computers can’t crack them with existing algorithms. The symmetric ciphers use operations more familiar to programmers than things like elliptic curves. Developers favor stuff they’re less likely to screw up. They can even do key exchange if they can share a secret ahead of time, or just over a channel unlikely to be intercepted by those they’re really worried about.

              All that reduces down to: asymmetric ciphers are mainly needed for key exchange when there’s nothing pre-shared or trusted for the initial exchange. There are libraries for that last part where the hard work is already done. A few probably got updated for the latest, esoteric attacks. Developers go with them. Sometimes it’s an external tool like GPG moving an initial secret that’s manually fed into the symmetric system. I used to do that.

              Edit: Added quantum immunity. Thanks for reminder, ianopolous.

          2. 3

            No one has mentioned zfs send --raw with encrypted volumes yet? You get snapshots, zero knowledge storage, compression, dedup (with some cryptographic security tradeoffs), data integrity guarantees, …

            1. 1

              Maybe a good option on systems with good zfs support. Does zfs send have append-only or write-only access controls?

              1. 4

                In zfsonlinux 0.8, you can send an encrypted dataset without unlocking it, which means that the person on the other end doesn’t know what they’re receiving and can’t read it. That covers the “write only” bit; “append only” should be covered by snapshots, which are immutable.
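
                Something along these lines, for example (dataset names are made up; assumes zfsonlinux 0.8+ and an already-encrypted dataset):

                ```
                zfs snapshot tank/data@2018-11-01
                zfs send --raw tank/data@2018-11-01 | ssh backup-host zfs receive backup/data

                # later, send only what changed since the last snapshot:
                zfs send --raw -i tank/data@2018-11-01 tank/data@2018-12-01 | ssh backup-host zfs receive backup/data
                ```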

            2. 5

              I’ve recently been investigating deduplicating backup software for my personal use, and ended up deciding on borg, but one major annoyance of borg is that it cannot easily use a cloud storage API (like aws s3, wasabi, backblaze, google cloud, etc.) as a backing store. Some other backup tools do allow this (restic for instance), but lack other features I care about.

              I also note that you didn’t mention “compression” as a desideradum. As long as you care about deduplicating data to save backup storage space, however, you probably also care about compressing that same stored data as much as possible.

              Also, looking at your asmcrypt repo, I have to ask: why are you writing new software in C in 2018? Especially cryptography software?

              1. 4

                As another note, I’m currently working on a product that may be close to what you would like: https://backupbox.io/ (and there is no C in that code base, it’s all Go currently). Supporting borg is a specific goal; unfortunately it isn’t quite ready for public use.

                1. 3

                  Just FYI, there’s already a project called Box Backup.

                  1. 1

                    thanks for letting me know.

                2. 4

                  If you haven’t looked at https://www.tarsnap.com/ you may find it interesting. The backend is closed source, which I realize may be a problem, but the frontend is open source and free for perusal.

                  1. 5

                    “The Tarsnap client source code is available so that you can see exactly how your data is protected; however, it is not distributed under an Open Source license.” - https://www.tarsnap.com/open-source.html

                  2. 2

                    https://www.google.com/search?q=desideradum

                    Please note your typo, otherwise thanks for the new word.

                    1. 1

                      Bah I thought the word didn’t feel right when I typed it!

                    2. 2

                      edit: Added note to post. Thank you.

                      With regard to compression, this will come in a post about deduplication.

                    3. 2

                      Various things throughout the blog posts linked make me nervous. Using HMAC as a random oracle isn’t ideal, but probably not immediately problematic. HMAC then Encrypt makes me very nervous.

                      Taking a step back: you have to assume the encrypting party is not compromised at time of backup, or else confidentiality is trivially lost. Why not then symmetrically encrypt with authenticated encryption, public-key encrypt the data encryption key and send it along, then locally delete the symmetric key?
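
                      Concretely, something in the spirit of this sketch (PyNaCl, hypothetical names; an illustration of the envelope idea, not a vetted design):

                      ```python
                      from nacl.public import PublicKey, SealedBox
                      from nacl.secret import SecretBox
                      from nacl.utils import random

                      def backup_chunk(admin_pub: PublicKey, chunk: bytes):
                          dek = random(SecretBox.KEY_SIZE)             # fresh data-encryption key
                          ct = SecretBox(dek).encrypt(chunk)           # authenticated encryption, random nonce
                          wrapped = SealedBox(admin_pub).encrypt(dek)  # only the backup admin can unwrap
                          del dek                                      # locally forget the symmetric key
                          return wrapped, ct
                      ```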

                      Deduplication leads to potential side-channel attacks on backup confidentiality in a model which assumes compromised backup hosts.

                      Randomized encryption is critical for confidentiality: don’t use textbook RSA.

                      Obviously don’t put anything important into a homebrewed cryptosystem.

                      1. 1

                        HMAC then Encrypt makes me very nervous.

                        If there is evidence for that, explain or cite. HMAC is designed specifically for cases like this.

                        Doing no encryption, or having no access controls, or plain SHA deduplication all make me far more nervous than HMAC then encrypt. Let me know what you use so that you are not nervous? I don’t claim this is perfect, just that I think it is an improvement over what I have seen.

                        Taking a step back: you have to assume the encrypting party is not compromised at time of backup, or else confidentiality is trivially lost. Why not then symmetrically encrypt with authenticated encryption, public-key encrypt the data encryption key and send it along, then locally delete the symmetric key?

                        This is essentially what nacl secret box does, which is what packnback uses. Not sure where the criticism is here.

                        Obviously don’t put anything important into a homebrewed cryptosystem.

                        nacl crypto box is not home brewed. It was designed by a respected cryptographer. I also didn’t design HMAC or sha256. Explain to me which part you consider home brewed?

                        Deduplication leads to potential side-channel attacks on backup confidentiality in a model which assumes compromised backup hosts.

                        By this logic we should just do nothing. This design follows tiers of security, and potential side channels AFTER the storage host is compromised are better than what most systems offer today.

                        1. 1

                          MAC then encrypt is one method of combining authenticity and confidentiality, but while it isn’t as broken as MAC-and-encrypt (which doesn’t necessarily satisfy IND-CPA) it is potentially problematic for attacks which are mitigated by IND-CCA2 security such as padding oracles.

                          Basically, I’m nervous because they’re using an HMAC sort of as a random oracle rather than for message authenticity, but they might also be relying on it for authenticity, in which case encrypt-then-MAC would be preferable.

                          NaCl is very likely good software. I’m glad they’re using it; all I meant was that even good parts can be put together in potentially vulnerable ways, and that’s particularly easy to do with crypto.

                          To your last point, I strongly disagree that my statement implies inaction. I simply meant to point out a class of attacks that need to be carefully avoided when implementing a deduplicating backup system.

                          Reasonable explanations for any interested https://en.wikipedia.org/wiki/Authenticated_encryption https://en.wikipedia.org/wiki/Chosen-plaintext_attack https://en.wikipedia.org/wiki/Adaptive_chosen-ciphertext_attack

                          1. 1

                            Reasonable explanations for any interested https://en.wikipedia.org/wiki/Authenticated_encryption https://en.wikipedia.org/wiki/Chosen-plaintext_attack https://en.wikipedia.org/wiki/Adaptive_chosen-ciphertext_attack

                            Can you explain to me how a chosen-plaintext attack would work in this scenario? As was mentioned, it is essentially the same idea as GPG-encrypted email. Are you suggesting the ability to send encrypted emails to someone is the same as a chosen-plaintext attack? Because that seems very wrong to me.

                            Sending arbitrary encrypted backups does not give you access to the time and place the administrator accesses your poisoned backups, so I’m not sure what you are suggesting. Obviously it is up to the administrator to detect the time of compromise and discard backups after that point in time, but no system can avoid that issue.

                            Basically, I’m nervous because they’re using an HMAC sort of as a random oracle rather than for message authenticity, but they might also be relying on it for authenticity, in which case encrypt-then-MAC would be preferable.

                            So in encrypt-then-MAC, how do you ensure a stable nonce so that deduplication can function? The only approach I have seen is deriving the nonce from the plaintext, in which case the nonce is now what leaks bits. HMAC seems far preferable to that.

                            I strongly disagree that my statement implies inaction. I simply meant to point out a class of attacks that need to be carefully avoided when implementing a deduplicating backup system.

                            You implied it was a ‘home brewed’ and untrustworthy set of ideas… then you offered no alternative.

                            1. 1

                              Deterministic encryption (such as textbook RSA) is vulnerable to chosen-plaintext attacks which compromise indistinguishability from randomness, and with potentially known plaintext file contents might compromise message confidentiality. A backup server might be able to run this kind of attack; it can be thought of as an active MitM adversary between the encrypting machine and the decrypting machine.

                              Now, using NaCl or similar means that it’s unlikely that the encryption is deterministic, which is great. The next class of attacks I’d be worried about are of the Vaudenay/Bleichenbacher style, which are mitigated through authenticated encryption, or something like OAEP; hopefully one of these is used in the implementation. The argument “we don’t need CCA2 security because there isn’t a decryption oracle” (not that you’re making that argument) falls through in too many historical cases where a decryption oracle is later found, and thus I’d strongly lean toward AE just as a best practice.

                              Regarding encrypt-then-MAC vs MAC-then-encrypt, and HMAC: I’m not sure what the best option for a stable nonce is. My comment was simply to point out that MAC-then-encrypt doesn’t have the property of ciphertext integrity, which the implementation may want or rely on, and to be careful to avoid relying on it for anything the construction does not necessarily provide. Perhaps something involving hashing of authenticated-encryption ciphertexts would solve this problem, or something similar (intuitively I think there’s probably something simpler than HMACing the plaintext), but I don’t want to claim anything without in-depth analysis.

                              Regarding offering an alternative: raising a potential concern comes with no obligation to offer a solution. Again: in-depth analysis that I don’t necessarily consider myself qualified to perform would be required to ensure that this system is secure, and I wouldn’t want to suggest an alternative or definitively claim security or insecurity without such analysis.

                              What concerns me is that implementation has started seemingly without thorough exploration of these threat models… unless this project is simply for educational purposes, in which case, hooray, go for it, but make that clear in your documentation. But if this will ever be recommended to anyone as an actual tool for securing important backups, it would need thorough consideration of the above, and likely many more, subtler potential issues.

                              1. 2

                                Vaudenay/Bleichenbacher style … which are mitigated through authenticated encryption.

                                Yes, there is another layer of authenticated encryption used outside of the dedup key. This is provided by NaCl. The encryption is a combination of Salsa20 and Poly1305 with random nonces. My current implementation follows all guidelines from NaCl.

                                As far as I can see, the main complaint you have is that there is an HMAC of the plaintext alongside the ciphertext. This is an open question that warrants more investigation, but as things currently stand, the state of the art is that HMAC is not reversible, and NaCl with a random nonce is secure even if the plaintext is known. In packnback’s case the plaintext may theoretically be hinted at by sophisticated attacks that don’t exist yet, attacks that may require breaking HMAC-SHA256 or Salsa20 or the interaction between the two. The day that happens is the day the project becomes packnback2.
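
                                The overall shape is roughly this simplified sketch (not the actual code; the real tool wraps the encryption with the ephemeral-key box described upthread rather than a plain symmetric secretbox):

                                ```python
                                import hmac, hashlib
                                from nacl.secret import SecretBox

                                def store_chunk(dedup_key: bytes, enc_key: bytes, chunk: bytes):
                                    # Keyed address: stable for identical chunks, opaque without dedup_key.
                                    address = hmac.new(dedup_key, chunk, hashlib.sha256).hexdigest()
                                    # NaCl secretbox with a random nonce, independent of the HMAC above.
                                    ciphertext = SecretBox(enc_key).encrypt(chunk)
                                    return address, ciphertext
                                ```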

                                There are no chosen-ciphertext or adaptive chosen-ciphertext attacks that I can see either, because encrypted data is written one way by design; there is no feedback to an attacker.

                                The storage provider itself is in a semi-privileged position; if a fundamental weakness were found, the system would degrade to being signed plaintext backups. But again, the design is layered such that even this worst-case failure mode is still better than many providers today. You still get strict append-only semantics protecting historic data from a compromised client, enforced via access controls such as HTTPS auth or ssh forced commands.

                                But if this will ever be recommended to anyone as an actual tool for securing important backups, it would need thorough consideration of the above, and likely many more, subtler potential issues.

                                Of course; that is the reason this is public to begin with. I have considered this and other issues more than is written up, but of course more specialized help is wanted. It does say “work in progress” after all. I don’t see why you are so sure subtle issues cannot be resolved.

                                If a real fundamental flaw is found I am totally willing to scrap the idea completely; I just think it is an improvement on what exists currently.

                                1. 2

                                  You misread my comments as complaints :) I’m all for implementation. Given what you’re saying in your last comment, it sounds like detailed architecture documentation would be useful; I didn’t gather that any of that was going on short of looking at the code. I think that such documentation, along with reasonable threat modeling and analysis, will likely be the best way to encourage people to give you that more specialized help, and to avoid people like me asking the really basic questions.

                                  Best of luck!

                                  1. 2

                                    I tend to get overly defensive - $dayjob is slowing things down, but I would love to have a detailed threat model laid out and all formats and protocols fully documented.

                                    Thanks for taking the time to give feedback; it is valuable because people definitely would have questions like these.

                      2. 2

                        I’ll be closely watching this. As a happy Borg user, the passphrase-in-the-clear thing has always niggled at me.

                        1. 1

                          Passphrase-in-the-clear?

                          1. 1

                            Yes, the archive passphrase has to be stored somewhere (inside your backup script etc).

                            https://borgbackup.readthedocs.io/en/stable/quickstart.html#automating-backups

                            I don’t use pass.

                          2. 1

                            Thank you. Progress will be slow as it’s difficult to prioritize OSS, but hopefully it will be steady. I have lots of existing code and prototypes already to pull ideas and code from.

                          3. 1

                            I read the “Work begins” post but couldn’t find much info on how it’s going to be implemented. Data deduplication along with client-side encryption is hard and I’m curious how this will be solved.

                            1. 3

                              Sure, I will write it up and post it on lobste.rs; it might be generally interesting for people even if the project doesn’t complete soon.

                                1. 1

                                  Wow none of those are what I had in mind (I think). Interesting…

                              1. 1

                                Hopefully this explains some of it… https://packnback.github.io/blog/dedup_and_encryption/