1. 86

  2. 11

    Regarding a Rust version of the Argon2 module: check out the Orion crate. It’s pure Rust and has easy-to-use APIs for various modern crypto primitives, including Argon2i and XChaCha AEAD. I also know that the maintainer is extremely interested in adapting the API to make it easier and fit people’s use cases.

    1. 10

      Aside: I wish we could banish the word ‘modern’ from software descriptions. It communicates next to nothing.

      1. 11

        Reminds me of how fish’s webpage still describes it as “Finally, a command line shell for the 90s” :)

        1. 13

          Fish started in the 2000s, so I believe that’s a joke about the pace of UX advancement in text shells.

        2. 7

          Actually I find it useful when it’s used in the context of C++ libraries. The “modern library” tag suggests to me that it uses one of the newer C++ dialects, doesn’t use autotools, is more user-friendly, etc.

          1. 2

            This is a pet peeve of mine as well. It strikes me as a weasel word. To me, a good sniff test would be to imagine reading a description of a utility from a 20 year old archive. Would the word “modern” help the reader understand the tool? Probably not.

            1. 1

              I can understand the technical idea. But for me it does make sense for archive formats (and looking at the date).

            2. 8

              How recoverable is a corrupted archive? Is data lost only at the corrupted location, or would the whole archive be lost?

              1. 1

                Depending on the corruption, it may all be lost. The archive is validated but has no error-correction metadata. I pondered raid-like wrapper bottles, but haven’t done anything about them yet.

                1. 1

                  The de-duplicated database sounds similar to solid compression, which would lose the whole archive after a small amount of damage, but the streaming aspect makes me wonder if it’s organized in a way that enables resiliency.

              2. 5

                Very cool! Works well. Gonna use this for a local sneakernet I’m running. 50GB of data encrypted in 3.2 minutes.

                Do you think this code could benefit from parallelization? Make it even faster?

                Also, any plans to publish to crates.io? It’d be really handy if I could just say cargo install bitbottle

                1. 6

                  I’ll add “figure out crates.io” to my to-do list. :) I’m not sure if parallelization would help; the bottleneck seems to be disk I/O or LZMA2, and both are serial. I’m also worried about the complexity cost of adding concurrency, unless it makes a huge difference.

                  1. 4

                    the bottleneck seems to be disk I/O or LZMA2, and both are serial. I’m also worried about the complexity cost of adding concurrency, unless it makes a huge difference.

                    For what it’s worth, the zstd library includes multithreaded compression (and Rust bindings exist).

                    It and brotli also fill the space of “denser than Snappy, but faster than LZMA2” and have faster decompression than LZMA2 even at their denser settings. zstd started from the fast side (it’s from Yann Collet, LZ4’s author) and brotli started from the dense side (one of the authors is Jyrki Alakuijala, a Google compression specialist), but both offer a wide range of speed/density tradeoffs now.

                    1. 1

                      Ye, I’d look into zstd or lz4 instead of lzma2.

                  2. 5

                    The container format (bottle) is easily extensible for future compression or encryption algorithms.

                    Why would you tie your archive format to your compression and encryption? Seems like a step backwards to the ZIP old days…

                    1. 13

                      It’s helpful if you want to be able to extract/decrypt a single file. If encryption is a layer on top of the archive, like .tar.gz.age, then you have to decrypt the (potentially very large) archive just to get a single file.

                      Same goes for compression.

                      1. 10

                        There’s a tradeoff. With the tar + {gzip,bzip2,xz,lz4,…} model, you can treat the entire file as a single stream and the compression can easily handle cross-file redundancy. If you have a hundred files of English text then your compression dictionary from the first one can be reused across all of them. The downside from this is that you need to decompress before you can access metadata. A zip file, in contrast, effectively does the compress step separately: the metadata is at the end of the file and contains the index. You can pull a single file out of a zip, which is why it’s used as the basis for things like OpenOffice documents and Java Jar files: it’s easy to pretend that a zip file is a read-only filesystem.
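
As a small illustration of the zip side of that tradeoff, here is a sketch using Python's standard `zipfile` module. The member names and contents are made up for the example; the point is only that each member is compressed on its own and can be read back through the index without decompressing the others.

```python
import io
import zipfile

# Build a small zip in memory; each member is deflated independently.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("docs/a.txt", "first file\n" * 100)   # hypothetical member
    zf.writestr("docs/b.txt", "second file\n" * 100)  # hypothetical member

# Reopen it: the index stored at the end of the file lists every member,
# so one member can be located and decompressed on its own.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
    info = zf.getinfo("docs/b.txt")
    body = zf.read("docs/b.txt")

print(names, info.file_size, info.compress_size)
```

The cost is visible in the same sketch: the two members share a lot of text, but neither can borrow the other's compression history.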

                        There are some interesting middle points in the design space. Some modern compression algorithms (e.g. lz4 and zstd) provide a ‘dictionary mode’ where you can build a dictionary separately and then use that to compress individual files. You could build a compression format on top of this by doing two passes: scan the files to build a dictionary and then do file-at-a-time compression and store the dictionary, the individual compressed files, and a metadata dictionary. You’d then be able to load the dictionary into memory and do fast random reads within the container. You can do this with existing tools and a bit of scripting, if you run, say, zstd over your input files in dictionary mode, then compress each file individually and add the compressed files + the dictionary to a tarball (or pax archive). That doesn’t require a new format, just a modification to the tool. I’m quite tempted to try adding this mode to FreeBSD’s tar.
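
The two-pass idea can be sketched with Python's standard `zlib`, whose preset-dictionary (`zdict`) option is used here as a stand-in for zstd's dictionary mode. This is only an illustration under stated assumptions: the file names, contents, and hand-picked dictionary are hypothetical, and a real tool would train the dictionary from the inputs (for example with `zstd --train`) rather than choose it by hand.

```python
import zlib


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress one file on its own, seeded with a shared dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress one file; the same dictionary must be supplied."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical input files with a lot of shared vocabulary.
files = {
    "a.txt": b"the quick brown fox jumps over the lazy dog\n" * 3,
    "b.txt": b"the lazy dog sleeps while the quick brown fox jumps\n" * 3,
}

# Pass 1 (assumption): a hand-picked dictionary standing in for a trained one.
shared_dict = b"the quick brown fox jumps over the lazy dog"

# Pass 2: compress every file independently against the shared dictionary.
archive = {name: compress_with_dict(body, shared_dict) for name, body in files.items()}

# Random access: only the dictionary and one member are needed.
restored = decompress_with_dict(archive["b.txt"], shared_dict)
print(restored == files["b.txt"])
```

Storing `shared_dict` next to the compressed members (in a tarball or pax archive, as suggested above) is what makes each member readable without streaming through the rest.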

                        I didn’t read too much detail about what this thing is doing because the crypto stuff made me super nervous (there are a lot of difficult problems in designing an encrypted container and the README didn’t contain any discussion of them) so I’m not sure exactly where this sits in the space.

                        1. 1

                          I like your instincts about crypto. :) The mechanism is documented in format.md here.

                          I tried to stick with common patterns and algorithms, with a heavy weight toward DJB’s (of NaCl) work, because it’s often both faster and feels more trustworthy. This is why some of the terminology (“sealed box”, “xchacha”) is weird. (I also like SSH better than GPG because it just feels easier to use.) I’ve run this code by people I trust in this space, but will always welcome new criticism/advice – it’s been a long long time since paramiko.

                          1. 2

                            Thanks. That doesn’t look obviously wrong (and I’m not sufficiently qualified to tell you if it is non-obviously wrong), but it feels like it’s a bit pointless as an integrated thing. If the entire file is encrypted as a stream then you lose the benefits of a structured format. You may as well just do the tar.gz thing and wrap your archive in a separate encryption format. Then you have complete crypto agility (your archive format doesn’t need to know anything about your crypto because it only ever sees plaintext).

                            To me, the value of folding the encryption into the archive format would come from being able to separately decrypt individual files. If I have a 50 GiB archive, being able to separately decrypt the metadata and then decrypt and decompress individual files without having to stream-decrypt the whole thing would be useful. Unfortunately, that’s really hard to get right.

                            By the way, from the writeup it looks as if you’re using NaCl? I’d really recommend looking at libsodium: the APIs are much improved over the original and are very difficult to misuse.

                            1. 1

                              Yeah, the goal of the unified file format was to make it easier for my friends to do a basic encrypted, compressed archive of a folder without having to use several different tools, some of which (gpg) are user-antagonistic, and some of which (tar) have a ridiculously fragile format. I realize this isn’t for everyone.

                              Per-block encryption should be possible the same way as per-archive in the current scheme, just moving the bottle down a few layers. It would mean each block has its own key, each sealed for each recipient, so the overhead would increase slightly, but I don’t see why that wouldn’t be okay… though I haven’t tried implementing it. :)

                              I believe libsodium is just an alternate implementation of the NaCl algorithms. I’m actually using “dryoc”, which is a rust port that’s probably different on a few other axes.

                              [edit because my thumb hit “send” before I was done]

                              1. 1

                                Per-block encryption should be possible the same way as per-archive in the current scheme, just moving the bottle down a few layers. It would mean each block has its own key, each sealed for each recipient, so the overhead would increase slightly, but I don’t see why that wouldn’t be okay

                                For one thing, it would be difficult to avoid leaking the sizes of the individual files (or you need to explicitly ensure that this isn’t part of your threat model). Similarly, if the metadata is separately protected then you have to be very careful about vulnerability to known-plaintext attacks (this is what killed the original zip encryption), because the metadata contents are often easily guessable (as are the contents of an individual file). I believe most of the constructions in NaCl should be resilient against this kind of thing, but it’s the sort of thing that I’d want to see in a threat-model document.

                                I believe libsodium is just an alternate implementation of the NaCl algorithms.

                                I believe it is the opposite: it uses the implementation of the algorithms from NaCl (and is, in fact, a fork of NaCl, not a reimplementation), but exposes an API that is much harder to misuse. Some of the comments in your README suggest that NaCl is making you think about things that Sodium explicitly makes sensible decisions on and avoids exposing to the user unless they want to go past the high-level APIs. Not sure if dryoc does the same thing, but there are direct libsodium bindings available for Rust.

                      2. 4

                        File contents are stored as a database of de-duplicated chunks using buzhash, similar to common backup utilities.

                        Interesting, I always wondered how they did that…whether there was a “standard solution” or did people just wing it?

                        Is this also how GridFS splits files into 16mb chunks?

                        1. 6

                          The standard method is to split the file using a “rolling hash.” Rsync also uses this method for synchronizing files remotely. Interestingly, the whole area of rolling hashes is a bit of a software patent minefield; it’s probably better to make some random thing up than to try to use an existing rolling hash, since the existing one might well fall under some overly-broad software patents.

                          1. 2

                            I don’t believe that rsync uses variable-length chunks based on a rolling hash signature like leading/trailing zeroes, unless it’s changed since I looked at it many years ago. I believe it computes hash signatures (for both rolling and crypto hashes) over fixed-size chunks, so it can’t benefit from the optimization that “content-based chunking” affords for changes that don’t change the contents of a chunk, only its displacement within the file.

                            1. 1

                              It’s in the wikipedia page for rsync. I haven’t actually looked at how it works though.

                              1. 1

                                Ah right, I’d forgotten that detail, thanks for the pointer. The algorithm still uses fixed-width chunks though, not “content-based chunking” (which, again, determines chunk boundaries based on bit patterns in the rolling hash like leading/trailing zeroes). Content-based chunking doesn’t produce chunks of fixed width, just fixed average width.
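
To make the distinction concrete, here is a minimal sketch of content-based chunking with a buzhash-style rolling hash. It is not bitbottle's or rsync's code: the window size, the boundary mask, and the way the byte table is derived are all assumptions chosen for the example, and the minimum/maximum chunk sizes that real tools enforce are left out for brevity.

```python
import hashlib
import random

WINDOW = 16          # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero: ~64-byte average chunks

# Deterministic table mapping each byte value to a 32-bit pseudo-random word.
TABLE = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
         for b in range(256)]


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit value left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def split_chunks(data: bytes):
    """Return (start_offset, chunk) pairs, cutting where the hash matches MASK."""
    out, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rotl32(h, 1) ^ TABLE[byte]            # slide the new byte in
        if i >= WINDOW:                           # slide the oldest byte out
            h ^= rotl32(TABLE[data[i - WINDOW]], WINDOW)
        if (h & MASK) == 0:                       # boundary depends on content only
            out.append((start, data[start:i + 1]))
            start = i + 1
    if start < len(data):
        out.append((start, data[start:]))
    return out


rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(4096))
shifted = b"inserted prefix" + original           # same content, displaced

orig_chunks = split_chunks(original)
shifted_set = {chunk for _, chunk in split_chunks(shifted)}
reused = [c for start, c in orig_chunks if start >= WINDOW and c in shifted_set]
print(len(orig_chunks), len(reused))
```

Because each boundary depends only on the last `WINDOW` bytes, every chunk that starts after the first window reappears unchanged once data is inserted in front of it. With fixed-width chunks, the same insertion would shift every boundary and no chunk would match.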

                            2. 2

                              It’s also not particularly difficult, I built myself one a few years ago just to make sure I understood the concept. It’s super-useful.

                            3. 4

                              The GridFS page you linked to says “By default, GridFS uses a default chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary.” So it doesn’t sound like there’s any rolling hash going on. The page does mention the size 16MB, but only as the size limit(!) of the BSON format.

                              1. 2

                                ah ok, makes sense.

                                size limit(!)

                                yeah, BSON does indeed have a size limit. but it’s per-document, not per-field. as most single documents don’t get anywhere near that size limit, it’s rare to find GridFS users, at least in my experience. at a past job we tried to store images on GridFS but it offers zero additional benefits over just uploading to GCS or S3.

                            4. 4

                              If you build something new with SSH keys, look into also supporting the ed25519-sk keytype, which is the FIDO based smartcard (yubikey etc) key type. Then you can encrypt your archive with a hardware token.

                              1. 2

                                That would be cool! I wish this stuff was documented, it was quite a pain to get even the standard Ed25519 keys working.

                              2. 3

                                All important posix attributes are preserved (owner & group by name, permissions, create/modify timestamps, symlinks).

                                I can’t really imagine a scenario where it makes sense to preserve owner and group info at all. Owner and group really only make sense in the context of a given machine. The point of an archive file format is to move groups of files from one machine to another. You might want owner and group as part of a backup, but a backup is a different thing than an archive. Backups need to be snapshots that you can add to incrementally, etc. and not “here is a file that contains some other files.”

                                1. 3

                                  During the (long) course of this project, I’ve mostly come around to this point of view. I’m probably going to make user/group storage optional, and off by default.

                                2. 2

                                  How streaming friendly is it? Can I stream a 40GB file from curl into this?

                                  1. 6

                                    “Can” you? Yes. I intentionally made bitbottle/unbottle use stdin/stdout by default where it makes sense, so they can be used in pipes. The entire file format is a stream.

                                    However! “Do I want to” probably depends on how fast you can download 40GB. I live in a tech metropolis, so we have terrible internet service and I wouldn’t enjoy that experience. :)

                                    1. 3

                                      The other half of the question being posed is: can it stream without buffering an arbitrarily large amount of data? I think the number “40GB” was plucked out of thin air as “a number of bytes that is larger than RAM + swap”.

                                      It’s an interesting question because being able to stream writes with roughly O(1)-ish memory (you just need working memory to hold the directory, not the files’ contents) is a large part of the reason why zip put the directory at the end of the file.

                                      1. 2

                                        Buffering should be limited to the size of the file list, or the maximum block size (4MB right now, but tunable).

                                        I handle this by doing multiple passes in “create archive” mode. The first pass assembles the list of files and scans them with buzhash to create a list of block hashes, so the entire file list and block list must fit in memory. (You could offload this into a temporary db file but so far it seems fine.) The second pass streams out the file & block list, then the contents of each block (and its hash) as they are read from disk.

                                        For example, the Star Trek V rifftrax (350MB) uses 267 blocks, and unbottle --dump shows the file’s block list as:

                                        Data stream: 2063640700a020b7bd906092e7043dbae812ad7b3c4e7cd4e22da84b3c657ad3313524f7d7e53d
                                        Data stream: 2060500a00a0202e65ac663933f3413a10c81dd825491a2438d918135f19bcd9e707b944eaea36

                                        which decodes to

                                        • block size 0x76463, BLAKE3 hash b7bd9060…
                                        • block size 0xa5060, BLAKE3 hash 2e65ac66…
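
The decoding can be reproduced with a few lines of Python. The byte layout used here is an assumption inferred only from the two sample lines and the sizes and hashes quoted above, not taken from format.md: one header byte, a little-endian 32-bit block size, two more header bytes, then a 32-byte hash. The header bytes are skipped rather than interpreted.

```python
def decode_entry(hex_line: str):
    """Decode one dumped entry into (block_size, hash_hex).

    Assumed layout (inferred from the samples, not from the spec):
      byte 0       header byte, not interpreted
      bytes 1..4   block size, little-endian unsigned 32-bit
      bytes 5..6   header bytes, not interpreted
      bytes 7..38  32-byte hash
    """
    raw = bytes.fromhex(hex_line)
    size = int.from_bytes(raw[1:5], "little")
    digest = raw[7:39]
    return size, digest.hex()


samples = [
    "2063640700a020b7bd906092e7043dbae812ad7b3c4e7cd4e22da84b3c657ad3313524f7d7e53d",
    "2060500a00a0202e65ac663933f3413a10c81dd825491a2438d918135f19bcd9e707b944eaea36",
]

decoded = [decode_entry(line) for line in samples]
for size, digest in decoded:
    print(hex(size), digest[:8])
```

Reading bytes `63 64 07 00` as little-endian gives 0x00076463, which is where the 0x76463 block size comes from.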

                                        “Expand archive” can happen in one pass, because it can scan the file list first, keeping the metadata and block lists for only the files it cares about, then when it gets to the block data, it can skip blocks it doesn’t care about, and use pwrite to put everything in its right place.

                                        This code lives in archive.rs.
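
For readers who have not used pwrite, the "put everything in its right place" step can be illustrated as follows. This is a hypothetical Python sketch, not the code in archive.rs: the function name, the `(offset, data)` block representation, and the sample data are all made up, and `os.pwrite` is only available on Unix-like systems.

```python
import os
import tempfile


def write_blocks(path: str, total_size: int, blocks) -> None:
    """Write (offset, data) blocks to path in whatever order they arrive."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, total_size)          # size the file up front
        for offset, data in blocks:
            written = 0
            while written < len(data):        # pwrite may write fewer bytes
                written += os.pwrite(fd, data[written:], offset + written)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.bin")
    # The second half of the file arrives before the first half.
    write_blocks(path, 11, [(6, b"world"), (0, b"hello ")])
    with open(path, "rb") as f:
        result = f.read()

print(result)
```

Positioned writes need a seekable file descriptor, so this approach cannot target a pipe or a socket.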

                                        1. 1

                                          Aha, that’s great, thank you.

                                          I take it that that means I’m not going to get good results if I ask it to output directly to something unseekable like a socket?

                                          1. 2

                                            Right, you’ll want something seekable on the receiving end, since the blocks may arrive out of order for some files.

                                            1. 1

                                              There’s an older compression program called rzip which I think did something similar; it has the “you can only output to a regular seekable file, not a socket” contract too.

                                    2. 2

                                      I expect the deduplication method wouldn’t go well with streaming a file larger than memory.

                                    3. 2

                                      Are there any options for storing Extended Attributes?

                                      1. 2

                                        No, but that’s an interesting idea, if there’s a cross-platform way to read & write them.

                                        1. 1

                                          Unfortunately it’s OS-dependent, because nobody really uses them. I don’t know how well supported they are: Linux has standard kernel system calls, as does Mac OS X. The BSDs do too (you can check out exattr.c for implementation details on how I wrapped it). I’m not sure how Windows handles this with respect to NTFS streams.

                                          1. 1

                                            And the different platforms have pretty different constraints on them so a unified API would be rough

                                            1. 1

                                              No, that’s filesystem-dependent, not platform-dependent.