1. 10
    1. 3

      Looks neat.

      Have you considered using Monocypher? Its Argon2 implementation is much simpler than the reference implementation, and therefore likely smaller too. And last time I checked, the non-SIMD variant was a bit faster as well. Monocypher is a whole library, of course, but all you have to do is cut out the code you don’t use (basically everything except Blake2b and Argon2).

      The only caveat is that Monocypher doesn’t have threads, but I’ve seen a paper pointing out that running several single-lane instances of Argon2 in parallel (and then hashing the results) is a bit more secure than running the corresponding multi-lane instance. (Besides, if you aim for compatibility, not all implementations support multiple lanes to begin with.)
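
      A rough sketch of that construction, assuming Monocypher 4.x’s crypto_argon2() API (the function name and the per-instance salt tweak are made up for illustration; a real scheme should follow the paper, and the loop bodies would run on separate threads for actual parallelism):

      #include <stdint.h>
      #include <stdlib.h>
      #include <string.h>
      #include "monocypher.h"

      /* Run several single-lane Argon2id instances (sequentially here),
       * then hash the per-instance outputs together with BLAKE2b. */
      void argon2_parallel(uint8_t out[32],
                           const uint8_t *pass, uint32_t pass_size,
                           const uint8_t *salt, uint32_t salt_size,
                           uint32_t nb_blocks, uint32_t nb_passes,
                           uint32_t nb_instances /* at most 4 here */)
      {
          uint8_t hashes[4][32];
          uint8_t isalt[64]; /* assumes salt_size <= 63 */
          memcpy(isalt, salt, salt_size);

          for (uint32_t i = 0; i < nb_instances; i++) {
              isalt[salt_size] = (uint8_t)i; /* domain-separate the instances */
              crypto_argon2_config config = {
                  .algorithm = CRYPTO_ARGON2_ID,
                  .nb_blocks = nb_blocks,
                  .nb_passes = nb_passes,
                  .nb_lanes  = 1, /* single lane per instance */
              };
              crypto_argon2_inputs inputs = {
                  .pass      = pass,
                  .salt      = isalt,
                  .pass_size = pass_size,
                  .salt_size = salt_size + 1,
              };
              void *work_area = malloc((size_t)nb_blocks * 1024);
              crypto_argon2(hashes[i], 32, work_area,
                            config, inputs, crypto_argon2_no_extras);
              free(work_area);
          }
          /* Combine the instance outputs into the final hash. */
          crypto_blake2b(out, 32, (const uint8_t *)hashes,
                         (size_t)nb_instances * 32);
      }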

      1. 3

        Thanks! :)

        I wasn’t quite aware that Monocypher had a completely independent implementation; I’ll definitely look into making the switch! I’m not supporting SIMD or threads either way (well, I could use SIMD autovectorization if it’s beneficial here), so this should be easy.

        UPD: ha, it seems bigger with -O3: it ends up at 18137 bytes versus 12727 for the reference implementation… until gzipped, in which case Monocypher wins at 5917 vs. 6368 bytes, and even a SIMD-autovectorized version is just 6661 bytes gzipped. I’ll do performance benchmarking later and pick a winner :)

        1. 2

          One thing I forgot: while the Argon2 code itself is as simple as I could make it, there is room for simplification in my Blake2b implementation:

          • Its loading code (crypto_blake2b_update()) is optimised for speed, which is completely unnecessary in the context of Argon2 password key derivation; the byte-by-byte version sketched after this list would do. I can simplify it for you once you’ve integrated it if you’re interested (I have a kickass test suite, so this should go pretty quickly).
          • The core loop is unrolled by default, which bloats code size by several KiB. Compile with -DBLAKE2_NO_UNROLLING to roll the loop again, or just delete the code you don’t want.
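
          To give an idea, the simplest loading code could look something like this (a sketch only; the context fields here are illustrative, not the actual crypto_blake2b_ctx internals, which buffer input as 64-bit words):

          #include <stddef.h>
          #include <stdint.h>

          typedef struct {
              uint8_t input[128]; /* current 128-byte block */
              size_t  input_idx;  /* bytes buffered so far  */
              /* ... hash state, message counter, etc. ... */
          } ctx_sketch;

          /* Byte-by-byte loading: tiny and obviously correct, noticeably
           * slower for pure hashing, irrelevant inside Argon2. */
          void update_simple(ctx_sketch *ctx, const uint8_t *msg, size_t msg_size)
          {
              for (size_t i = 0; i < msg_size; i++) {
                  if (ctx->input_idx == sizeof(ctx->input)) {
                      /* compress_block(ctx); -- flush only when more input
                       * follows, so the last block stays buffered for
                       * finalisation */
                      ctx->input_idx = 0;
                  }
                  ctx->input[ctx->input_idx++] = msg[i];
              }
          }
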
          1. 1

            Yeah, unrolling is what I assumed when I saw that it gzipped well. I’ll keep it since I do ship it gzipped. (I guess I could use it if I make a separate golf branch for an absolute-minimum-size variant, haha)

            re: the first thing – hm, well, I guess another thing I could do is expose blake2b to the user nearly for free – would the optimization be unnecessary for regular hashing of complete buffers as well? In that case, it would be cool if you could introduce a -DBLAKE2_NO_FAST_LOADING.

            (I’m not editing the code, I’m using a git submodule as-is because I like it that way; there’s no need to remove anything from the file, I just let clang shake off everything that’s not reachable from chosen exported functions.)
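
            (For the curious, the build invocation looks something like this — the export name and file names are made up, and you may need -Wl,--allow-undefined or your own memcpy/memset since -nostdlib drops libc:)

            clang --target=wasm32 -Oz -nostdlib \
                  -Wl,--no-entry -Wl,--export=argon2id_hash \
                  monocypher.c wrapper.c -o argon2.wasm
            # wasm-ld garbage-collects everything not reachable
            # from the exported symbols by default.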

            1. 1

              > I guess another thing I could do is expose blake2b to the user nearly for free – would the optimization be unnecessary for regular hashing of complete buffers as well?

              Oh no, I wouldn’t have kept it otherwise. Compared to the simplest byte-by-byte loading code, I get an overall 30% difference or so. It’s pretty massive. If you choose to expose BLAKE2b to your users, you should probably keep it as is.

              > there’s no need to remove anything from the file, I just let clang shake off everything that’s not reachable from chosen exported functions

              Oh, that’s nice. And it removes a major downside of Monocypher being a (mostly) single-file library.

              1. 2

                Whee, it’s done, pushed version 2.0.0. It’s ~2.3 KiB smaller, but this did come at a small performance cost. It turns out that neither -O3 (vs. -Oz) nor keeping the unrolling really benefited performance (only like 10 ms here) while costing a lot of space, and autovectorization made things worse, so I ended up going with “the code golf option” as the only one.

                The results (Deno 1.34.3, Ryzen 5850U on AC power):

                benchmark                time (avg)             (min … max)       p75       p99      p995
                ----------------------------------------------------------- -----------------------------
                argon2id 1 256M         391.22 ms/iter (361.92 ms … 434.52 ms) 397.16 ms 434.52 ms 434.52 ms
                ref.argon2id 1 256M     362.85 ms/iter  (342.3 ms … 384.14 ms) 370.47 ms 384.14 ms 384.14 ms
                hw.argon2id 1 256M      383.57 ms/iter (365.91 ms … 415.49 ms) 386.25 ms 415.49 ms 415.49 ms
                

                “ref” being my previous version and “hw” being this project I just found; I couldn’t get argon2-browser to run in Deno via npm, and argontwo was waaay slower (1.6 s).

                Was 2.3 KiB worth 30 ms? I guess so, since I’m positioning this as size-optimized :)

                1. 1

                  Yeah, the cost of BLAKE2b is basically a constant (inputs don’t scale, only memory hardness does), so it’s normal that you hardly see any difference when playing with loop unrolling. Loop unrolling only matters if you use BLAKE2b on its own, without Argon2.

                  It’s a bummer that my implementation is slower than the reference one; perhaps I should investigate. I wonder, could you run those same tests with even more memory, so we can see how this scales? If we’re lucky I’m only a constant behind, but it’s possible the reference implementation has a better core loop.

                  Saving 2.3 KiB is amazing; I didn’t expect that much, to be honest.

                  1. 1

                    Seems to generally be about 5% slower:

                    benchmark                 time (avg)          (min … max)         p75       p99      p995
                    ------------------------------------------------------------- -----------------------------
                    argon2id 1 8M           10.71 ms/iter     (8.84 ms … 15.5 ms)  11.22 ms   15.5 ms   15.5 ms
                    ref.argon2id 1 8M       10.31 ms/iter    (8.71 ms … 13.01 ms)  11.95 ms  13.01 ms  13.01 ms
                    argon2id 1 256M         389.35 ms/iter (375.97 ms … 403.03 ms) 394.35 ms 403.03 ms 403.03 ms
                    ref.argon2id 1 256M     370.05 ms/iter (358.56 ms … 391.13 ms) 377.41 ms 391.13 ms 391.13 ms
                    argon2id 1 1G            1.65 s/iter       (1.62 s … 1.67 s)    1.66 s    1.67 s    1.67 s
                    ref.argon2id 1 1G        1.56 s/iter       (1.54 s … 1.61 s)    1.56 s    1.61 s    1.61 s
                    argon2id 1 2G            3.31 s/iter       (3.28 s … 3.37 s)    3.32 s    3.37 s    3.37 s
                    ref.argon2id 1 2G        3.07 s/iter       (3.05 s … 3.11 s)    3.09 s    3.11 s    3.11 s
                    
                    1. 1

                      Strange… it looks like my implementation slowly gets worse and worse as you step up the memory… clearly it’s not optimal.

                      Ah, I see: that’s because I always measure with -O3 (native) and you’re using -Oz (Wasm). I’ll try measuring at -Oz and see if there’s some small improvement I can find without sacrificing -O3. If I’m successful I’ll put the result in the next release.

                      1. 2

                        -O3 doesn’t help here, as I’ve already mentioned. I reran all the sizes, changing nothing but -Oz to -O3:

                        benchmark                time (avg)             (min … max)       p75       p99      p995
                        ----------------------------------------------------------- -----------------------------
                        argon2id 1 256M         395.65 ms/iter (381.18 ms … 406.08 ms) 400.99 ms 406.08 ms 406.08 ms
                        ref.argon2id 1 256M      378.7 ms/iter (361.57 ms … 413.97 ms) 380.32 ms 413.97 ms 413.97 ms
                        argon2id 1 1G            1.67 s/iter       (1.65 s … 1.71 s)    1.68 s    1.71 s    1.71 s
                        ref.argon2id 1 1G        1.59 s/iter       (1.56 s … 1.69 s)     1.6 s    1.69 s    1.69 s
                        argon2id 1 2G             3.4 s/iter       (3.36 s … 3.49 s)    3.43 s    3.49 s    3.49 s
                        ref.argon2id 1 2G        3.22 s/iter       (3.18 s … 3.36 s)    3.23 s    3.36 s    3.36 s
                        

                        and, out of curiosity:

                        -O1 1 256M         429.07 ms/iter (410.42 ms … 454.49 ms) 438.14 ms 454.49 ms 454.49 ms
                        -O0 1 256M            1.85 s/iter        (1.78 s … 2.1 s)    1.82 s     2.1 s     2.1 s
                        
                        1. 2

                          Okay, after investigating a little, it seems my compiler is kind of a derp and failed to perform what I thought should have been an obvious optimisation. This patch gained me a ~5% improvement on Argon2i.

                          Long story short: given two 64-bit numbers a and b, and mask = 0xffffffff, we want to compute 2 * (a & mask) * (b & mask). The fastest way to do this is to perform a 32->64 multiply of the masked a and b, then shift the whole thing left by 1. But my compiler does not do that. I didn’t check, but my guess is that it doesn’t take full advantage of the associativity of unsigned multiplication when powers of two are involved, and instead does the shift before the multiplication, forcing a more expensive 64->64 multiply.
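
                          In code, the two forms look like this (a minimal standalone sketch, not the actual patch):

                          #include <stdint.h>

                          uint64_t dbl_mul_before(uint64_t a, uint64_t b)
                          {
                              uint64_t mask = 0xffffffff;
                              /* Parses as (2 * (a & mask)) * (b & mask): the
                               * doubling happens first, so the second multiply
                               * has to be a full 64x64. */
                              return 2 * (a & mask) * (b & mask);
                          }

                          uint64_t dbl_mul_after(uint64_t a, uint64_t b)
                          {
                              uint64_t mask = 0xffffffff;
                              /* 32->64 multiply first, explicit shift last. */
                              return ((a & mask) * (b & mask)) << 1;
                          }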

                          Replacing the multiplication by 2 with an explicit shift done at the very end got me the 5% of speed I had been missing all these years. I haven’t tested, but I believe you should get your 5% back if you take this patch. Or you can wait for the next release; I have an EdDSA optimisation I’d like to try first, then I’ll publish it.

                          In any case those 5% matter, thanks a ton for the tip.

                          1. 1

                            Interesting… I’ve just taken a look at their core loop; their rounds are slightly different from mine, maybe in a way that helps their compiler generate better code? I’ll try something when I get the chance.

        2. 2

          How much memory do browsers allow Wasm? One of the nice things about Argon2 is that you can tune it to need 1 GiB (or more) of RAM, which makes it very hard for an attacker to scale up the number of concurrent attacks when brute-forcing the password. But I wonder what practical limits you’ll see here, particularly when doing this on the client, which may be a mobile device with limited physical RAM.

          1. 1

            The hard limit is 4 GiB right now, but memory64 will happen soon.

            1. 1

              4 GiB is the maximum a Wasm memory can express due to its 32-bit offsets, but do web browsers really allow 4 GiB Wasm memory objects in a web page? iOS doesn’t even allow native apps that much physical memory.