1.

    Sometimes, rolled loops are faster too. One of the latest patches to my crypto library went like this:

    -    ctx->c[0] = load32_le(message +  0);
    -    ctx->c[1] = load32_le(message +  4);
    -    ctx->c[2] = load32_le(message +  8);
    -    ctx->c[3] = load32_le(message + 12);
    +    FOR (i, 0, 4) {
    +        ctx->c[i] = load32_le(message +  i*4);
    +    }
    

    The result was a 5% speed boost over the whole Poly1305 computation, not just the loop itself. Why? I haven’t looked at the assembly, but I’m compiling with -O3 -march=native, so I’m pretty sure rolling my loop helped the compiler notice that it could vectorise it (basically replacing four 32-bit loads with a single 128-bit load).
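
    (In case it helps: FOR and load32_le are thin portable-C helpers. The sketch below shows a typical definition of each; it is an assumption based on common idioms, not necessarily the exact code.)

        #include <stddef.h>
        #include <stdint.h>

        /* Assumed definition: a plain counted loop over size_t. */
        #define FOR(i, start, end) for (size_t i = (start); i < (end); i++)

        /* Assumed definition: portable little-endian 32-bit load.
         * Byte-wise, so it has no alignment or endianness requirements.
         * At -O3 -march=native, compilers recognise this pattern and emit
         * a single 32-bit load; across a rolled loop, they can often merge
         * four of them into one 128-bit vector load. */
        static uint32_t load32_le(const uint8_t s[4])
        {
            return (uint32_t)s[0]
                 | ((uint32_t)s[1] <<  8)
                 | ((uint32_t)s[2] << 16)
                 | ((uint32_t)s[3] << 24);
        }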


    An even more interesting example is the inner loop of Blake2b. When I unroll it on modern processors, I get a 20% speed boost. But on very small chips, it can get over twice as slow.

    The thing is, this inner loop accesses a table of predefined values. Forcibly unrolling the loop lets the compiler do constant propagation, which avoids many of those lookups. On the other hand, it makes the whole loop about 5 KB bigger, and that can strain the instruction cache of smaller processors.
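
    Here is a toy illustration of the trade-off (a made-up table and mixing function, not Blake2b’s actual round):

        #include <stdint.h>

        static const uint8_t sigma[4][4] = {
            { 0, 1, 2, 3 },
            { 3, 0, 1, 2 },
            { 2, 3, 0, 1 },
            { 1, 2, 3, 0 },
        };

        /* Rolled: sigma[i][j] is a real memory lookup on every pass. */
        uint32_t mix_rolled(const uint32_t m[4])
        {
            uint32_t h = 0;
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    h = (h ^ m[sigma[i][j]]) * 0x9e3779b1u;
            return h;
        }

        /* Unrolled: i is a literal constant in each ROUND, so the compiler
         * can fold every sigma[i][j] lookup into an immediate. The price is
         * a much larger function body. */
        #define ROUND(i)                                \
            h = (h ^ m[sigma[i][0]]) * 0x9e3779b1u;     \
            h = (h ^ m[sigma[i][1]]) * 0x9e3779b1u;     \
            h = (h ^ m[sigma[i][2]]) * 0x9e3779b1u;     \
            h = (h ^ m[sigma[i][3]]) * 0x9e3779b1u;

        uint32_t mix_unrolled(const uint32_t m[4])
        {
            uint32_t h = 0;
            ROUND(0) ROUND(1) ROUND(2) ROUND(3)
            return h;
        }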

    1.

      I thought the point of loop unrolling was to get rid of branching by completely replacing the for loop with a long list of all the statements that would be executed. Apparently, bundling several iterations’ worth of statements inside a for loop is also loop unrolling, and the blog post only benchmarks this method.

      It would be interesting to see how well “full loop unrolling” would fare.
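
      A toy illustration of the distinction (my own example, not the post’s benchmark), assuming a simple array sum:

          /* Original loop: one compare-and-branch per element. */
          int sum(const int *a, int n)
          {
              int s = 0;
              for (int i = 0; i < n; i++)
                  s += a[i];
              return s;
          }

          /* Partial unrolling (what the post benchmarks): still a loop,
           * but only one branch per four elements. Assumes n % 4 == 0. */
          int sum_unroll4(const int *a, int n)
          {
              int s = 0;
              for (int i = 0; i < n; i += 4)
                  s += a[i] + a[i+1] + a[i+2] + a[i+3];
              return s;
          }

          /* Full unrolling: the loop and its branches disappear entirely.
           * Only possible when the trip count is fixed at compile time. */
          int sum8(const int *a)
          {
              return a[0] + a[1] + a[2] + a[3]
                   + a[4] + a[5] + a[6] + a[7];
          }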

      1.

        The article addresses that as the limiting case, I guess. If the work in the loop body is W and the per-iteration loop overhead is L, then looping 100 times costs 100W + 100L in total.

        Fully unrolling would mean you just do 100W. Unrolling 10 iterations at a time means you’d do 100W + 10L, i.e. you’ve achieved 90% of the benefit. It’s diminishing returns. Also, the “full unroll” has significantly higher space requirements and only supports a fixed maximum number of iterations. (You can do fewer by jumping part-way into the unrolled body.)
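
        To generalise the arithmetic (my restatement of the numbers above): unrolling N iterations by a factor of k costs

            total = N*W + (N/k)*L

        so k = 10 already eliminates 90% of the overhead, and going all the way to a full unroll (k = N) only recovers the remaining 10%.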

        This is of course probably a massive simplification. My uneducated guess is that pipeline stalls, instruction interleaving, branch prediction and many other things I don’t understand well enough have a significant effect.

        However, in the analysis presented, partial unrolling gets you nearly all of the benefit.