Store forwarding! A rare and highly unpredictable adversary.
I wonder if it would have run as fast as the 32-byte version if the compiler had emitted either a non-inlined call to memcpy or seven 4-byte MOVs instead of two 16-byte MOVs. I would kinda expect this to be limited by memory bandwidth either way.
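To make the three alternatives concrete, here is a rough C sketch of what I mean. Everything in it is my own guess rather than anything taken from the post: the `elem28` type, the function names, the 28-byte size (inferred from "seven 4-byte MOVs"), and the idea that the two 16-byte MOVs overlap at bytes 12..15. An optimizing compiler is free to turn any of these into something else, so the disassembly would need checking before timing anything.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 28-byte element: seven 4-byte fields (an assumption). */
typedef struct { uint32_t w[7]; } elem28;

/* Variant A: two overlapping 16-byte copies, bytes 0..15 and 12..27.
 * This is my guess at what "two 16-byte MOVs" looks like in C. */
static void copy_two_16(elem28 *dst, const elem28 *src) {
    unsigned char lo[16], hi[16];
    memcpy(lo, (const unsigned char *)src, 16);
    memcpy(hi, (const unsigned char *)src + 12, 16);
    memcpy((unsigned char *)dst, lo, 16);
    memcpy((unsigned char *)dst + 12, hi, 16);
}

/* Variant B: seven separate 4-byte copies. */
static void copy_seven_4(elem28 *dst, const elem28 *src) {
    for (int i = 0; i < 7; i++) {
        dst->w[i] = src->w[i];
    }
}

/* Variant C: a single memcpy of the whole element. Whether this stays a
 * real call or gets inlined depends on the compiler and its flags. */
static void copy_memcpy(elem28 *dst, const elem28 *src) {
    memcpy(dst, src, sizeof *dst);
}
```

All three produce the same bytes in `dst`; the only difference is the width and overlap of the loads and stores, which is the part that matters for store forwarding.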
This was a very good and in-depth post. I learned that perf had a ton of counters. Thanks!