Interesting, but is the reason this works different from the reason ye olde manual loop unrolling works?
Loop unrolling doesn’t always give you this benefit.
This technique is about two things: rewriting the algorithm to be branchless, and putting independent work between memory accesses that depend on each other (here, by interleaving iterations). Doing both lets the processor have many memory requests in flight at once instead of waiting on one load at a time.
Unrolling reduces the number of branches but doesn’t necessarily interleave independent memory accesses.
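To make the distinction concrete, here is a rough C sketch. It assumes the algorithm is a lower-bound binary search over a sorted int array, which is my choice of example and not necessarily the one being discussed; the names lower_bound and lower_bound_x2 are made up for illustration. In the first function every load address depends on the result of the previous load, so unrolling that loop would remove some loop overhead but leave the loads serialized. The second function advances two independent searches in lockstep, so the two loads in each step don't depend on each other and can overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound written without data-dependent branches in the loop body:
 * returns the index of the first element >= key, or n if there is none.
 * Each iteration's load address depends on the previous comparison,
 * so the loads form one serial dependency chain. */
static size_t lower_bound(const int *a, size_t n, int key) {
    if (n == 0) return 0;
    const int *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        /* Multiply by the 0/1 comparison result instead of branching.
         * Whether this becomes a conditional move is up to the compiler. */
        base += (size_t)(base[half - 1] < key) * half;
        len -= half;
    }
    return (size_t)(base - a) + (size_t)(*base < key);
}

/* Two searches over the same array, interleaved (hypothetical helper).
 * The loads for k0 and k1 in each step are independent of each other,
 * so the hardware is free to issue both before either completes. */
static void lower_bound_x2(const int *a, size_t n, int k0, int k1,
                           size_t *out0, size_t *out1) {
    assert(out0 != NULL && out1 != NULL);
    if (n == 0) { *out0 = 0; *out1 = 0; return; }
    const int *b0 = a, *b1 = a;
    size_t len = n; /* same array, so both searches share the step count */
    while (len > 1) {
        size_t half = len / 2;
        int v0 = b0[half - 1]; /* independent load 1 */
        int v1 = b1[half - 1]; /* independent load 2 */
        b0 += (size_t)(v0 < k0) * half;
        b1 += (size_t)(v1 < k1) * half;
        len -= half;
    }
    *out0 = (size_t)(b0 - a) + (size_t)(*b0 < k0);
    *out1 = (size_t)(b1 - a) + (size_t)(*b1 < k1);
}
```

Whether the interleaved version is actually faster depends on the compiler, the CPU, and whether the array fits in cache, so treat this as an illustration of the dependency structure and measure before relying on it.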