When accessing more bytes per cache line, the code runs more instructions, and each instruction depends on the previous one because they all write to the same counter. So I assume it's actually slower because the code is more sequential, not because L1 cache reads are expensive.

A better benchmark would compare sizes that can be accessed with a single instruction, or would use multiple independent instructions that can execute in parallel, roughly as in the sketch below.
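
For a concrete picture, here is a minimal C sketch of the two loop shapes (the function names and the four-way unroll are my illustrative assumptions, not taken from the original benchmark):

```c
#include <stddef.h>
#include <stdint.h>

/* Serial: every add depends on the previous one through `sum`, so the
 * loop is bounded by the latency of that dependency chain, not by how
 * fast L1 can actually serve the loads. */
uint64_t sum_serial(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Parallel: four independent accumulators form four separate
 * dependency chains that an out-of-order core can interleave; the
 * loads are the same, only the serialization is gone. */
uint64_t sum_parallel(const uint8_t *buf, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += buf[i];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    for (; i < n; i++) /* leftover bytes */
        s0 += buf[i];
    return s0 + s1 + s2 + s3;
}
```

If the second version runs much faster on the same data, what the first one was measuring was the serial add chain, not the cost of the L1 reads themselves.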