When accessing more bytes per cache line, the code is running more instructions. Each of these instructions depends on the previous one, as they all write to the same counter. So I assume it’s actually just slower because it’s more sequential code, not because L1 cache reads are expensive.
A better benchmark would compare the sizes you can access with a single instruction, or would use multiple independent instructions that can run in parallel.