If you’re benchmarking a really small piece of code in a loop, you also need to be aware of two things.
First, you can spend 50% of the time per iteration on loop overhead (counter update, compare, branch) if you don’t unroll.
Second, CPUs execute multiple instructions in parallel, so successive iterations overlap. If you aren’t careful, measuring the time to do 1000 iterations and dividing by 1000 gives you throughput, which will underestimate the time to run one lone iteration (its latency).
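
To put rough numbers on both effects, here is a toy model rather than a real benchmark. The cycle counts (a 1-cycle body, 1 cycle of loop overhead, a 4-cycle latency, one new operation issued per cycle) and the function names are made up for illustration; real CPUs are messier than this idealized pipeline.

```python
def loop_overhead_fraction(body_cycles, overhead_cycles, unroll=1):
    """Fraction of each loop trip spent on counter/compare/branch overhead."""
    return overhead_cycles / (overhead_cycles + body_cycles * unroll)


def pipelined_cycles(n, latency, issue_interval):
    """Cycles to finish n independent operations on an idealized pipeline.

    The first result appears after `latency` cycles; each later one
    completes `issue_interval` cycles after the previous one.
    """
    return latency + (n - 1) * issue_interval


# Hypothetical numbers, chosen only for illustration.
LATENCY = 4         # cycles for one lone operation, start to finish
ISSUE_INTERVAL = 1  # a new independent operation can start every cycle

# Loop overhead: half the time without unrolling, a fifth when unrolled 4x.
print(loop_overhead_fraction(body_cycles=1, overhead_cycles=1))            # 0.5
print(loop_overhead_fraction(body_cycles=1, overhead_cycles=1, unroll=4))  # 0.2

# Throughput vs latency: the per-iteration average hides the latency.
total = pipelined_cycles(1000, LATENCY, ISSUE_INTERVAL)
print(total / 1000)                                  # 1.003 cycles "per iteration"
print(pipelined_cycles(1, LATENCY, ISSUE_INTERVAL))  # 4 cycles for one lone iteration
```

In this model the averaged figure comes out about 4x smaller than the cost of a single isolated iteration. If latency is what you care about, one common approach is to make each iteration depend on the result of the previous one so they can't overlap.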