This was submitted last month, but went largely unnoticed (or at least uncommented).
It’s a long read, so here’s the takeaway:
The first half of the post is a fairly detailed overview of your typical x86_64 CPU and the surrounding hardware. I enjoyed this part, but you could probably skip it to save time.
The author then describes a small, constrained research kernel he has written and uses it to measure architectural performance counters on a per-cycle basis. With this setup he plots count against cycle number for several performance counters across various asm benchmarks, which lets him observe micro-architectural events within the CPU.
What fascinated me most was that you can actually see speculative execution at work. The last benchmark contains a loop which repeatedly loads from memory, but it is constructed such that the reads should never be committed. You can see the reads occurring at the uop level, but the retired instruction count confirms that they were never more than mis-speculation. How neat!
And here’s a quote which amused me:
if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second. That’s as slow as an Intel 80486 from 1991.
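For a sense of scale, here is a quick back-of-envelope check of that quote. It only uses the two numbers from the quote itself (2.2 GHz and ~15 million instructions per second); the constant and function names are my own, and this is plain arithmetic rather than anything taken from the post.

```python
# Figures taken from the quote above; the names are illustrative only.
CLOCK_HZ = 2.2e9   # 2.2 GHz core clock
MAX_IPS = 15e6     # ~15 million instructions per second with caches disabled


def cycles_per_instruction(clock_hz: float, ips: float) -> float:
    """Average number of clock cycles spent per executed instruction."""
    return clock_hz / ips


def ns_per_instruction(ips: float) -> float:
    """Average wall-clock time per executed instruction, in nanoseconds."""
    return 1e9 / ips


cpi = cycles_per_instruction(CLOCK_HZ, MAX_IPS)
ns = ns_per_instruction(MAX_IPS)
print(f"{cpi:.0f} cycles per instruction, {ns:.1f} ns per instruction")
# prints "147 cycles per instruction, 66.7 ns per instruction"
```

So the quoted figure works out to roughly 147 cycles, or about 67 ns, per instruction. That is in the same ballpark as a trip to main memory, which would fit the idea that every instruction has to wait on DRAM once the caches are off.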