The formula I came up with that matches my Haswell Xeon pretty well is: (L1 + D1 hits) + 5 × (LL hits) + 30 × (RAM hits)
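For concreteness, here’s a minimal sketch of how one might compute that score from a cachegrind output file. It assumes cache simulation was enabled and relies on the standard cachegrind.out format (an “events:” header naming the counters and a “summary:” line holding the totals); the file name and function name are just for illustration.

```python
# Sketch: turn cachegrind's counters into the weighted score above.
# Assumes cache simulation was enabled, so the L1/LL miss counters exist.
def cachegrind_score(path):
    events, totals = [], {}
    with open(path) as f:
        for line in f:
            if line.startswith("events:"):
                events = line.split()[1:]
            elif line.startswith("summary:"):
                totals = dict(zip(events, map(int, line.split()[1:])))

    # All instruction and data accesses.
    accesses = totals["Ir"] + totals["Dr"] + totals["Dw"]
    # Misses in the first-level caches (these hit LL instead).
    l1_misses = totals["I1mr"] + totals["D1mr"] + totals["D1mw"]
    # Misses in the last-level cache (these go to RAM).
    ll_misses = totals["ILmr"] + totals["DLmr"] + totals["DLmw"]

    l1_hits = accesses - l1_misses   # "L1 + D1 hits"
    ll_hits = l1_misses - ll_misses  # "LL hits"
    ram_hits = ll_misses             # "RAM hits"
    return l1_hits + 5 * ll_hits + 30 * ram_hits


# Hypothetical usage:
# print(cachegrind_score("cachegrind.out.12345"))
```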
Are there alternatives to Cachegrind?
Nice writeup @itamarst!
KCachegrind, a popular GUI for visualizing cachegrind profiles, has a built-in Cycle Estimation formula that you might try. It’s nicely described in a mailing list post that also goes into some of the issues with such formulas. The formula at the end of the post is still used today.
Another alternative is system simulators like gem5. Simulators are mainly used by CPU designers and, from what I’ve been able to gather, are very slow. Nevertheless, some simulators could be fast enough for micro-benchmarks that run for a small number of iterations (which is feasible because the results are reliable). Relevant papers:
I’d be interested in hearing if anyone has experimented with using a system simulator for reliable and more accurate benchmarking on CI.
Oh, neat, prior art, thank you! I will look at the formula in more detail tomorrow; at first glance I suspect it’s obsolete, since it uses branch misses, and based on my tests Cachegrind’s branch predictor is vastly inferior to the ones in modern CPUs. And I will look at the simulators.
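For reference, getting branch counters out of Cachegrind at all requires turning the branch simulation on; a small sketch (the benchmark binary and output file name are placeholders):

```python
# Sketch: run a benchmark under cachegrind with both cache and branch
# simulation enabled, so the output includes branch counters (Bc/Bcm)
# alongside the cache counters. Binary and file names are placeholders.
import subprocess

subprocess.run(
    [
        "valgrind",
        "--tool=cachegrind",
        "--cache-sim=yes",    # simulate the caches (explicit for newer valgrind)
        "--branch-sim=yes",   # also simulate the branch predictor
        "--cachegrind-out-file=cachegrind.out.bench",
        "./my_benchmark",     # hypothetical benchmark binary
    ],
    check=True,
)
```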
Initial research suggests most of these simulator projects are either too difficult to use (they require simulating a full OS) or dead, except maybe gem5. But! Following the search trail, it turns out there’s a recent ACM Computing Surveys article on cache simulators, with a companion GitHub repo summarizing the tools it discusses:
So I’m going to read that next.
Nice find!
I also found a more recent and thorough simulator survey from the authors of the paper I linked to above: A Survey of Computer Architecture Simulation Techniques and Tools.
Sniper looks especially promising. It’s among the fastest and most accurate, and can be used like cachegrind for profiling rather than for hardware research (it can actually output callgrind files). Check out this example profile. One issue is the license: the interval model can’t be used commercially without obtaining permission.
Unfortunately, like many of these tools, it also seems to rely on Intel’s Pin for recording, which likewise isn’t licensed for commercial use by default.
I’d love to see a gem5 configuration that’s useful for this kind of thing packaged up in an easy-to-use way. I’m deeply skeptical of Cachegrind: it simulates the kind of cache design that you learn about in an undergraduate computer architecture course, which has almost nothing in common with a modern CPU cache. gem5 has some much more realistic models (though, as a result, it’s also very slow, particularly if you also use a realistic pipeline model).
The other bit missing from this description was what statistics to apply once you’ve got a sensible number of samples. FreeBSD includes `ministat` in the base system for benchmarking. It takes sets of samples and tells you what the difference between them is at 95% confidence. Unfortunately, in common use, it computes the standard deviation of the two sets of samples, determines that they are different, and then runs Student’s t-test (which is valid only for comparing normal distributions with the same standard deviation) on them to determine the confidence intervals. Welch’s t-test lets you compare sets of samples with different standard deviations, but it still assumes normal distributions. A lot of benchmarking results are not normally distributed; they’re asymmetric, because there are a bunch of things (cache contention, interrupts) that will make them slow, so you get a cluster around the maximum speed and a long tail to the right, but no tail of faster things to the left. To make that worse, you may not actually care about median performance as much as you care about worst-case performance. In a distributed system, overall throughput often depends on the tail latency of the slowest component, and so you may be willing to take a 5% slowdown in the average case if it improves your worst case (or vice versa, depending on how your component is being used).

I wouldn’t necessarily use this for the actual “I am going to optimize this code” stage, but it seems good enough for a CI check that flags “you just made your code slower”. Or at least, a lot better than nothing!
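To make the distinction concrete, here’s a small sketch using scipy (the sample arrays are made up): Welch’s t-test versus a non-parametric test, plus looking directly at a tail percentile instead of the mean.

```python
# Sketch: comparing two sets of benchmark timings (made-up numbers).
import numpy as np
from scipy import stats

old = np.array([10.1, 10.2, 10.1, 10.3, 10.2, 14.8, 10.2, 10.1, 12.5, 10.2])
new = np.array([9.9, 10.0, 9.8, 10.1, 9.9, 9.8, 13.9, 9.9, 10.0, 9.9])

# Welch's t-test (equal_var=False) drops the equal-variance assumption,
# but still assumes the samples are normally distributed.
print(stats.ttest_ind(old, new, equal_var=False))

# A non-parametric alternative that doesn't assume normality.
print(stats.mannwhitneyu(old, new))

# If worst-case behaviour is what matters, compare a tail percentile
# rather than the mean.
print(np.percentile(old, 99), np.percentile(new, 99))
```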
I found https://github.com/darchr/gem5-skylake-config; I’d love to hear how well it works (and how fast it runs).
Oh hey, FWIW, I’ve recently been submitting some patches and making some Windows/Mac binary builds for a work thing. The latter will be available soon, but most of the former have been merged in. Let me know if you’re curious about anything; I can look into it for you.
Interesting approach. I wonder: is this much different from running `perf stat` to find out the number of instructions executed (and other cache-related metrics, I suppose) by a running process?

If you’re using consistent hardware then yeah, that kinda works (though you may end up sharing caches with other running processes…).
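For comparison, a sketch of that `perf stat` approach (Linux-only; the benchmark binary is a placeholder), using the machine-readable CSV output so the counts are easy to parse:

```python
# Sketch: count instructions and cache misses for one run via perf stat.
# Requires Linux with perf installed; the binary name is a placeholder.
import subprocess

result = subprocess.run(
    [
        "perf", "stat",
        "-x", ",",   # CSV output, written to stderr
        "-e", "instructions,cache-references,cache-misses",
        "./my_benchmark",
    ],
    capture_output=True,
    text=True,
)

counts = {}
for line in result.stderr.splitlines():
    fields = line.split(",")
    if len(fields) >= 3 and fields[0].strip().isdigit():
        counts[fields[2]] = int(fields[0])
print(counts)
```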
In a series of runs on different cloud VMs, though, while instruction counts will probably (at a guess, untested) be fairly consistent, cache misses will be less so:
Cache sizes may be inconsistent across VM instances.
GitHub Actions says it uses Standard_DS2_v2 Azure instances, and if you look at the Azure docs (https://docs.microsoft.com/en-us/azure/virtual-machines/dv2-dsv2-series?toc=/azure/virtual-machines/linux/toc.json&bc=/azure/virtual-machines/linux/breadcrumb/toc.json) it seems this covers 4 different CPU models, spanning 4 Intel generations, with differing L2 and L3 cache sizes.
You’re sharing hardware with other VMs, which can impact the processor’s caches, in particular L3.
I found some empirical research where they tried to measure cache sizes on cloud instances over time, and it’s noisy, especially for L3: http://www.cs.cornell.edu/courses/cs5412/2018sp/recitation_slides/cs5412-cache-weijia.pdf (slide 22). It seems like if it were just about CPU models you wouldn’t see that level of variation.
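As an aside, one way to see what a given runner actually has is to log the cache hierarchy Linux exposes under sysfs; a minimal sketch (Linux-only):

```python
# Sketch: print the CPU cache hierarchy reported by Linux sysfs, so runs
# on different VM instances can be compared.
from pathlib import Path

for cache in sorted(Path("/sys/devices/system/cpu/cpu0/cache").glob("index*")):
    level = (cache / "level").read_text().strip()
    ctype = (cache / "type").read_text().strip()
    size = (cache / "size").read_text().strip()
    print(f"L{level} {ctype}: {size}")
```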
(I’ll update the article with this info at some point).
Oh, this is very interesting and relevant to a future project of mine. Thank you for doing the deep digging. I didn’t expect such a detailed response!