Writing a blog post that attempts to answer the question “how much can you trust benchmark results from cloud-CI pipelines like Travis?” (e.g. as suggested by this post by BeachApe).
Intuitively, you might think “not much”. Well, it turns out the answer is… “not much” - but I have numbers to prove it. Benchmark results from Travis-CI are substantially noisier than equivalent benchmarks taken from a regular desktop PC. Disappointingly obvious, but it’s nice to put some data behind the intuitive answer anyway.
It does get me thinking, though, about whether a sufficiently smart benchmark harness (potentially some future version of Criterion.rs, the Rust benchmarking library that I maintain) could mitigate the effects of this noise and give reliable-ish numbers even in a cloud-CI environment.
Except the intermediate values are stack-allocated in Rust and C as well. The article also claims that stack-allocated data is faster than heap-allocated data, which is dubious without further clarification. malloc is indeed slower than pushing to the stack, and the stack will almost certainly be in cache, but this program is small enough that its working set will always be in cache, so there should be minimal difference.
I’m curious what the modular function is doing in Rust - the C implementation just uses a simple (n % 2) == 0 while the Rust code does ((n % 2) + 2) % 2 and it’s not clear why.
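One guess as to why (my speculation, not something the article states): Rust’s % is a truncated remainder whose sign follows the dividend, just like C’s, so for negative n the plain n % 2 can yield -1. The ((n % 2) + 2) % 2 pattern normalizes that to a non-negative result. A minimal sketch of the difference:

```rust
fn main() {
    // Rust's `%` is a remainder, not a true modulus: the result
    // takes the sign of the dividend, same as in C.
    assert_eq!(-3 % 2, -1);

    // The `((n % 2) + 2) % 2` pattern maps that back into {0, 1},
    // which only matters when `n` can be negative.
    let n: i32 = -3;
    assert_eq!(((n % 2) + 2) % 2, 1);

    // For non-negative inputs the two expressions agree, so the
    // extra arithmetic would be pure overhead there.
    let m: i32 = 4;
    assert_eq!(m % 2, ((m % 2) + 2) % 2);

    println!("ok");
}
```

If the benchmark only ever sees non-negative inputs, the extra add and second remainder are wasted work, which would slightly disadvantage the Rust version.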
I for one would be fascinated by a more extensive treatment of the subject.
I’ll second that. I love reading about clever trickery to make code go fast, and I’ve just recently gotten interested in GPGPU stuff and this seems like both.
Same, I would love reading about this. I’ve just begun dabbling in GPGPU (and SIMD), and reading about the journey and thought process of others is very enlightening.
If you end up writing this post, I’d like to request that you include some of your mistakes/dead ends as well. Reading about how others have failed helps me to learn.