1. 14

    I for one would be fascinated by a more extensive treatment of the subject.

    1. 4

      I’ll second that. I love reading about clever trickery to make code go fast, and I’ve just recently gotten interested in GPGPU stuff and this seems like both.

      1. 1

        Same, I would love reading about this. I’ve just begun dabbling in GPGPU (and SIMD), and reading about the journey and thought process of others is very enlightening.

        If you end up writing this post, I’d like to request that you include some of your mistakes/dead paths as well. Reading about how others have failed helps me to learn.

      1. 3

        Writing a blog post that attempts to answer the question “how much can you trust benchmark results from cloud-CI pipelines like Travis?” (e.g. as suggested by this post by BeachApe).

        Intuitively, you might think “not much”. Well, it turns out the answer is… “not much” - but I have numbers to prove it. Benchmark results from Travis-CI are substantially noisier than equivalent benchmarks taken from a regular desktop PC. Disappointingly obvious, but it’s nice to put some data behind the intuitive answer anyway.

        It does get me thinking about whether a sufficiently-smart benchmark harness (potentially some future version of Criterion.rs, the Rust benchmarking library that I maintain) could mitigate the effects of this noise and give reliable-ish numbers even in a cloud-CI environment though.
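        As a rough sketch of what “mitigate the effects of this noise” could mean (my own illustration, not anything Criterion.rs actually does): a harness can summarize timing samples with robust statistics like the median, which a single noisy-neighbor outlier barely moves, unlike the mean.

        ```rust
        // Hypothetical sketch: summarizing noisy timing samples robustly.
        // A single large outlier (e.g. a CI VM hiccup) barely shifts the
        // median, while it would drag the mean upward noticeably.
        fn median(samples: &mut Vec<f64>) -> f64 {
            samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
            let n = samples.len();
            if n % 2 == 1 {
                samples[n / 2]
            } else {
                (samples[n / 2 - 1] + samples[n / 2]) / 2.0
            }
        }

        fn main() {
            // Four "normal" samples around 1.0 and one 9x outlier.
            let mut t = vec![1.0, 1.1, 0.9, 1.05, 9.0];
            let med = median(&mut t);
            let mean = 13.05 / 5.0; // = 2.61, badly skewed by the outlier
            assert!((med - 1.05).abs() < 1e-9);
            assert!(mean > 2.0 && med < 1.1);
        }
        ```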

        1. 2
          • Left my old job doing line-of-business Java stuff for satellite communications companies and found a new one doing machine learning software for circuit designers. Quite enjoying myself so far.
          • Wrote a toy raytracer in Rust: https://github.com/bheisler/raytracer
          • Improved the JIT compiler in my NES emulator: https://github.com/bheisler/Corrosion
          • Started blogging at https://bheisler.github.io/
          • Took over maintenance and development of a statistics-driven benchmarking library in Rust called Criterion.rs. It’s a great tool but it was never officially released until I got to it a few months ago. I’ve learned a lot about community management and open source, partly by trial and error and partly by asking other project maintainers.
          1. 18

            Except the intermediate values are stack-allocated in Rust and C as well. The article also claims that stack-allocated data is faster than heap-allocated data, which is dubious without further clarification. malloc is indeed slower than pushing to the stack, and the stack will almost certainly be in cache, but this program is small enough that its working set will always be in cache, so there should be minimal difference.
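            To make that concrete, here’s a minimal illustration of my own (not code from the article): the same intermediate kept on the stack versus boxed on the heap. The heap version pays for one allocation up front, but once both values are in cache, accessing them costs about the same.

            ```rust
            // Stack-allocated intermediate: lives in a register or on the stack.
            fn sum_stack(xs: &[i32]) -> i32 {
                let mut acc = 0;
                for &x in xs {
                    acc += x;
                }
                acc
            }

            // Heap-allocated intermediate: one allocation, then the Box's
            // contents sit in cache just like the stack value would.
            fn sum_heap(xs: &[i32]) -> i32 {
                let mut acc = Box::new(0);
                for &x in xs {
                    *acc += x;
                }
                *acc
            }

            fn main() {
                let xs = [1, 2, 3, 4];
                assert_eq!(sum_stack(&xs), 10);
                assert_eq!(sum_stack(&xs), sum_heap(&xs));
            }
            ```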

            I’m curious what the modular function is doing in Rust - the C implementation just uses a simple (n % 2) == 0, while the Rust code does ((n % 2) + 2) % 2, and it’s not clear why.

            1. 7

              ((n % 2) + 2) % 2 properly handles negatives, though it shouldn’t matter here.