What an excellent article! Not just for the result but for the detailed explanation of how this kind of optimization is done. First improve the function you’re calculating. Then optimize the code so it runs well on a modern CPU. I’m particularly impressed the vectorizing compiler does so well with just a few accommodations, no weird assembler required to get 90%+ of the improvement.
It’s common for vendor compilers (IBM, Intel, etc.) to replace calls to standard library functions with calls to optimized primitives.
Here are some results for the original routine with Intel’s compiler; I added restrict to all pointer arguments.
The author of the article is also on lobste.rs: hi @francesco!
Echoing what @nelson wrote, I also liked the depth of the article. Through it I also found out about https://uica.uops.info/ — that tool is super useful for figuring out how different code behaves.
Quick, tell the NumPy and Matlab development teams :D
@francesco: Nice article! I’ve noticed a typo: you once quote the branches-per-element number as 0.3 instead of 0.13.
Indeed, thanks, I’ve fixed it.
First, this is a good idea. Second, it’s such a good idea that people have already done it, not only for atan2f but for every function in the math library. Use the SLEEF Vectorized Math Library.