1. 23

  2. 9

    This is about CPython builds specifically. The performance will change even more significantly when switching between Cython, PyPy, GraalVM, Jython, whatnot.

    1. 4

      Yeah I noticed this when working with CPython for Oil. I rewrote the build system from scratch and removed a bunch of the code to simplify it, but the result is slower than the stock CPython build (on my distro, Ubuntu).

      The benchmarks have shown a consistent difference for years, around 10%. Like the article, I attributed that to lack of profile-guided optimization, but it could be other things too.

      This is one reason that “oil-native” doesn’t rely on any CPython at all. The codebase is very useful and time tested but it’s not really reusable.

      1. 4

        An app I once worked on, which was CPU bound and only worked on Python 2.4, got either 50% or 100% faster (can’t remember which) when switching from RHEL 6 to RHEL 7 on identical hardware. As far as I could tell this was mainly just from the newer version of gcc.

        That was a weird job in retrospect.

        1. 2

          There was a paper a few years ago that showed significant speed differences in user programs just from changing how the VM was linked together at build time (this was Java IIRC, but same difference). They showed that certain orders of object files were faster than others with the exact same compiler options. Pretty crazy. I will try to find the paper and link it here.

          1. 5

            https://www.cis.upenn.edu/~cis501/previous/fall2015/papers/producing-wrong-data.pdf talks about how environment variables(!) impact performance.

            The thing is, people have repeatedly measured significant speedups from compiling Python with -fno-semantic-interposition; that’s why Fedora and then RHEL enabled it. Docker’s official Python images don’t use it. So for small differences, yeah, it’s hard to tell, but a 20% difference is pretty significant.
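            For reference, here’s a rough sketch of how one might pass that flag when building CPython from source (the flag placement via CFLAGS/LDFLAGS is my assumption; Fedora’s actual build recipe may differ):

            ```shell
            # Sketch: -fno-semantic-interposition lets the compiler inline and
            # optimize calls within libpython directly, instead of routing every
            # call through the PLT in case a symbol gets interposed at load time.
            ./configure --enable-shared --enable-optimizations \
                CFLAGS="-fno-semantic-interposition" \
                LDFLAGS="-fno-semantic-interposition"
            make -j"$(nproc)"
            ```

            This only matters for the shared-library (--enable-shared) build; a statically linked python binary doesn’t pay the interposition cost in the first place.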

            1. 1

              Ah yes, that’s the paper I was thinking of. I misremembered it as Java; it’s Perl in the paper.

              “There are three important points to take away from this graph. First, depending on link order, the magnitude of our conclusion (height of y-axis) can significantly fluctuate (0.92 vs. 1.10). Second, depending on link order, O3 either gives a speedup over O2 (i.e., y value is greater than 1.0) or a slowdown over O2 (i.e., y value is less than 1.0). Third, some randomly picked link orders outperform both the default and alphabetical link orders. Because we repeated the experiment for each data point multiple times, these points are not anomalies but true, reproducible behavior.”

            2. 3

              There’s so much weird performance stuff in CPUs these days. One (relatively) big factor is what the value of the instruction pointer happens to be at your branches: if you’re unlucky, multiple branches can end up at instruction addresses that hash to the same slot in the branch predictor, which could hypothetically tank performance by making otherwise predictable branches unpredictable. Another factor is just how the stack happens to be laid out.

              I heard a story once where a program was measurably slower on Wednesdays. The cause turned out to be that environment variables sit just above the stack in memory, the current day of the week was stored in an environment variable, and the length of the word “Wednesday” just happened to shift the stack down in a way that hurt performance.

              There’s a bunch of good talks on the topic, which cover how this affects benchmarking. You could be making changes in your code and think you’re making great performance gains, when you just happened to change the locations of a few branches to avoid performance traps.

            3. 1

              I wish the author had run longer benchmarks than just a few milliseconds. It’d be interesting to see whether those numbers hold up with n=1000 and over a million. Some of these differences may be insignificant over a large data set; others may be more pronounced. If you graphed them by n, you could also check whether the O() changes at all (it shouldn’t).
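              A sketch of what that check could look like (the workload here is a stand-in, not one of the article’s benchmarks):

              ```python
              import timeit

              def workload(n):
                  # Stand-in benchmark: sum of squares. Substitute the real workload here.
                  return sum(i * i for i in range(n))

              def time_at_sizes(sizes, number=10, repeats=5):
                  """Best-of-`repeats` wall time for running workload(n) `number` times."""
                  return {n: min(timeit.repeat(lambda n=n: workload(n),
                                               number=number, repeat=repeats))
                          for n in sizes}

              timings = time_at_sizes([1_000, 10_000, 100_000])
              ```

              If the ratio between two builds stays roughly constant as n grows, the difference is a constant factor and the asymptotic complexity hasn’t changed.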

              1. 1

                I don’t think that’s a valid complaint. There’s a clear trend across all the benchmarks he showed in the article.

                1. 1

                  These are some of the benchmarks the CPython developers are using to track CPython performance, so I assume they’re decently meaningful: https://github.com/python/pyperformance

                  Also, the “shared library makes things slow” and “-fno-semantic-interposition makes the shared library less slow” conclusions aren’t mine; they’re the result of other people’s benchmarking.

                  I also did some initial benchmarking with pystone; same results.
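                  For anyone who wants to reproduce this, the invocation is roughly as follows (commands from memory, and the benchmark names are just examples; check the pyperformance docs for current flags):

                  ```shell
                  python -m pip install pyperformance
                  # Run a couple of benchmarks against a specific interpreter build,
                  # writing results to a JSON file for later comparison.
                  python -m pyperformance run --python=/path/to/python3 \
                      -b nbody,json_loads -o results.json
                  ```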