1. 48
  1.  

    1. 12

      Interesting tidbits from following links:

      • There is a regression in clang 19 (in fact in LLVM) that made the Python interpreter about 10% slower than with clang 18 on some systems, and the 15% performance improvement reported for the new tail-call interpreter comes mostly from working around this regression. Once the LLVM regression is fixed, the non-tail-call Python interpreter’s speed returns to normal, and the tail-call version only improves performance by about 5%. (These numbers are ballpark; they depend on the benchmark, the architecture, and so on.)

      • The clang 19 bug was found and reported in August 2024 by Mikulas Patocka, who develops a hobby programming language with an interpreter. In the issue thread, several other hobbyist language implementors chime in, well before the impact on Python (and basically all other interpreted languages) is identified. This is another example of how hobbyist language implementors help us move forward.

      • The LLVM regression was caused by another bugfix, one that solved a quadratic blowup in compile time for some programs using computed gotos. This is an example of how tricky compiler development can be.

      • The LLVM regression was fixed by DianQK, who contributes to both LLVM and rustc, by reading the GCC source code, which had already identified this potential regression and explicitly points out that one has to be careful with computed gotos as used in threaded bytecode interpreters (a minimal sketch of that pattern follows below). This is an example of how GCC/LLVM cross-fertilization remains relevant to compiler development and the broader open source ecosystem, even to this day.
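      To make the pattern concrete, here is a minimal sketch (with hypothetical opcodes, not CPython’s) of the computed-goto dispatch used in threaded bytecode interpreters; the && and goto * syntax is the GCC/clang “labels as values” extension that the fix and the regression revolved around:

        /* Hypothetical three-opcode machine, for illustration only. */
        enum { OP_INC, OP_DEC, OP_HALT };

        static int run(const unsigned char *code) {
            /* One table entry per opcode; each handler ends with its own
               indirect jump instead of returning to a central switch. */
            static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
            int acc = 0;
        #define DISPATCH() goto *dispatch[*code++]
            DISPATCH();
        op_inc:
            acc++;
            DISPATCH();
        op_dec:
            acc--;
            DISPATCH();
        op_halt:
            return acc;
        #undef DISPATCH
        }

        int main(void) {
            static const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
            return run(prog) == 1 ? 0 : 1; /* inc, inc, dec -> 1 */
        }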

      1. 5

        This blog post has far more plot twists than I expected!

        1. 2

          the alternative (without this work) probably really was 10-15% slower, for builds using clang-19 or newer. For instance, Simon Willison reproduced the 10% speedup “in the wild,” as compared to Python 3.13, using builds from python-build-standalone.

          Hm, though as far as I know, almost 100% of Linux distros use GCC … so, say, on Debian and Red Hat, Python 3.14 will only give a 2% improvement?

          It seems like that will be noticed widely when those distros update … e.g. if people are expecting 10-15% improvement with a new Python version, but only get 2% :-(

          (and btw this is a great investigation, and writeup!)

          1. 2

            these impressive performance gains turned out to be primarily due to inadvertently working around a regression in LLVM 19. When benchmarked against a better baseline (such as GCC, clang-18, or LLVM 19 with certain tuning flags), the performance gain drops to 1-5% or so, depending on the exact setup.

            5% for free isn’t bad.

            1.  

              Free only for users, not for devs. The code is now more complicated.

              1.  

                It’s free for Python devs

            2. 2

              Semi-relatedly: despite Rust not having any explicit support for tail calls, I recently discovered that it (because of LLVM) will generate tail calls if the call is in the right position. Note the “jmp” here: https://godbolt.org/z/f6P41qqv5 . I have been experimenting with making use of this in my emulator, for the same performance reasons Python wants to use it.

              Unfortunately, these tail calls do not carry over when Rust generates wasm. C++ to wasm appears to generate tail calls if you pass -mtail-call, but I couldn’t figure out whether there is a Rust way to trigger it. Here’s at least C++ wasm tail calls: https://godbolt.org/z/Ebc81vnGa (the “return_call_indirect” op there).

              (PS: Rust does have some ideas/plans around a “become” keyword eventually but there’s nothing usable yet, bummer.)

              1. 5

                C/C++ don’t explicitly support it either, but it’s a ubiquitous optimization. Clang did add a nonstandard “must tail call” attribute (musttail) to force a tail call regardless of optimization level, which is useful when developing CPS code so your debug builds don’t blow the stack!
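                A minimal sketch of musttail in the tail-call-interpreter style (the opcodes and handler names here are made up, not CPython’s): each handler ends by tail-calling the next handler through a dispatch table, and the attribute guarantees the call compiles to a jump, so the stack stays flat even at -O0:

                  /* Hypothetical opcodes/handlers for illustration only. */
                  typedef int (*handler_t)(const unsigned char *code, int acc);

                  static int op_inc(const unsigned char *code, int acc);
                  static int op_halt(const unsigned char *code, int acc);

                  /* Dispatch table indexed by opcode: 0 = inc, 1 = halt. */
                  static const handler_t dispatch[] = { op_inc, op_halt };

                  static int op_inc(const unsigned char *code, int acc) {
                      acc++;
                      /* Forced tail call: a jump, not a call, at any -O level. */
                      __attribute__((musttail)) return dispatch[*code](code + 1, acc);
                  }

                  static int op_halt(const unsigned char *code, int acc) {
                      (void)code;
                      return acc;
                  }

                  int main(void) {
                      static const unsigned char prog[] = { 0, 0, 1 }; /* inc, inc, halt */
                      return dispatch[prog[0]](prog + 1, 0) == 2 ? 0 : 1;
                  }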

              2. 2

                When the tail-call interpreter was announced, I was surprised and impressed by the performance improvements, but also confused: I’m not an expert, but I’m passingly-familiar with modern CPU hardware, compilers, and interpreter design, and I couldn’t explain why this change would be so effective.

                Humble guy.