1. 15
  1. 12

    “And either the Rust standard library or possibly the Rust compiler–I’m not sure which–are smart enough to use a (slightly different) fixed-time calculation.”

    That’s LLVM. It actually is surprisingly smart. If you have a sum for i from a to b, where the summation term is a polynomial of degree n, there exists a closed-form expression for the summation, which is a polynomial of degree n+1. LLVM can figure this out, so sums of squares, cubes, etc. get a nice O(1) formula.
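
    To make the closed form concrete, here is a minimal Python sketch of the identity for a sum of squares. It only illustrates the math (a degree-2 term summing to a degree-3 polynomial), not how LLVM derives it, and the function names are made up for illustration.

    ```python
    def sum_squares_loop(a: int, b: int) -> int:
        """O(b - a) version: add i*i for every i in [a, b]."""
        total = 0
        for i in range(a, b + 1):
            total += i * i
        return total


    def prefix(n: int) -> int:
        """Closed form for 0^2 + 1^2 + ... + n^2, a cubic polynomial in n."""
        return n * (n + 1) * (2 * n + 1) // 6


    def sum_squares_closed(a: int, b: int) -> int:
        """O(1) version: the difference of two prefix sums."""
        if b < a:
            return 0  # empty range, same as the loop
        return prefix(b) - prefix(a - 1)
    ```

    Both functions return the same value for any integer range; only the second one does a constant amount of work.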

    Another good story with the same theme is https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation. “You can always rewrite hot spots in a faster, lower level language” isn’t generally true if you combine two arbitrary languages.

    1. 1

      Thanks for the explanation, I’ll update the article.

      1. 1

        A fairer comparison would thus mean using LLVM on the Python code too (see: Numba). Given the example domain, I’d further be interested to know the speed difference between Rust on the CPU to Python on the GPU.

        1. 1

          These are all toy examples; the point was never that Rust is faster. As I mention, Rust can actually be slower than Cython. The updated article points out that the default compiler on macOS is clang, so you might get that optimization with Cython too.

    2. 4

      This is a good subject for an article, and looks like good info, but the opening sentence makes me think it’s going to be bad. There’s some truth there but it lacks a lot of subtlety – slow and fast are relative to the application, etc. IMO a better opening sentence would be something like: “I hit this performance wall in my Python code, tried to speed it up with C, and was surprised”. What happened here? Why is this slow? etc.

      FWIW the way I think of it is that Python is 10-50x slower than native code, usually closer to 10x. So it’s about 1 order of magnitude. You have around 9 orders of magnitude to play on a machine; a distributed system might make that 11 to 14. Lots of software is 1 to 3 orders of magnitude too slow, regardless of language.

      Also, the most performance intensive Python app I ended up optimizing in C++ was mainly for reasons of memory, not speed. You get about an order of magnitude increase there too.

      1. 10

        Your estimate of 10x seems WAY too low in my experience. It obviously depends on the use case: I/O-bound programs will be much closer because you’re not actually waiting for the language, but number-crunching code tends to be closer to 100x than 10x.

        I just did a quick test, a loop with 100000000 function calls in C and Python. The C loop ran in 0.2 seconds; the Python program in 17.2 seconds. That’s an 86x difference. (Yes, the C code was calling a function from another TU, using a call instruction. The compiler didn’t optimize away the loop.)

        I also implemented the naive recursive Fibonacci function in both C and Python. The C version calculated fib(40) in 0.3 seconds. The Python version calculated fib(40) in 42.5 seconds. That’s a 142x difference.

        I implemented the function to do 1 + 2 + 3 + ... + n (basically factorial but for addition) in C and Python, using the obvious iterative method. C did the numbers up to 1000000000 in 0.41 seconds. Python did it in one minute 38 seconds. That’s a 245x difference.
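
        For reference, the Python side of those three microbenchmarks might look roughly like the sketch below. The function names and the reduced input sizes are assumptions chosen so the script finishes quickly; this is not the original test code, and the timings it prints will differ from the figures quoted above.

        ```python
        import timeit


        def noop(x: int) -> int:
            """Trivial function, so the loop below mostly measures call overhead."""
            return x


        def call_loop(n: int) -> None:
            """Call a function n times."""
            for i in range(n):
                noop(i)


        def fib(n: int) -> int:
            """Naive recursive Fibonacci (exponential number of calls)."""
            return n if n < 2 else fib(n - 1) + fib(n - 2)


        def sum_to(n: int) -> int:
            """1 + 2 + ... + n using the obvious iterative method."""
            total = 0
            for i in range(1, n + 1):
                total += i
            return total


        if __name__ == "__main__":
            # Sizes are far smaller than in the comment above, to keep the run short.
            cases = [
                ("call_loop", call_loop, 1_000_000),
                ("fib", fib, 25),
                ("sum_to", sum_to, 1_000_000),
            ]
            for label, fn, arg in cases:
                seconds = timeit.timeit(lambda: fn(arg), number=1)
                print(f"{label}({arg}): {seconds:.3f} s")
        ```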

        Don’t get me wrong, Python is a fine language. I use it a lot. It’s fast enough for most things, and tools like numpy let you do number crunching fairly quickly in Python (by doing the number crunching in C instead of in Python). But Python is ABSOLUTELY a slow language, and depending on what you’re doing, rewriting your code from Python to C, C++ or Rust is likely to make your code hundreds of times faster. I have personally experienced, many times, that my Python code is analyzing some large dataset in hours while C++ would’ve done it in seconds or minutes.

        “You have around 9 orders of magnitude to play on a machine”

        This is very often false. Games often spend many milliseconds on physics simulation; one order of magnitude is the difference between 60 FPS and 6 FPS. Non-game physics simulations can often take minutes; you don’t want to slow that down by a couple of orders of magnitude. Analyzing giant datasets can take hours; you really don’t want to slow that down by a few orders of magnitude.

        1. 5

          Sure, I said 10 - 50x, but you can say 10 - 100x or 10 - 200x if you want. I measured exactly the fib use case at over 100x here:


          Those are microbenchmarks though. IME 10-50x puts you more in the “centroid” of the curve. You can come up with examples on the other side too.

          I’d say it’s closer to 10x for the use cases that people actually use Python for. People don’t use it to write 60 fps games, because it is too slow for that in general.

          But this is all beside the point… If the post had included the subtleties that you replied with, then I wouldn’t quibble. My point is that making blanket statements without subtlety distracts from the main point of the article, which is good.

          1. 5

            But my point is that the subtleties aren’t required, because (C)Python just is a slow language. It doesn’t have to be qualified. Its math operations are slow, its function calls are slow, its control flow constructs are slow, its variable lookups are slow, it’s just slow by almost any metric compared to JITs and native code. If the article had started with a value judgement, like “Python is too slow”, I would agree with you, but “Python is slow” seems extremely defensible as a blanket statement.

            1. 5

              Well I’d say it’s not a useful statement. OK let’s concede it’s slow for a minute – now what? Do I stop using it?

              A more helpful statement is to say what it’s slow relative to, and what it’s slow for.

              To flip it around, R is generally 10x slower than Python (it’s pretty easy to find some shockingly slow code in the wild; it can be optimized if you have the right knowledge). It’s still the best tool for many jobs, and I use it over Python, even though I know Python better. The language just has a better way of expressing certain problems.

          2. 4

            There’s actually another dimension of slowness that I see people often forget about when making comparisons like this. Due to the GIL, you’re essentially limited to a single core when your Python code is running, but a C/Rust/Go/Haskell program can use all the available cores within a single process. This means that in addition to the x10-x100 speed up you get from using those languages, you have another x10-x100 room for vertical scaling within a single process, for a combined x100-x10000. Of course, you can run multiple Python processes on the same hardware, or run them across a cluster of single core instances, but you’re not in a single process anymore. Which means it’s much harder to share memory and you have new inter-process architectural challenges which might limit what you can do.
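
            A minimal sketch of that single-core limit, assuming a standard GIL-enabled CPython build: the same pure-Python CPU-bound jobs are run serially and on a thread pool. The names `count_down`, `run_serial` and `run_threaded` are made up for illustration.

            ```python
            import time
            from concurrent.futures import ThreadPoolExecutor


            def count_down(n: int) -> int:
                """Pure-Python CPU-bound work; the running thread holds the GIL."""
                steps = 0
                while n > 0:
                    n -= 1
                    steps += 1
                return steps


            def run_serial(jobs: list[int]) -> list[int]:
                """Run every job one after another on the calling thread."""
                return [count_down(n) for n in jobs]


            def run_threaded(jobs: list[int], workers: int = 4) -> list[int]:
                """Run the jobs on a thread pool; results come back in input order."""
                with ThreadPoolExecutor(max_workers=workers) as pool:
                    return list(pool.map(count_down, jobs))


            if __name__ == "__main__":
                jobs = [2_000_000] * 4
                for label, runner in [("serial", run_serial), ("threads", run_threaded)]:
                    start = time.perf_counter()
                    runner(jobs)
                    print(f"{label}: {time.perf_counter() - start:.2f} s")
            ```

            With the GIL in place, the threaded run is not expected to be meaningfully faster than the serial one. Swapping in `concurrent.futures.ProcessPoolExecutor` is the usual way to use more cores, at the cost of the multi-process issues described above.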

            For example, if I write my web application backend in Haskell, I can expect to vertically scale it quite a lot and depending on the use case, I might even decide to stick with a single process model permanently, where I know that all the requests will arrive at the same process, so I can take advantage of the available memory for caching and I can keep ephemeral state local to the process, greatly simplifying my programming model. Single-process concurrency is much simpler than distributed concurrency after all. If I wrote that backend in Python, I would have to design for multi-process from the start, because SQLAlchemy would very soon bottleneck the GIL while generating my SQL queries…

        2. 2

          It feels a bit wrong to test a bunch of Cython code and say this is all endemic to C extensions. I’m not 100% fresh on all this but the serialization/deserialization problem exists if you do that in the first place! You could choose not to if you were interfacing with the raw CPython interface.

          I wonder what mypyc would give as a result here as well… probably worse but its codegen tries to rely heavily on branch prediction to make unwrapping cheap.

          1. 3

            Cython is a highly optimized use of the raw CPython interface. PyO3 uses the CPython interface too and it has much higher overhead since it’s not been optimized as much yet.

            And yes, you can just do math using Python integers without serialization/deserialization… and then it won’t be any faster than Python’s slow math (internally Python has to deserialize in order to do the underlying CPU addition instructions, and then reserialize into a Python integer, for example).

            You can’t get away from the cost given the way CPython is implemented. PyPy is a whole different thing.
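
            The serialize/deserialize cost described above comes from every Python integer being a full heap object. A small sketch, assuming CPython on a typical 64-bit platform, contrasts a list of integer objects with the standard library's `array`, which stores raw machine integers; the variable names are illustrative only.

            ```python
            import sys
            from array import array

            boxed = list(range(1000))          # list of references to int objects
            unboxed = array("q", range(1000))  # raw signed 64-bit integers, stored inline

            # Size of one int object (header plus digits) versus one raw array slot.
            bytes_per_int_object = sys.getsizeof(boxed[-1])
            bytes_per_array_slot = unboxed.itemsize

            # Reading an element out of the array has to create ("box") a Python int,
            # which is the conversion step the comment above refers to.
            value = unboxed[10]

            if __name__ == "__main__":
                print("int object:", bytes_per_int_object, "bytes")
                print("array slot:", bytes_per_array_slot, "bytes")
                print("type read from array:", type(value).__name__)
            ```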