1. 33
    1. 35

      TL;DR I used my Scalene profiler and some math to make your example program 200x faster.

      Hi Martin - I am quite interested in Python performance so naturally I read your article, which I found interesting. I wondered if I would be able to get more useful information out of Scalene (https://github.com/emeryberger/scalene/, pip install scalene), a profiler I wrote.

I developed Scalene to be a lot more useful than cProfile (or, to be honest, pretty much any other profiler): it provides line-level information, splits out Python and native time (since we mostly care about time spent in Python), and profiles memory allocation and even copying costs, all at line granularity.

      Anyway, here’s the result of running Scalene (just for CPU profiling here) on your example code. It really cuts to the chase.

      % scalene test-martinheinz.py
      test-martinheinz.py: % of CPU time = 100.00% out of  12.62s.
           |    CPU % |    CPU % |   
      Line | (Python) | (native) |    [test-martinheinz.py]

         1 |          |          | # slow_program.py
         2 |          |          | from decimal import *
         3 |          |          | 
         4 |          |          | def exp(x):
         5 |          |          |     getcontext().prec += 2
         6 |          |          |     i, lasts, s, fact, num = 0, 0, 1, 1, 1
         7 |    0.08% |    0.04% |     while s != lasts:
         8 |          |          |         lasts = s
         9 |          |          |         i += 1
        10 |          |          |         fact *= i
        11 |          |          |         num *= x
        12 |   85.84% |   14.05% |         s += num / fact
        13 |          |          |     getcontext().prec -= 2
        14 |          |          |     return +s
        15 |          |          | 
        16 |          |          | exp(Decimal(150))
        17 |          |          | exp(Decimal(400))
        18 |          |          | exp(Decimal(3000))

      You can see that practically all the execution time is spent computing the ratio between num and fact, so really this is the only place to focus any optimization efforts.

      That line is dividing two Decimals, and those numbers are getting really large really fast, so it’s no surprise it’s expensive. Let’s see what we can do to make those numbers smaller.

My observation is that one can compute num / fact by updating a variable on each loop iteration, involving a computation on drastically smaller numbers. To do this, I add a new variable nf which will hold num / fact. Then, on each loop iteration, the program updates nf by multiplying it by x / i. You can verify that this maintains the invariant nf == num / fact from the following chain (where the _new suffix denotes a variable's updated value in each iteration):

      nf == num / fact          # true by induction
      nf_new == nf * (x / i)    # we multiply by x/i each time
      nf_new == (num / fact) * (x / i)  # definition of nf
      nf_new == (num * x) / (fact * i)  # re-arranging
      nf_new == num_new / fact_new      # reflecting the computations on line 10,11
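The invariant is also easy to check mechanically. Here is a quick sketch of mine (not from the original comment) that runs the loop with exact rational arithmetic, so no rounding can hide a mistake:

```python
from fractions import Fraction

# Check the invariant nf == num / fact across loop iterations,
# using exact rationals so the equality test is exact; x = 150
# matches the first call in the example program.
x = Fraction(150)
i, fact, num = 0, Fraction(1), Fraction(1)
nf = Fraction(1)  # intended to track num / fact

for _ in range(50):
    i += 1
    fact *= i
    num *= x
    nf *= x / i              # the incremental update
    assert nf == num / fact  # invariant holds exactly

print("invariant verified for 50 iterations")
```

With Decimal instead of Fraction the division rounds at each step, so the invariant holds only approximately, which is exactly why it is so much cheaper.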

      Incorporating this into your original program required changing three lines of code, all of which are followed by ###:

      def exp(x):
        getcontext().prec += 2
        i, lasts, s, fact, num = 0, 0, 1, 1, 1
        nf = Decimal(1) ### = num / fact
        while s != lasts:
            lasts = s
            i += 1
            fact *= i
            num *= x
            nf *= (x / i) ### update nf to be num / fact
            s += nf ### was: s += num / fact
        getcontext().prec -= 2
        return +s

      The result of this change is dramatic. I changed both programs to print out their results:

      On my laptop, original version:

      % time python3 test-martinheinz.py
      python3 test-martinheinz.py  12.60s user 0.10s system 94% cpu 13.461 total

      The optimized version:

      % time python3 test-martinheinz-opt.py
      python3 test-martinheinz-opt.py  0.04s user 0.01s system 87% cpu 0.057 total

      More than a 200X speedup (236, to be precise).
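Speed aside, it's worth confirming that the rewrite still computes the same value. A quick sanity check of mine (using a close variant of the optimized code above that also drops the now-unused num and fact updates) against decimal's built-in exp():

```python
from decimal import Decimal, getcontext

def exp_opt(x):
    """Taylor series for e**x using the incremental-term trick."""
    getcontext().prec += 2
    i, lasts, s = 0, 0, 1
    nf = Decimal(1)  # invariant (up to rounding): nf == x**i / i!
    while s != lasts:
        lasts = s
        i += 1
        nf *= x / i  # replaces the huge num / fact division
        s += nf
    getcontext().prec -= 2
    return +s

# decimal ships a correctly rounded exp() to compare against
ref = Decimal(150).exp()
mine = exp_opt(Decimal(150))
assert abs(mine - ref) / ref < Decimal("1e-25")
```

The per-iteration rounding in nf does accumulate a little error, but working at prec + 2 absorbs it comfortably.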

      The moral of the story is that using a more detailed profiler like Scalene can really help optimization efforts by locating inefficiencies in an actionable way. I’d love to see a follow-up article reflecting this :).

      1. 4

        That was a tour-de-force.

      2. 2

        Thanks for the writeup. I was wondering why OP led with an example which never got optimized.

        1. 1

          Thanks! I also read the article waiting for the reveal, but since it never came…

    2. 11

On one hand, I liked the first part of your article, “Timing and Profiling”, which is informative.

But on the other hand, I think the advice you give makes Python code only marginally faster, with the exception of the first two points:

      • Using built-in datatypes shows that if you want something faster in Python, it should be coded in C under the hood.
      • Memoization is more about understanding your problem and solving it with an appropriate technique than about a Python-specific optimization.
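On the memoization point: it is indeed a general technique, but Python makes it nearly free to apply via the standard library. A minimal sketch of mine (not from the article):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember the result for every distinct argument
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Exponential-time recursion becomes linear once results are cached;
# without the decorator, fib(200) would effectively never finish.
print(fib(200))
```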

In the end, if you actually want a noticeable performance gain, you have to go for Python C-extensions (or Nim, Rust, you name it…) or port your code base (or part of it) to a programming language that can deliver faster code.

      I am not a “Python hater”, I pay my bills with it, but I can’t say Python is fast, because it is one of the slowest scripting languages out there. The only thing that can be said is that Python can be fast enough for your use case.

    3. 16

      Consider publishing your content outside Medium to make it easier for people to access it.

      1. 18

        If you don’t like Medium you can find all my articles on website, for example this one here: https://martinheinz.dev/blog/13.

        1. 28

          Yeah, generally we do prefer non-medium links. I’ll swap it in.

        2. 6

          I don’t have anything specific against Medium, but I physically cannot access more than 2 paragraphs due to their policies.

        3. 5

          If you don’t like Medium you can find all my articles on website, for example this one here: https://martinheinz.dev/blog/13

          It says:

          We’re sorry but frontend doesn’t work properly without JavaScript enabled. Please enable it to continue.

          And, after enabling first-party javascript:

          Invalid date undefined

        4. 2

          This link comes up completely blank in Safari on Mac. On Chrome, the browser attempts to connect to a web service on port 1234 and when I reject this, it also ends up blank. Would you consider running your website on one of the more common HTTP or HTTPS ports?

          1. 3

            Similarly with Firefox, which would not connect to port 1234 as “the resource is not secure.”

    4. 5

      Nice article. Might be worth also mentioning flame graphs in python (py-spy is a decent lib as an example, from what I recall… but there are others too), as another way to drill down into performance issues.

    5. 4

Micro-optimizations like “don’t do attribute lookups” are all fine and good, but they are workarounds for a super-slow runtime. PyPy would run this code much faster, and numba’s JIT would too. If you care about performance, plain CPython is not how you should be running your numeric code.

    6. 3

      One thing that surprised me was the “Don’t Access Attributes” example, where findall() is faster than re.findall(). The runtime could (easily?) cache this lookup if it’s in a tight loop (or tiny scopes in general).

Some bookkeeping would be required, i.e., if the object attribute changes, invalidate the cache. I don’t know how Python works under the hood, though, so this optimization might not be worth the extra code.
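For context, the difference is easy to measure. A rough sketch (my own; absolute numbers will vary by machine) mirroring the article's findall example:

```python
import re
import timeit

text = "spam eggs spam " * 100

def with_lookup():
    for _ in range(100):
        re.findall(r"spam", text)  # looks up `findall` on `re` every call

def with_alias():
    findall = re.findall           # one lookup, then a cheap local name
    for _ in range(100):
        findall(r"spam", text)

print("module lookup:", timeit.timeit(with_lookup, number=50))
print("local alias:  ", timeit.timeit(with_alias, number=50))
```

The alias tends to win because loading a local name is cheaper than a global load plus an attribute lookup on every iteration, which is precisely the lookup you're suggesting the runtime could cache.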

      1. 2

        At its core, CPython is a very simple stack-based virtual machine for Python bytecode. The main loop, for example, is literally just a switch statement with one case per bytecode instruction.

        There are some optimizations in there – the compiler can do some constant folding, the bytecode interpreter knows some pairs of instructions are likely to occur together and can handle them in ways that make your processor’s branch predictor happy, and so on – but not that many. As far as I’m aware, its simplicity is a deliberate choice.

There are alternative implementations of Python (like PyPy) which do much more advanced things and can offer corresponding performance gains. But generally, people who choose Python are trading CPU time for developer time, and if they hit situations where they absolutely must get code to run faster, they’ll reach for small, targeted optimization tricks where available (like aliasing a nonlocal name that’s accessed repeatedly in a loop, or otherwise rewriting the code not to touch a nonlocal inside a loop), or just rewrite performance-critical sections in C.

        If you want more detail, I gave an intro talk at PyCon a couple years ago (video, slides since the A/V setup had to use a degraded version). The slide deck has pointers to some more in-depth materials.

        1. 1

          Thank you for providing extra info. I’ve added your video to my queue/playlist.

    7. 3

      No numbers though? :(

    8. 2

      Nice post, concise and useful. Been using Python for a while, found both good reminders and a few points I had not considered before. Thanks.

    9. 2

      I wonder how fast this can be made in Julia.