1. 112

  2. 25

    Great post! The Python/Ruby/JS interpreter startup problem has been bugging me for more than a decade… To add some detail, there are a few separate problems there:

    1. CPU for interpreter initialization. This can be dozens of milliseconds as you say, but I think it’s really the smallest cost.
    2. I/O for imports. There is an algorithmic problem here: if you have M entries on PYTHONPATH and N imports, then the Python startup algorithm does O(M*N) stat() calls, times a constant.
    • stat() itself can be surprisingly slow because it can be random access I/O on spinning media.
    • there has historically been a mess of Python packaging tools, and many of them made PYTHONPATH very long.
    • most Python programs have a lot of imports
    • The constant is too large as well: Python does several stat() calls per module, for stuff like foo.py, foo.pyc, foo.so and even foo.pyo. IMO .pyo should just be removed because the cost just to find them exceeds any benefit they ever had (or maybe it already has been, I haven’t kept up)
    • Startup time is one of the only performance dimensions that is STILL worse in Python 3 than Python 2 (last I checked).
    3. Ignoring the time just to find the file, “import somepythonlib” often does a surprising amount of computation, and this happens before you get to your own main(), even if you don’t use the module at all in that program run. For example, importing parts of the stdlib will compile regexes and build namedtuples you don’t use.
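
    To illustrate the O(M*N) point, here is a rough sketch of the search pattern (not CPython’s actual loader, which has grown caches over the years):

```python
import os

def find_module(name, path_entries, suffixes=(".py", ".pyc", ".so")):
    """Rough sketch of the classic module search: every import probes
    every path entry for several suffixes, so M path entries times
    N imports means O(M*N) stat() calls, times the suffix constant."""
    for entry in path_entries:             # M entries on PYTHONPATH
        for suffix in suffixes:            # the per-module constant
            candidate = os.path.join(entry, name + suffix)
            if os.path.exists(candidate):  # one stat() each
                return candidate
    return None                            # ImportError in real life
```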

    If you import a lot of libraries in Python you can easily get to 300ms or 500 ms startup time. It feels like 50-100 ms is almost the minimum, which is kind of sad.
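
    If you want to measure this yourself, a quick sketch (the imported module names are just arbitrary examples):

```python
import subprocess
import sys
import time

def startup_time(args, runs=3):
    """Best-of-N wall-clock time to launch the interpreter with the
    given arguments; taking the minimum filters out scheduler noise."""
    best = float("inf")
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run([sys.executable] + list(args), check=True)
        best = min(best, time.perf_counter() - t0)
    return best

bare = startup_time(["-c", "pass"])                        # interpreter only
heavy = startup_time(["-c", "import json, re, argparse"])  # plus some imports
print(f"bare: {bare * 1000:.0f} ms, with imports: {heavy * 1000:.0f} ms")
```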

    You can fork a process in a millisecond (or maybe 10x less if it’s statically linked), so a Python process is 100x-1000x too slow to start up, while it’s more like 10x-100x slower than C at runtime (usually closer to 10x).

    I am thinking of reviving Oil’s “coprocess protocol” (post from 2 years ago) to solve this problem. I think I found a good way to do it on Unix only – with descriptor passing. Basically this is a way to keep processes warm WITH the constraint that you should not have to change your command line tool very much to use it. Random tools in any language should be patched easily to be (single threaded) coprocesses.
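
    For the curious, descriptor passing boils down to SCM_RIGHTS on a Unix socket, which Python 3.9+ exposes directly. A minimal sketch (my own illustration, not Oil’s actual protocol): a “client” hands its output descriptor to a warm “worker”, which then writes straight to it.

```python
import os
import socket

# A socket pair standing in for the client<->worker control channel.
client_sock, worker_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

r, w = os.pipe()                             # stand-in for the client's stdout
socket.send_fds(client_sock, [b"run"], [w])  # SCM_RIGHTS under the hood

# The "worker" receives a duplicate of the descriptor and writes to it.
msg, fds, _flags, _addr = socket.recv_fds(worker_sock, 16, 1)
os.write(fds[0], b"output from the warm process\n")
os.close(fds[0])
os.close(w)

out = os.read(r, 64)                         # the "client" sees the output
print(out.decode(), end="")
```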

    I had some discussions with people around multiplexing stdout/stderr over a single channel, and now I think that’s the wrong way to do it.

    Great info about Zstandard too! I’d be interested in the follow-ups – there is a lot of great information here, and it easily could have been 5 posts.

    1. 9

      I’ve thought a lot about this problem space from the Python side. I (the author of the post) am also the maintainer of PyOxidizer, which has dabbled in improving efficiency of Python imports and startup using alternate module loading techniques. Read https://gregoryszorc.com/blog/2020/05/10/using-rust-to-power-python-importing-with-oxidized_importer/ for example. And see some of the ideas captured at https://pyoxidizer.readthedocs.io/en/stable/status.html for improving various aspects of Python performance.

      1. 5

        Yes I remember chatting about that last year or a few years ago! I was in the middle of patching CPython for my own purposes, and really saw how bad the issue is. Oil (https://oilshell.org) is currently delivered as a Python app bundle, but that’s going away in favor of native code (for performance reasons that are NOT startup time).

        I think you would have to break Python semantics to get more than a mild speedup in Python. And the best time to do that was Python 3, which has long since passed :-/

        But I’m still interested in the startup time issue from the shell side now. Python is far from the worst offender… All the JITs, and Ruby, are considerably worse. Perl/Awk are better. It would be nice to solve them “all at once” with coprocesses.

      2. 3

        The worst import time I’ve come across:

        pip3 install --user colour_demosaicing
        env -i time python3 -c "from colour_demosaicing import demosaicing_CFA_Bayer_bilinear"

        Takes ~2s (warm) on a modern laptop.

      3. 19

        What this means is that by default, binaries provided by many Linux distributions won’t contain instructions from modern Instruction Set Architectures (ISAs). No SSE4. No AVX. No AVX2. And more. (Well, technically binaries can contain newer instructions. But they likely won’t be in default code paths and there will likely be run-time dispatching code to opt into using them.)

        Yup, this was why I pushed so hard to get CPU feature detection and the ability to compile functions with specific target features in Rust. Before this, I was distributing ripgrep binaries compiled with SSSE3 enabled. In case you don’t know, SSSE3 appeared on x86_64 CPUs as early as 2006. But I still got at least several bug reports from folks who tried to run these binaries and got SIGILL. (At least some of them appeared to be in a VM of some sort with limited CPU feature support.) Once Rust got runtime feature detection, it was great to switch over to that. Not only can I publish portable binaries, but it also means that the binaries published by Linux distros (and Homebrew) all have this stuff enabled too. By default. Without the person compiling the program needing to know to set -mtune or what have you.

        C programs do this too. For example, if you call memchr when using glibc, it will do the same kind of CPUID detection and dispatch to SSE2 or AVX2 or whatever implementation. So the use of more recent ISAs may be a bit more prevalent than you might think because of this. And this sort of dynamic CPU feature detection is typically used specifically in code that tends to be hot. So a lot of us are probably already benefiting quite a bit.

        But, I do agree with the OP’s bigger point. The extent to which we are missing out on additional optimizations is indeed unclear. Although I will say that in my own unscientific experiments, it’s pretty rare for me to compile with -march=native and see a noticeable improvement. (My hypothesis being that the performance sensitive parts have already been written to take advantage of the new ISA extensions via dynamic CPU feature detection.)

        1. 2

          Not only can I publish portable binaries

          What does portable mean here? Conditional compilation applies during, well, compilation; my understanding is that users must recompile their programs but they don’t have to hunt compiler feature flags to enable stuff working on their machines, right?

          1. 5

            It means the binaries I publish will work on x86 cpus without AVX2. But when AVX2 is enabled, it will use different routines that make use of 256 bit vectors. No recompilation or conditional compilation needed. It just works. :)

            1. 3

              I see. Isn’t this conditional jump expensive? How much would performance differ between a binary with dynamic detection and a binary with hardcoded AVX2 (or other) instructions?

              1. 7
                1. The programmer controls where the jump is, so you can put it at a point where it isn’t so expensive relative to the work.
                2. The check can be cached so that you aren’t re-running the cpuid instruction.
                3. You can run the check once the first time you call the function and then stuff the actual AVX routine into a function pointer. This prevents inlining, but you couldn’t inline an AVX routine into a program compiled with the base x86_64 ISA anyway. Example: https://github.com/BurntSushi/rust-memchr/blob/427fdc384007d0a5b00b190e8313b3c8d3694a67/src/x86/mod.rs#L31

                glibc does this. Pretty much everyone who writes low level open source libraries that can benefit from vectorization does this too.

                Bottom line: yes, the branch can be expensive, but there is almost always a way to make it practically free. Last time I checked, glibc’s memchr took on the order of 1 nanosecond to run on empty input.

          2. 1

            Yeah but to be honest, I’d also like to use things like loop unrolling with AVX2 optimizations without having to manually implement the SSE #[cfg()]‘s (and is_XXX_feature_detected) myself. So your solution is nice for things where you know how and where to apply AVX2. But it sadly doesn’t help people like me who would probably still benefit from SIMD but don’t implement it manually. So hopefully at some point the compiler will add feature detection to generated SIMD instructions by itself, so we don’t have to decide on the LCD any more.

            1. 2

              Yes, it doesn’t cover all use cases. But it does hit some very big and important ones.

          3. 12

            Even if you perform all file open and write I/O on a single thread and use a background thread pool only for calling CloseHandle(), you can see a >3x speedup in time to write files.
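
            A sketch of that pattern (hypothetical Python translation; the article is about CloseHandle() on Windows, but the shape is the same):

```python
import concurrent.futures
import os
import tempfile

# Open and write on the main thread; hand the close (where AV scanning
# and cache flushes bite on Windows) to a small background pool.
closer = concurrent.futures.ThreadPoolExecutor(max_workers=4)
outdir = tempfile.mkdtemp()

def write_file(name, data):
    fd = os.open(os.path.join(outdir, name),
                 os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    os.write(fd, data)
    closer.submit(os.close, fd)   # the close happens off the hot path

for i in range(8):
    write_file(f"obj{i}.o", b"x" * 4096)
closer.shutdown(wait=True)        # join all pending closes
```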


            1. 6

              The CloseHandle() one was the biggest surprise for me, hunting down and finding Windows Defender / AV as the culprit must of been quite the pursuit. Interesting that the workaround can end up being faster than Linux!

              1. 7

                There’s also another issue there: CloseHandle() acquires exclusively a lock that is acquired shared when writing data to disk. This means that even without a virus scanner, it’s possible to open a file, write to the cache (at memory speed), then close the handle, and the close will wait for that data to get written to disk. It’s a race condition based on when the system decides to write the cache, but it’s not that unlikely when running a build that writes a lot of data into the cache.

                Edit: That said, the solution of processing CloseHandle() in the background has big caveats. A lot of work in CloseHandle is required to be synchronous, to do things like deleting the file if needed, releasing sharing modes to allow future opens, releasing byte range locks, indicate to other services that the file has changed, etc. Doing this asynchronously may or may not be acceptable - if the app is never going to try to open another handle to the file and is not deleting it, it doesn’t matter, but it’s not safe to generically defer this work without risking introducing sharing violations.

                Edit #2: As far as hunting it down, xperf is … well, I can’t say xperf is great, but it can capture a lot of data. A flame graph in wpa takes a while to render, but it highlights things like APIs that take more time than you expect very easily. The harder part of performance work is wait chain analysis where something is waiting for something that’s waiting for something (etc) with all kinds of different synchronization primitives and you want to know why it’s not moving.

                1. 3

                  While I didn’t realize it at the time, the cause for this was/is Windows Defender.

                  These are my favorite, where you know that something is a problem, but never get to figure out why. And then you find the why some months later, and go “oh holy shit!”

                  1. -1

                    hunting down and finding Windows Defender / AV as the culprit must of been quite the pursuit

                    Do you mean must have been?

                2. 7

                  It’s certainly true that using a simple memcmp that doesn’t take advantage of the architecture can make a big difference.

                  This test is a good torture test of memcmp. Built with gcc -O2 -DTEST_ALL on my machine, it takes about 68 seconds to run. Forcing it to use this implementation of memcmp (and allowing for inlining) results in the test taking about 89 seconds. If you disallow inlining, it takes about 96 seconds.

                  memcmp and friends, when using architecture-specific instructions, can be very fast. It is very much worth your time to profile them for your use case because they may make a simpler algorithm with a worse time complexity faster than a more complicated algorithm with a better time complexity, mainly because “linear” may not be as linear as it looks.
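
                  To make the gap concrete in a portable way (Python used only for illustration: bytes equality bottoms out in libc’s vectorized memcmp, while the loop below compares one byte at a time):

```python
import time

a = b"x" * 1_000_000
b2 = b"x" * 999_999 + b"y"   # differs only in the last byte

def naive_equal(x, y):
    """Byte-at-a-time comparison, the opposite of a vectorized memcmp."""
    if len(x) != len(y):
        return False
    for i in range(len(x)):
        if x[i] != y[i]:
            return False
    return True

t0 = time.perf_counter(); a == b2;            t_memcmp = time.perf_counter() - t0
t0 = time.perf_counter(); naive_equal(a, b2); t_naive = time.perf_counter() - t0
print(f"memcmp-backed: {t_memcmp * 1e6:.0f} us, byte loop: {t_naive * 1e6:.0f} us")
```

                  (Much of the gap here is interpreter overhead, of course, but the vectorized path is also comparing 16-32 bytes per instruction.)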

                  1. 6

                    When people say databases and other stores should be isolated to their own volumes/filesystems, fsync()’s wonky behavior is a big reason why.

                    Does anyone know if ZFS is better about this?

                    1. 10

                      Yes and no. ZFS uses an intent log to speed up synchronous writes: data is written to the intent log and then rewritten out to its final destination. This means that it’s safe in case of a crash as soon as you receive an acknowledgement from the log write. In a burst-write-heavy high-performance environment, you can use a separate log device, so that you’re writing transactions to a high-performance SSD and more slowly flushing them to other storage. You can often get away with less reliability here because you only lose data if the log device dies and the computer crashes at the same time.

                      The other thing that you can do with ZFS is set ‘sync=disabled’ on a given dataset. This means that operations such as fsync will return immediately, but the data isn’t actually on the disk. I do all of my builds from a dataset mounted in ~/Build with this property set. If the machine crashes, I can lose data in this filesystem, but I don’t care because it’s just build state and I can always recreate it from the source trees.

                      1. 8

                        Afaik no; the problem with fsync is that it’s mis-specified. Dan Luu has a good article and talk about it somewhere. fsync should have been ffence.

                        1. 2

                          Hey, I just noticed you left MS (congrats?)

                          I hadn’t heard of a file system that flushes data for unrelated files in response to a file flush before. But the problem we have in file systems is to capture all of the metadata that’s required for a specific data flush to make sense. To take a simple example, imagine that a write extended the file, followed by a flush. In order to ensure the data is reachable, that implies writing the data as well as whatever describes the file extension.

                          In practice, this gets coarser, because the system may not know what metadata requires capturing. Perhaps you opened the file and wrote to it, then flushed. Does that mean the file timestamp needs to be updated? If you created the file and wrote to it, does the file metadata or directory entry need to be written? The nastiest case of this in NTFS is valid data, where even if you allocate and extend the file and flush it (so the metadata is durable), any write is going to update the range of data which is valid on disk, so a subsequent flush still needs to write metadata.

                          And for it to be durable, it needs to flush the disk hardware cache, and the hardware has no idea about files: it just has recently written data that hasn’t been hardened yet and needs to be written out immediately.

                          There are optimizations that can (and have) been built. fdatasync() is the most obvious. On Windows, there’s a similar thing although it’s very difficult to use correctly because it only flushes data (never metadata) so the caller really needs to understand how the metadata in the file system works to ensure that it is durable when needed.
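
                          On the POSIX side, the data-only variant is a two-line difference (a sketch; exactly which metadata fdatasync() skips depends on the filesystem):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"payload")
    os.fdatasync(fd)  # flush data, plus only metadata needed to read it back
    os.fsync(fd)      # flush data plus all metadata (timestamps, etc.)
finally:
    os.close(fd)
    os.unlink(path)
```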

                          1. 5

                            Great post overall. I have a few comments on the section about compression.

                            The comparison of the ratio between network bandwidth and CPU speed is too simplistic since it doesn’t account for IPC and multi-threading (as stated by the author, but their effects are dramatic), RAM available, and costs (which is probably where the difference matters most). This doesn’t invalidate the idea behind it, however.

                            I have a couple of remarks about speeds, especially LZMA’s decompression speed, which the author calls extremely slow. While it is slower than zstd, it’s still not that slow. On a given benchmark [1] the ratio is around 5, which is less than an order of magnitude, and it’s only twice as slow as deflate. The difference is not enough to always rule out one in favor of the other. This wouldn’t be the case for bzip2, whose decompression is roughly as slow as its compression (only about twice as fast), while LZMA’s decompression is much faster than its compression (roughly ten times).
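
                            The stdlib makes this easy to sanity-check (zstd isn’t in the stdlib, so this compares zlib and LZMA only; absolute numbers are machine- and corpus-dependent):

```python
import lzma
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 25_000  # ~1.1 MB

for name, compress, decompress in [
    ("zlib", zlib.compress, zlib.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]:
    blob = compress(data)
    t0 = time.perf_counter()
    out = decompress(blob)
    elapsed = time.perf_counter() - t0
    assert out == data   # round-trip sanity check
    print(f"{name}: ratio {len(data) / len(blob):.0f}x, "
          f"decompression {len(data) / elapsed / 1e6:.0f} MB/s")
```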

                            By the way, there’s a “zlib-ng” project which advertises much faster compression and decompression: https://github.com/zlib-ng/zlib-ng (see https://github.com/zlib-ng/zlib-ng/discussions/871 for performance benchmarks).

                            [1] https://mcmilk.de/projects/7-Zip-zstd/dl/decomp-v120.png

                            edit: and a couple words about ISAs: it’s possible to have run-time detection of CPU features; this has even been rolled into glibc.

                            1. 3

                              Your call out on the cycles per byte is fair: I cherry picked the numbers and methodology to reinforce my point, taking care to qualify it because I knew it wasn’t that fair. If I ever write a dedicated follow-up post on zstd performance, my methodology will be much more rigorous :)

                              You are correct that lzma is often less than an order of magnitude slower. However, the relative slowness is often enough to make it the bottleneck. My general belief is that if you are CPU constrained by decompression, there are better options. And with LZMA’s decompression output speed peaking at ~100 MB/s, you often are CPU constrained by LZMA. Not always, but more often than zstd with its typical 1,000 MB/s decompression output. Depending on the line speed of the software you are feeding decompressed data to, this can make a big difference! Keep in mind that on single threaded pipelines, every cycle spent in the compression library is a cycle not spent in your application logic. My general belief is unless you are optimizing for low memory use or minimizing the number of compressed bytes, why take a chance. That’s why I think zstd is a better default for most cases.

                            2. 3

                              At my last job, we wrote a Python script that was intended to be called frequently. All it did was call out to an external API and print a result on stdout. Dumb dead simple code. It took half a second to run. Given the situations in which it was intended to be used, half a second of running time was totally unacceptable.

                              I did some profiling, and most of the wall-clock time was spent at startup, in one statement:

                              import pkg_resources

                              That wasn’t even in our code. It came from one of our dependencies. After doing some research, I discovered that the use of pkg_resources was completely optional. The importing library even had a conditional block so that it would work if pkg_resources was not found.

                              I wrote a custom importer to use with this script. It raised ImportError when something attempted an import of pkg_resources. It shaved well over a third of a second off of the average wall clock time for the program, bringing it back within the bounds of acceptability.
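
                              The original importer isn’t shown, but the same trick can be done with a meta path finder along these lines (a sketch; the class name is made up):

```python
import importlib.abc
import sys

class BlockImport(importlib.abc.MetaPathFinder):
    """Refuse to find a module, so code treats it as not installed."""

    def __init__(self, blocked):
        self.blocked = blocked

    def find_spec(self, fullname, path=None, target=None):
        if fullname == self.blocked or fullname.startswith(self.blocked + "."):
            raise ImportError(f"{fullname} is blocked to speed up startup")
        return None  # defer to the normal finders for everything else

# Install it before anything gets a chance to import pkg_resources.
sys.meta_path.insert(0, BlockImport("pkg_resources"))
```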

                              1. 3
                                1. 3

                                  One simple (simplistic?) measurement of pure startup time for interpreters: https://gist.github.com//catwell/8189169.