1. 32
    1. 6

      A couple of additional approaches you can take:

      1. Do less unnecessary work. For example, len(l) is called len(l) times in the inner loop of the f2() version of the code, even though the length doesn’t change. In a compiled language, this could potentially be hoisted by the compiler, but in Python you have to do it manually (this year, at least) – see the sketch after this list.

      2. Tune your code to work better with the underlying systems. For example, you can tune your low-level compiled code to use instruction-level parallelism or SIMD, or to match the CPU memory hierarchy, and get significant speedups. (I’m working on a book about this aimed at Python developers doing data processing.) Optimizing for modern CPUs generically will eventually become obsolete, but that will take years, and it will help across many machines; you can also tune for very specific hardware at the cost of much more brittleness.
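
      To make 1. concrete, here’s a minimal sketch of hoisting an invariant len() call out of a loop (the function name and data are placeholders, not the article’s actual f2()):

        def dedupe_adjacent_slow(l):
            # Collapse runs of equal items; len(l) is re-evaluated on every pass.
            out, i = [], 0
            while i < len(l):            # len(l) called once per iteration
                if not out or out[-1] != l[i]:
                    out.append(l[i])
                i += 1
            return out

        def dedupe_adjacent_hoisted(l):
            # Same logic, but the loop-invariant length is computed once, up front.
            out, i, n = [], 0, len(l)    # hoisted manually; the length never changes
            while i < n:
                if not out or out[-1] != l[i]:
                    out.append(l[i])
                i += 1
            return out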

      1. 2

        Your 1. made me think about caching and precomputation (like lookup tables, or moving more work to compile time or startup time instead of runtime). Those are also typical optimization approaches. Often you can trade memory usage for time, and these are all examples of that.
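
        A tiny illustration of that trade (hypothetical function; the point is just spending startup-time memory to skip runtime work):

          import math

          # Recompute on every call: cheap to write, pays the cost at runtime.
          def brightness(level):                 # level in 0..255
              return int(255 * math.sqrt(level / 255))

          # Precompute once at startup: 256 ints of memory buys a plain list index later.
          BRIGHTNESS_TABLE = [int(255 * math.sqrt(v / 255)) for v in range(256)]

          def brightness_fast(level):
              return BRIGHTNESS_TABLE[level]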

      2. 5

        For me, “rewriting in a lower-level language” is also the least appealing kind of optimization, at least for more than, say, 1000 lines. The limiting factor seems to be the architecture – whether there are actually 1000 lines out of 100K that would yield a big speedup.

        As far as rewriting the whole codebase:

        1. It’s kind of boring and mechanical
        2. The success rate is pretty low (how many 100K line Python programs have been completely re-written in Rust? 500K line?)
        3. What you end up with is more code all around. Lower-level languages require more code, and more code is harder to understand. It’s also harder to optimize!

        In my experience, understanding is the biggest barrier to optimization. What I would really like is perf-sensitive code in a low level language, and the rest in higher-level languages, although this is hard in general. Crossing boundaries is a problem.


        One thing the article doesn’t emphasize is that in practice your hands are often tied by the size of the codebase. You often can’t “just” use a better data structure.

        For example everyone is complaining about LLVM compile times now. Well what would it take to speed it up by 2x?

        When you have 1M lines of code, I think the answer is often “there is nothing you can do”. The choice of many data structures (and thus algorithms) is fundamentally woven throughout the whole program. At its core, LLVM is roughly one big data structure (an SSA IR).

        What would it take to speed up the Rust compiler by 2x? (compared to today – I think it has already been sped up 2x through pretty heroic efforts)

        So at a certain point with every codebase, you’re left with 1% to 5% improvements, and fundamentally speeding it up may mean starting over again, which is extremely costly.


        So personally I want to express the system with the least amount of code, so it’s easier to understand, re-compile to lower level code, and possibly rewrite. I think modern software takes on too many dependencies, which basically is a way of “coding yourself into a trap”.

        So then another kind of optimization besides changing data structures, algorithms, lower-level languages, and approximation is:

        (5) Use a higher level abstraction / idea, and write a compiler / translator.

        I often say that in Oils we use regular languages for the lexer, expressed with the standard Python re module:

        https://www.oilshell.org/release/0.17.0/source-code.wwz/frontend/lexer_def.py

        But then I hacked up the sre_parse module to lower to a re2c lexer definition:

        https://www.oilshell.org/release/0.17.0/source-code.wwz/_build/tmp/frontend/match.re2c.txt

        And then re2c gives us pure C state machines at the end. Then the C++ compiler can inline these functions with the rest of the lexer:

        https://www.oilshell.org/release/0.17.0/source-code.wwz/_gen/frontend/match.re2c.h

        So we end up with 2 working instances of the same code – one is probably 10-100x faster than the other. (It’s hard to separate the lexer speed from parser speed. Combined with mycpp, the parser was 160x-200x faster before we added garbage collection, so now it’s probably 80x faster or something.)
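
        For readers who haven’t seen this style, the high-level side looks roughly like this (a generic sketch, not Oils’ actual lexer_def.py – the translator’s job is to turn the same patterns into a C state machine):

          import re

          # High-level lexer definition: (regex, token kind) pairs. A separate
          # translator can walk these same patterns (e.g. via sre_parse) and emit
          # an equivalent re2c/C definition; this sketch only shows the Python side.
          LEXER_DEF = [
              (r'[ \t]+',       'WS'),
              (r'#[^\n]*',      'COMMENT'),
              (r'[a-zA-Z_]\w*', 'NAME'),
              (r'[0-9]+',       'NUMBER'),
              (r'\n',           'NEWLINE'),
          ]

          COMPILED = [(re.compile(pat), kind) for pat, kind in LEXER_DEF]

          def next_token(s, pos):
              # Return (kind, end_pos) for the first pattern matching at pos.
              for regex, kind in COMPILED:
                  m = regex.match(s, pos)
                  if m and m.end() > pos:
                      return kind, m.end()
              return 'UNKNOWN', pos + 1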

        I mean I guess you could argue we had a “de-optimized” version first :) But again I’d argue that it’s difficult to optimize a program that has no notion of correctness. The first step is understanding. Speeding up bash by 2x is similarly “impossible”.


        Also I think the first kind of optimization is:

        (6) Remove “dumb mistakes” in the code.

        I’d say almost every codebase that’s larger than 100K lines has some dumb mistakes. The most common bugs IME are:

        • N+1 anti-patterns – e.g. making N round trips instead of 1, for SQL queries, opening network connections, network round trips, disk seeks, starting processes, and starting threads
        • “Accidentally quadratic” – somebody called a function that does a linear search in a loop, etc. (see the sketch below)
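
        A minimal sketch of the “accidentally quadratic” case and its fix (hypothetical names):

          def find_new_users_quadratic(all_users, known_users):
              # `u not in known_users` is a linear scan when known_users is a list,
              # so the whole thing is O(len(all_users) * len(known_users)).
              return [u for u in all_users if u not in known_users]

          def find_new_users_linear(all_users, known_users):
              # One up-front conversion to a set makes each membership test ~O(1).
              known = set(known_users)
              return [u for u in all_users if u not in known]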

        I’ve fixed both of those bugs dozens of times. I’d go as far as to say that if you haven’t fixed those bugs, they’re probably in your code and you don’t know about them. I certainly see the excessive round trips problem across every web app I use, pretty much every day.


        The corollary is actually measuring performance. That’s the way to catch the dumb mistakes.

        I’d claim that most open source and industrial codebases don’t have a benchmark suite run on every release. They don’t have a way of catching performance regressions.

        For example, my memory is that CPython only got some detailed perf measurements with Unladen Swallow, which was 15+ years into CPython’s existence. And then I think that suite died until Victor Stinner revived it. I’m not very close to CPython development so I’d appreciate any corrections. But I think many highly used open source projects are like that.

        1. 4

          I agree, but disagree on one point: I’ve yet to see instances where Python code is noticeably smaller than Rust code that does the same thing. I did have the pleasure of porting some Python code to Rust, and the resulting lines of code were more or less the same, give or take a few. Where Python code can often be denser because it doesn’t spell out types, Rust can often win with iterator-based one-liners replacing Python for-loops of small to medium complexity. And if you do text processing, with the regex crate I can just directly lift 90% of Python’s regexes without any change. Finally, you can remove a lot of defensive code and tests around type mismatches, because the compiler simply rules them out. So if anything, I’d expect a well-written Rust port of a well-written Python codebase north of 100k lines of code to be smaller, not larger.

          1. 1

            Definitely disagree - I mean I just did a port from Rust to typed Python – Python is definitely smaller:

            https://github.com/andychu/rsc-regexp/blob/master/py/README.md (look at the post2nfa() and match() functions)

            I’d definitely like to see some examples. I simply don’t believe it – the Python code must have had something wrong with it.

            Inherently, Rust code has more information about types and ownership, which makes it much faster – it more or less has to be longer.

            Iterators and list comprehensions / generator expressions seem to be a wash to me.

            Also look at David Beazley’s WASM engine in Python. Beazley writes extremely compact Python code. Or Norvig’s spell checker. Those aren’t idiomatic, but they will tell you how concise Python can be. Rust can never be that concise (and that isn’t the goal of the language).

            1. 2

              I simply don’t believe it

              That’s fair, I didn’t link to examples. Most I wish I could show you, but they’re not open, and the only open one I could find (the update_lints script from rust-clippy, which I initially wrote in Python) has now grown into its own tool with far more functionality, which also obviously means more lines of code.

              Still, you first say:

              the Python code must have had something wrong with it.

              and then go on to list examples, noting that

              Those aren’t idiomatic

              I was obviously talking about idiomatic production code, not golfing code or short off-the-cuff scripts. I don’t think that Rust is a good language for code golfing, but that’s missing the point. If you take idiomatic Python code for, say, some web service (what’s your favorite Python web framework these days?), and then compare it with a Rust web service (using e.g. axum), you will very likely end up with roughly the same number of lines of code. In some places Python will be more concise due to not using brackets and having more features, and in others Rust will save some time for having type information, thus needing fewer tests, and for having high-level crates that offer concise APIs.

              1. 1

                Sure, there is some Python that is bloated, but I’ve used Flask, and it’s very short. It’s actually shorter than what I used to write myself, because it doesn’t bother with dependency injection or anything; it’s more PHP style:

                https://github.com/oilshell/hashdiv/blob/master/hashdiv.py

                I haven’t seen axum, but I just really doubt it can be shorter than Python. Rust code just inherently has more information, to be faster, and it has to go somewhere.

                1. 1

                  Have a look at https://docs.rs/axum/latest/axum/, or if you want something more flask-like, maybe https://rocket.rs/ is a good choice for you.

                  Like with flask’s decorators, it uses proc macros to generate the “more information” behind the scenes.

          2. 1

            I’ve had a similar experience with parsers. I wrote an HTML parser in Lua+LPEG, and while I could live with the speed, I found the memory usage insane. I then rewrote the parser in C (using a peg library with some customizations around case-insensitive matching) and not only did the memory usage drop, but it was much faster. Another example (and this time I have some numbers)—I wrote an 8-bit CPU assembler in Lua+LPEG (1800 lines of code). It works, but it wasn’t a speed demon. I rewrote it in C (no peg library this time—just straight C, 2800 lines of code) and the difference is incredible. On one sample program with 13,000 forward references (it’s a two-pass assembler), the Lua version takes 8.6s to run, the C version 0.15s.

            But the article seems to center around performance optimizations. There are other axes one could optimize along. I’m doing some recreational 8-bit programming (thus the assembler) and the classic one there is a space-time tradeoff. I can make the code faster, but at the expense of memory usage (specific example—I’m writing a simple video game for a machine that has no hardware sprites, so it’s either save memory and spend time bit-shifting sprite data at runtime, or save time by having multiple copies of sprite data pre-shifted, which eats up memory like crazy).
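
            A toy sketch of that tradeoff (8-pixel-wide sprite rows, ignoring the carry into the neighbouring byte that real pre-shifting has to handle):

              SPRITE = [0b11110000, 0b10010000, 0b11110000]   # one byte per row

              # Pay with time: shift every row at draw time, every frame.
              def shifted_row(row, shift):
                  return (row >> shift) & 0xFF

              # Pay with memory: build all 8 pre-shifted copies once, index at draw time.
              PRESHIFTED = [[(row >> shift) & 0xFF for row in SPRITE] for shift in range(8)]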

            1. 1

              Yeah I recently wrote this “rant” about language processors needing AOT-compiled speed (C++, Rust, Go, …) - https://news.ycombinator.com/item?id=35045520

              That is based on my experience with Oils. I’m relatively happy with our “mycpp” solution, but I certainly wish something like that already existed!

              Strings and tree/graph-based workloads are indeed pretty bad for Python and Lua. String objects are just large and GC/ref counting is expensive.

              And I think the JITs do much less for those workloads than they do for the numeric kinds of workloads.


              We’re looking at putting small string optimization and pointer tagging into the mycpp runtime. For some reason I had a blind spot there, because I mainly read Python and Lua code, and neither of them has it.

              The optimization is to store strings of 7 bytes or less as “immediates” – in the pointer itself, with no indirection.
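
              A toy model of that encoding in Python, just to show the bit-layout idea (this is not mycpp’s actual representation):

                def pack_small_str(s):
                    # Low bit = 1 tags the word as an "immediate" small string; the next
                    # 3 bits hold the length; up to 7 bytes live in the upper 56 bits.
                    b = s.encode()
                    assert len(b) <= 7
                    word = 1 | (len(b) << 1)
                    for i, byte in enumerate(b):
                        word |= byte << (8 * (i + 1))
                    return word

                def unpack_small_str(word):
                    assert word & 1                   # must be an immediate, not a pointer
                    n = (word >> 1) & 0b111
                    return bytes((word >> (8 * (i + 1))) & 0xFF for i in range(n)).decode()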

              Lua doesn’t have pointer tagging or SSO because there’s no way to do it in pure ANSI C. And Python can’t do it because PyObject* is littered all over extension APIs.

              But I think it’s a very significant optimization that helps with both GC pressure and cache utilization. It should help parsing/string workloads A LOT in those types of languages.

              I also think it’s a bit more natural in C++ than C. Most people I talked to have heard about it through a CppCon talk ~10 years ago I think. I’d be interested to see any C codebases which have such a string type.

              1. 2

                Nit: some Python implementations have it, like MicroPython and Skybison ;) Just not CPython.

                1. 2

                  Yes very true! Thanks for pointing me in that direction

                  I did also hack on femtolisp, and it has pointer tagging. I find it a bit funny how different it looks in C-style vs. C++-style. I think it will be cool to have a code generator for tagged pointers

                  Related paper I posted on this topic: https://lobste.rs/s/suafzq/bit_stealing_made_legal_compilation_for

          3. 3

            I completely agree with the four kinds of optimisation, though the first and second are often intertwined. Great discussion on the difference between best case and common case on your data - that’s often critical and is largely overlooked in university classes.

            Option 4 is the most interesting one. Apple’s FSEvents mechanism scales a lot better than Linux’s inotify because it tells you only the directories that have changed, not the individual files. You can then look at the modification time of the files to see which have changed since your last epoch for that directory. This also has the advantage that it does some coalescing for free. I’d be inclined to subdivide these into two categories:

            • Accept less precision in a first round that can be recaptured later.
            • Accept less precision because you don’t actually care.

            FSEvents is an example of the first case. You don’t care which files are modified, because traversing a directory and looking at the modification times is cheap (or, at least, less likely to cause performance problems than blocking writes until an event notification consumer has caught up). In contrast, using lower-precision floating point for coordinates in a game, because users won’t notice if an object is less than one pixel away from its ‘correct’ position, is an example of the latter. The first has some overlap with choosing a better algorithm.
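
            In code, the FSEvents-style recovery step is roughly this (a sketch; the event mechanism itself only hands you the directory path):

              import os

              def changed_files(dir_path, last_epoch):
                  # Given a directory reported as changed, recover which files changed
                  # by comparing modification times against the previous epoch.
                  return [e.name for e in os.scandir(dir_path)
                          if e.is_file() and e.stat().st_mtime > last_epoch]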

            I think there’s a higher-level category, which is ‘reevaluate your requirements’. Some of the biggest performance gains that I’ve made have come from applying the techniques in the article to a system that was already carefully optimised. Between the time that it was written and the time that I’d looked at it, the requirements had changed significantly. Sometimes this applies to the input changing as it interfaces to a different system (you built a thing for processing SMS, now it’s used with WhatsApp, suddenly short 7-bit ASCII strings are replaced with longer UTF-8 sequences). Sometimes it’s that the hardware changed (globals got slower, passing arguments got cheaper, parallelism became the big win, cache misses got really expensive, branch predictor performance became critical).

            1. 2

              reevaluate your requirements

              Indeed. This is also, as I discuss here, an argument in favour of abstraction: maybe low-level requirements can be changed without affecting higher-level ones at all.

              1. 1

                Thanks! And I agree that “allow some incorrectness” might better be thought of as “relax your requirements”.

              2. 3

                Huh, funny. I thought the four kinds of optimizations are:

                1. Premature optimizations.
                2. Optimizations to the detriment of security.
                3. Late optimizations.
                4. Optimizations that lead to instability.
                1. 2

                  …to checking that you’ve got compiler optimisations turned on…

                  This made me chuckle a bit because I have come across more than a few people that refuse to use compiler optimizations because it “ruins their code”. Almost invariably they also grumble that the compiler does not produce good code, hence feeding back into their position against using optimizations.

                  1. 1

                    “Doctor, it hurts when I do this!”

                    “Stop doing that!”

                  2. 1

                    Regarding impossibly large state spaces, it’s interesting to consider that someone could accidentally happen upon the optimum result. It would be a complete fluke, but people would probably still regard them as a genius. Same thing goes with human intelligence in the general case, I suppose. (If I keep going down that line of thinking I might start to depersonalize.)


                    Regarding bloom filters, does anyone know if there is a data structure that gives you false negatives and the same compression profile? I feel like it must be impossible. If it is impossible, the asymmetry is interesting and makes me wonder if there is a deeper idea there.

                    1. 1

                      and the same compression profile

                      Ultimately it depends on how strictly you mean “the same compression profile” and some other assumptions. There is also a data structure that is 10+x faster and only 2–3x larger (at similar false-answer probabilities), mentioned in the original 1970 paper by Burton Bloom, but with a poor time analysis (even for the era – relegated to footnotes, likely alluding more to word-wise memory fetches than to a cache hierarchy). Along with the focus on space at the expense of time, that poor time analysis has left communication of the ideas troubled for decades.

                      To be concrete/self-contained, that alternative is just a table of hashes truncated to B bits, aka fingerprints – i.e. an array of B-bit numbers where zero marks an empty slot (so a key is never allowed to fingerprint to zero). Bit-wise CPU operations can make managing arrays of B-bit numbers efficient and somewhat straightforward. And with a little more work (Robin Hood linear probing) you can get down to almost always 1 cache miss.
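
                      A rough Python sketch of that fingerprint table (plain linear probing rather than Robin Hood, and the names are mine):

                        import hashlib

                        class FingerprintFilter:
                            # Approximate membership via B-bit fingerprints; 0 marks an empty slot.
                            def __init__(self, n_slots, fp_bits=8):
                                self.slots = [0] * n_slots    # a real version would pack B-bit slots
                                self.fp_mask = (1 << fp_bits) - 1

                            def _slot_and_fp(self, key):
                                digest = hashlib.blake2b(key.encode()).digest()
                                h = int.from_bytes(digest[:8], 'little')
                                fp = (h & self.fp_mask) or 1  # never fingerprint to zero
                                return (h >> 16) % len(self.slots), fp

                            def add(self, key):
                                idx, fp = self._slot_and_fp(key)
                                for i in range(len(self.slots)):          # linear probing
                                    j = (idx + i) % len(self.slots)
                                    if self.slots[j] in (0, fp):
                                        self.slots[j] = fp
                                        return
                                raise RuntimeError('filter is full')

                            def might_contain(self, key):
                                # False positives possible (collisions); no false negatives.
                                idx, fp = self._slot_and_fp(key)
                                for i in range(len(self.slots)):
                                    j = (idx + i) % len(self.slots)
                                    if self.slots[j] == 0:
                                        return False
                                    if self.slots[j] == fp:
                                        return True
                                return False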

                      Given the style of your thinking here, you should perhaps read about the https://en.wikipedia.org/wiki/Cuckoo_filter . That relies on a strong hash function to get results better in some but not all ways (you cannot beat almost always 1 memory fetch with a fast non-crypto hash, but a strong enough fast enough hash can probably use a bit less space with Cuckoo).

                      If you are the type to take time to implement and measure yourself, and you care more about time than space, a word of caution is warranted: with modern CPUs, one cannot just loop over keys that the CPU can know ahead of time (via either pre-fetching or speculative execution – e.g. keys from a random number generator or a buffer read from /dev/random) and expect times reflective of “real life” memory latency { where one of these probabilistic filters is “in front of” some much slower way to respond whose input is truly unpredictable, like the contents of a network packet }. A demo of this effect, where an RNG is not unpredictable enough because the CPU can “work ahead of itself”, is here. { Methodologically, it may be easier to measure search depth in units of cache lines. }

                      1. 1

                        Regarding the compression profile, I misspoke, I really meant “is it possible to get a data structure with compression that gives you false negatives instead of false positives?” Intuitively, false positives seem to allow for structural sharing of information, but I don’t know if my intuition is correct.

                        Regarding the Cuckoo Filter, I believe it also gives false positives, not false negatives. Is there a name for the other data structure you mention? Looking for the term bit level table, I didn’t find any results.

                        1. 1

                          If you cannot flip an “outer boolean construction” to get false negatives that way (like in-the-cache vs. not-in-the-cache), and you mean “instead of” strictly in your comment just above, then it may bear noting that the asymmetry comes from “determinism of registration of presence”. Really, a fingerprint is itself just another hash function with a reduced range, much as any table index reduced modulo the table length is also like that, which sometimes causes terminological confusion with “hash codes” and the like.

                          So, for fingerprints the asymmetry just comes from the hash function itself being deterministic, meaning same key => same hash. Same hash does not imply same key due to range reduction by a simple argument. The structural space sharing in Bloom filters to get a slightly more compact setup that costs much more time might be viewed as a “collective hash”, but I think the determinism asymmetry part is the same. You just have N deterministic but compressing/lossy hash functions.

                          In terms of “symmetrizing”, there are bijective integer <-> integer hash functions (passing various pseudorandom tests, yet being invertible), but I think it’s impossible to get data compression that way (other than lossless compression like Huffman coding). Meanwhile, you can construct a false negative (in the sense I think you mean it) with a complete/non-truncated table and a random oracle that just fails however often, but I don’t know if this does what you want conceptually, and you also get no compression.

                          FWIW, I think the asymmetry in question is fundamental and almost isomorphic to the Arrow of Time in thermodynamics / statistical mechanics – there, a necessary ignorance of which microstate is behind which macrostate forces probabilistic reasoning (e.g. averaging over “all possible” microstates consistent with the macrostate, in some sense). No idea of your background or if that was helpful. In the key-membership query situation, “creating the information” is kind of equivalent to “peeking into the future query stream”, violating causality (unless maybe you know the keys but not their order of presentation? One can view a perfect hash function as a sort of way to losslessly compress a fixed dataset).

                          Another way to put it or think about it might be “lossy compression can always achieve better ratios than lossless” (by the simple argument that it represents less information). I’m trying to help, but not sure I am succeeding. I worry that the reason this reverse direction seems interesting is because if it were true a lot of other seemingly very true things would be false which is often a red flag { and sometimes a proof technique - by contradiction :-) }.

                          1. 1

                            Maybe it would help if I explained my reasoning a bit.

                            Let’s start with a basic hash set. You can add items to it. They will be hashed and the value will be stored in some place within the hash set. You can check any object of the right type and learn whether or not it is in the hash set (no false positives or negatives). This is because the actual object is stored in the hash set. You hash the object to find its position in the hash set and then check whether the object in that slot is the same as the object you are looking for. If it isn’t, there might be some chaining that needs to happen.

                            Needing to store the whole object takes a lot of space, though. What if we threw away the object and just stored the hash? There would be collisions (false positives), but they are unlikely to occur. We can use the same chaining logic to disambiguate items that end up in the same slots in the backing array.

                            What if we wanted to compress the set even further? Instead of storing the whole number, we could use a bit array and set the bit at the item’s hash to 1. This might cause too many collisions (false positives) though. We could use a few hash functions to create a unique fingerprint and set all of those bits. Then, when we check for an item, we run the same hash functions and make sure all of the bits are set. Great! We’ve reinvented a Bloom filter!
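
                            In code, that derivation lands at something like this minimal (and purely illustrative) version:

                              import hashlib

                              class BloomFilter:
                                  def __init__(self, n_bits, n_hashes=3):
                                      self.bits = bytearray((n_bits + 7) // 8)
                                      self.n_bits = n_bits
                                      self.n_hashes = n_hashes

                                  def _positions(self, item):
                                      # k roughly-independent bit positions per item.
                                      for i in range(self.n_hashes):
                                          h = hashlib.blake2b(item.encode(), salt=bytes([i]) * 16)
                                          yield int.from_bytes(h.digest()[:8], 'little') % self.n_bits

                                  def add(self, item):
                                      for p in self._positions(item):
                                          self.bits[p // 8] |= 1 << (p % 8)

                                  def might_contain(self, item):
                                      # A "yes" may be a false positive; a "no" is definite.
                                      return all(self.bits[p // 8] & (1 << (p % 8))
                                                 for p in self._positions(item))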


                            Now what I started thinking was that we started with a system that could tell you perfectly whether or not something was in the set (because the whole item is stored in the set). We introduced a form of inaccuracy called false positives (it says the item is in the set, even though it isn’t) to get compression. False negatives (it says the item is not in the set, even though it is) seem symmetrical to false positives. So could we have taken the other fork and gotten some other property? If so, what? If not, why?

                            In actually writing down the steps we took to get from a hash set to a Bloom filter, I called out two places that add false positives. Hashing allows us to reduce infinite inputs into finite outputs at the cost of information loss (arrow of time). If we used hashing to produce false positives, would the inverse of hashing produce false negatives? Would false negatives require us to take finite inputs and return infinite outputs? Would we be moving from a deterministic process to a stochastic one?

                            So maybe trying to move from an accurate hash set to a false negative producing “Bloom filter” (nega-bloom filter?) wouldn’t have any practical benefit?

                            Does this line of reasoning make sense? Or am I completely off in the woods?

                            1. 1

                              I don’t think I can explain it any better than this: the forward process with false positives “loses data on purpose”, but the “symmetric” process you seem to be imagining “un-loses” data, which just cannot work unless backward time travel can { which many sci-fi authors find interesting/deep :-) and many physicists find The Arrow Of Time fascinating. }.

                              In your own words, yes, you move from a deterministic process to a stochastic one. Perhaps confusingly, the possibility of collisions already brings probability analysis into the deterministic process, while in the stochastic one it is brought in not only by accuracy but by the data structure itself, because by construction you can never know if you were right.

                              Relatedly, you cannot keep applying a lossless compression algorithm and expect similar gains in information density (or any gains, if the first re-encode is good!). With a lossy one, as here, you can “do it again” in the forward direction, losing more & more data until at some point everything is guaranteed to collide. You may like reading about information theory.

                              1. 1

                                “Un-losing” data definitely isn’t possible. But if you think about the hash set I started out with in my previous comment, hashing is an optimization for finding the information quickly, but you do keep a full copy of the data around, so that information isn’t really lost yet. It’s only lost when you transition to storing just the hash of an object. That’s where we make the active decision to trade false positives for data compression. If it helps, you could stop thinking of the initial data structure as a hash set and start thinking of it as more of a bag with O(n) lookup time.

                                So, if you start with a bag and think about adding false negatives, what would that look like and what properties would it give you? If those properties are worthwhile, then maybe you could apply search optimizations to the data structure on top of that.

                                ** edit ** And I probably do need to take some time to learn some information theory. I don’t have a very rigorous mathematical background.

                    2. 1

                      does anyone know if there is a data structure that gives you false negatives and the same compression profile?

                      Do you mean it can give false negatives but no false positives? (Presumably if you could get both false negatives and false positives it would be hard to find a use case.) That would be quite interesting. I could see it useful for things like authentication (most authentication checks in apps pass, so getting a “yes” most of the time could be a valuable cache).

                      1. 1

                        Yeah! Having just false negatives could give the data structure a different set of use cases. Security would be a good example of something that could tolerate false negatives, but not false positives. Also, it’s just interesting to consider if there is a deeper relationship between compression, false positives, and structural sharing of information. As well as the question of what properties the inverse process would yield. Mostly this is just idle curiosity, honestly.

                        False positives and false negatives in a single data structure wouldn’t be too hard. All you need to do is add deletions to a Bloom filter. If you could delete any object then it would make all of its bit array collisions false negatives. But obviously a data structure that refuses to give you a straight answer is not particularly useful.

                        I’m sorry I didn’t notice the notification for your reply before, but I’ve been messaging with cblake in a sister thread about this. If you have any thoughts on the matter, you are more than welcome to join in.

                        1. 2

                          All you need to do is add deletions to a Bloom filter

                          Did I miss something big, or is this suggesting one just “solve a major outstanding problem in computer science”?

                          1. 1

                            I don’t think I’m solving a major problem. I think I’m just adding an operation that breaks the data structure. In my mind, the delete operation takes an object and a Bloom filter. It computes the fingerprint of the object, checks whether the object might be in the Bloom filter, and then zeros out those bit locations. This could also zero out parts of other objects’ fingerprints, so you would no longer be able to detect them despite their having been added to the data structure. This is what adds the false-negative property on top of the false-positive property. Unfortunately, it’s just kinda useless.
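
                            A toy illustration of that break, with hard-coded bit positions just to show the shared-bit case:

                              bits = [0] * 8
                              A_BITS, B_BITS = (1, 4), (4, 6)      # pretend fingerprints of items A and B

                              for p in A_BITS + B_BITS:            # add A and B
                                  bits[p] = 1

                              for p in A_BITS:                     # naive delete of A zeros its positions...
                                  bits[p] = 0

                              print(all(bits[p] for p in B_BITS))  # ...False: B is now a false negative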
