1. 48
    1. 24

      I agree with most of what’s said in the article. However, it misses the forest for the trees a bit, at least considering the introduction. Programs that take seconds to load or display a list are not just ignoring cache optimizations. They’re using slow languages (or language implementations, for the pedantics out there) like cpython where even the simplest operations require a dictionary lookup, or using layers and layers of abstractions like Electron, or making HTTP requests for the most trivial things (I suspect it’s what makes Slack slow; I know it’s what makes GCP’s web UI absolutely terrible). A lot of bad architectural choices, too.

      Cache optimizations can be important but only as the last step. There’s a lot to be fixed before that, imho.

      1. 16

        Even beyond that, I think there are more baseline things going on: most developers don’t even benchmark or profile. In my experience the most egregious performance problems I’ve seen have been straight-up bugs, and they don’t get caught because nobody’s testing. And the profiler basically never agrees with what I would have guessed the problem was. I don’t disagree with the author’s overall point, but it’s rare to come across a program that’s slow enough to be a problem that doesn’t have much lower hanging fruit than locality issues.
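
        For anyone who has never profiled, here is a minimal sketch of what that can look like with the standard library’s cProfile and pstats modules. The functions slow_part, fast_part, work, and profile_report are made-up names for illustration, not anything from the article.

```python
import cProfile
import io
import pstats


def slow_part(n):
    # Deliberately wasteful: rebuilds the same list on every iteration.
    total = 0
    for _ in range(n):
        total += sum(list(range(50)))
    return total


def fast_part(n):
    return n * 2


def work(n):
    return slow_part(n) + fast_part(n)


def profile_report(func, *args, top=5):
    """Run func under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the expensive call chain is listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


if __name__ == "__main__":
    _, report = profile_report(work, 2000)
    print(report)
```

        The report lists where time went per function, which is usually enough to check a guess before optimizing anything.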

        1. 3

          I agree so much! I’d even say that profiling is one half of the problem (statistical profiling, that is, like perf). The other half is tracing, which nowadays can be done with very convenient tools like Tracy or the Chrome trace visualizer (“catapult”) if you instrument your code a bit so it can spit out JSON traces. These give insight into where time is actually spent.

        2. 1

          Absolutely. Most developers only benchmark if there’s a serious problem, and most users are so inured to bad response times that they just take whatever bad experience they receive and try to use the app regardless. Most of the time it’s some stupid thing the devs did that they didn’t realize and didn’t bother checking for (oops, looks like we’re instantiating this object on every loop iteration, look at that.)

      2. 9

        Programs that take seconds to load or display a list are not just ignoring cache optimizations.

        That’s right. I hammered on the cache example because it’s easy to show an example of what a massive difference it can make, but I did not mean to imply that it’s the only reason. Basically, any time we lose track of what the computer must do, we risk introducing slowness. Now, I don’t mean that having layers of abstractions or using dictionaries is inherently bad (they will likely have a performance cost, but it may be reasonable to reach another objective), but we should make these choices intentionally rather than going by rote, by peer pressure, by habit, etc.

        1. 5

          The article implies the programmer has access to low level details like cache memory layout, but if you are programming in Python, Lua, Ruby, Perl, or similar, the programmer doesn’t have such access (and for those languages, the trade off is developer ease). I’m not even sure you get to such details in Java (last time I worked in Java, it was only a year old).

          The article also makes the mistake that “the world is x86”—at work, we still use SPARC based machines. I’m sure they too have cache, and maybe the same applies to them, but micro-optimizations are quite difficult across different architectures (and even across the same family but different generations).

          1. 6

            The article implies the programmer has access to low level details like cache memory layout, but if you are programming in Python, Lua, Ruby, Perl, or similar, the programmer doesn’t have such access

            The level of control that a programmer has is reduced in favor of other tradeoffs, as you said, but there’s still some amount of control. Often, it’s found in those languages’ best practices. For example, in Erlang one should prefer to use binaries for text rather than strings, because binaries are a contiguous sequence of bytes while strings are linked lists of characters. Another example: in Python it’s preferable to accumulate small substrings in a list and then use the join method rather than using concatenation (full += sub).
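
            A small sketch of the Python case. The function names are invented for illustration, and CPython can sometimes extend a string in place on +=, so the difference is worth measuring (for example with timeit) rather than assuming.

```python
def build_with_concat(parts):
    """Accumulate by repeated concatenation (full += sub)."""
    full = ""
    for sub in parts:
        full += sub  # may copy the accumulated string on each step
    return full


def build_with_join(parts):
    """Accumulate substrings in a list, then join once at the end."""
    pieces = []
    for sub in parts:
        pieces.append(sub)
    return "".join(pieces)  # one pass, one final allocation
```

            Both return the same string; only the way the intermediate data is laid out and copied differs.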

            The article also makes the mistake that “the world is x86”—at work, we still use SPARC based machines. I’m sure they too have cache, and maybe the same applies to them, but micro-optimizations are quite difficult across different architectures (and even across the same family but different generations).

            I don’t personally have that view, but I realize that it wasn’t made very clear in the text, my apologies. Basically what I want myself and other programmers to be mindful of is mechanical sympathy — to not lose track of the actual hardware that the program is going to run on.

            1. 4

              I know a fun Python example. Check this yes implementation:

              def yes(s):
                p = print
                while True:
                  p(s)

              This hot-loop will perform significantly better than the simpler print(s) because of the way variable lookups work in Python. It first checks the local scope, then the global scope, and then the built-ins scope before finally raising a NameError exception if it still isn’t found. By adding a reference to the print function to the local scope here, we reduce the number of hash-table lookups by 2 for each iteration!

              I’ve never actually seen this done in real Python code, understandably. It’s counter-intuitive and ugly. And if you care this much about performance then Python might not be the right choice in the first place. The dynamism of Python (any name can be reassigned, at any time, even by another thread) is sometimes useful but it makes all these lookups necessary. It’s just one of the design decisions that makes it difficult to write a high-performance implementation of Python.

              1. 3

                That’s not how scoping works in Python.

                The Python parser statically determines the scope of a name (where possible.) If you look at the bytecode for your function (using dis.dis) you will see either a LOAD_GLOBAL, LOAD_FAST, LOAD_DEREF, or LOAD_NAME, corresponding to global, local, closure, or unknown scope. The last bytecode (LOAD_NAME) is the only situation in which multiple scopes are checked, and these are relatively rare to see in practice.
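
                A quick way to check this yourself, sketched with dis.get_instructions. The helper global_loads and the two tiny functions are made-up names, and the exact opcodes used for locals vary between CPython versions, so this only looks at LOAD_GLOBAL.

```python
import dis


def call_global(s):
    print(s)  # `print` is not assigned in this function, so it is a global/builtin load


def call_local(s, p=print):
    p(s)  # `p` is a parameter, so it is resolved as a local variable


def global_loads(func):
    """Return the names that func loads with LOAD_GLOBAL."""
    return {
        ins.argval
        for ins in dis.get_instructions(func)
        if ins.opname == "LOAD_GLOBAL"
    }


if __name__ == "__main__":
    print(global_loads(call_global))
    print(global_loads(call_local))
```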

                The transformation from LOAD_GLOBAL to LOAD_FAST is not uncommon, and you see it in the standard library: e.g., https://github.com/python/cpython/blob/main/Lib/json/encoder.py#L259

                I don’t know what current measurements of the performance improvement look like, after LOAD_GLOBAL optimisations in Python 3.8, which reported 40% improvement: https://bugs.python.org/issue26219 (It may be the case that the global-to-local transformation is no longer considered a meaningful speed-up.)

                Note that the transformation from global-to-local scope, while likely innocuous, is a semantic change. If builtins.print or the global print is modified in some other execution unit (e.g., another thread,) the function will not reflect this (as global lookups can be considered late-bound, which is often desirable.)
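
                To make the semantic difference concrete, here is a sketch that temporarily replaces builtins.print and compares a late-bound call with one bound at definition time. All function names are hypothetical, chosen for this example.

```python
import builtins
import contextlib
import io


def call_late(s):
    print(s)  # name is looked up when the call runs


def call_early(s, _print=print):
    _print(s)  # reference was captured when the function was defined


def compare_after_patch():
    """Patch builtins.print, call both variants, then restore it."""
    recorded = []
    original = builtins.print
    builtins.print = lambda *args, **kwargs: recorded.append(args)
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            call_late("late")    # sees the replacement
            call_early("early")  # still calls the original print
    finally:
        builtins.print = original
    return recorded, captured.getvalue()
```

                Only the late-bound call is intercepted by the replacement; the early-bound one still writes through the original print.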

                1. 8

                  I think this small point speaks more broadly to the dissatisfaction many of us have with the “software is slow” mindset. The criticisms seem very shallow.

                  Complaining about slow software or slow languages is an easy criticism to make from the outside, especially considering that the biggest risk many projects face is failure to complete or failure to capture critical requirements.

                  Given a known, fixed problem with decades of computer science research behind it, it’s much easier to focus on performance—whether micro-optimisations or architectural and algorithmic improvements. Given three separate, completed implementations of the same problem, it’s easy to pick out which is the fastest and also happens to have satisfied just the right business requirements to succeed with users.

                  I think the commenters who suggest that performance and performance-regression testing should be integrated into the software development practice from the beginning are on the right track. (Right now, I think the industry is still struggling with getting basic correctness testing and documentation integrated into software development practice.)

                  But the example above shows something important. Making everything static or precluding a number of dynamic semantics would definitely give languages like Python a better chance at being faster. But these semantics are—ultimately—useful, and it may be difficult to predict exactly when and where they are critical to satisfying requirements.

                  It may well be the case that some languages and systems err too heavily on the side of allowing functionality that reduces the aforementioned risks. (It’s definitely the case that Python is more dynamic in design than many users make use of in practice!)

                2. 2

                  Interesting! I was unaware that the parser (!?) did that optimization. I suppose it isn’t difficult to craft code that forces LOAD_NAME every time (say, by reading a string from stdin and passing it to exec) but I find it totally plausible that that rarely happens in non-pathological code.

                  Hm. For a lark, I decided to try it:

                  >>> import dis
                  >>> def yes(s):
                  ...  exec("p = print")
                  ...  p(s)
                  >>> dis.dis(yes)
                    2           0 LOAD_GLOBAL              0 (exec)
                                2 LOAD_CONST               1 ('p = print')
                                4 CALL_FUNCTION            1
                                6 POP_TOP
                    3           8 LOAD_GLOBAL              1 (p)
                               10 LOAD_FAST                0 (s)
                               12 CALL_FUNCTION            1
                               14 POP_TOP
                               16 LOAD_CONST               0 (None)
                               18 RETURN_VALUE
                  >>> yes("y")
                  Traceback (most recent call last):
                    File "<stdin>", line 1, in <module>
                    File "<stdin>", line 3, in yes
                  NameError: name 'p' is not defined
          2. 5

            and for those languages, the trade off is developer ease

            I heard Jonathan Blow make this point on a podcast and it stuck with me:

            We’re trading off performance for developer ease, but is it really that much easier? It’s not like “well, we’re programming in a visual language and just snapping bits together in a GUI, and it’s slow, but it’s so easy we can make stuff really quickly.” Like Python is easier than Rust, but is it that much easier? In both cases, it’s a text based OO language. One just lets you ignore types and memory lifetimes. But Python is still pretty complicated.

            Blow is probably a little overblown (ha), but I do think we need to ask ourselves how much convenience we’re really buying by slowing down our software by factors of 100x or more. Maybe we should be more demanding about our slowdowns and expect something that gives more back in exchange.

            1. 2

              Like Python is easier than Rust, but is it that much easier?

              I don’t want to start a fight about types but, speaking for myself, Python became much more attractive when they added type annotations, for this reason. Modern Python feels quite productive, to me, so the trade-off is more tolerable.

            2. 1

              It depends upon the task. Are you manipulating or parsing text? Sure, C will be faster in execution, but in development?

              At work, I was told to look into SIP, and I started writing a prototype (or proof-of-concept if you will) in Lua (using LPeg to parse SIP messages). That “proof-of-concept” went into production (and is still in production six years later) because it was “fast enough” for use, and it’s been easy to modify over the years. And if we can ever switch to using x86 on the servers [1], we could easily use LuaJIT.

              [1] For reasons, we have to use SPARC in production, and LuaJIT does not support that architecture.

      3. 7

        The trick about cache optimizations is that they can be a case where, sure, individually you’re shaving nanoseconds off, but sometimes those are alarmingly common in the program flow and worth doing before any higher-level fixes.

        To wit: I worked on a CAD system implemented in Java, and the “small optimization” of switching to a pooled-allocation strategy for vectors instead of relying on the normal GC meant the difference between an unusable application and a fluidly interactive one, simply because the operation I fixed was so core to everything that was being done.

        Optimizing cache hits for something like mouse move math can totally be worth it as a first step, if you know your workload and what code is in the “hot” path (see also sibling comments talking about profiling).

      4. 6

        They’re using slow languages (or language implementations, for the pedantics out there) like cpython where even the simplest operations require a dictionary lookup

        I take issue with statements like this, because the majority of code in most programs is not being executed in a tight loop on large enough data to matter. The overall speed of a program has more to do with how it was architected than with how well the language it’s written in scores on microbenchmarks.

        Besides, Python’s performance cost isn’t just an oversight. It’s a tradeoff that provides benefits elsewhere in flexibility and extensibility. Problems like serialization are trivial because of meta-programming and reflection. Complex string manipulation code is simple because the GC tracks references for you and manages the cleanup. Building many types of tools is simpler because you can easily hook into stuff at runtime. Fixing an exception in a Python script is a far more pleasant experience than fixing a segfault in a C program that hasn’t been built with DWARF symbols.
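
        As a sketch of the serialization point: a generic serializer that walks dataclass fields at runtime, with no per-class code. Point, Segment, to_plain, and to_json are invented for this example.

```python
import json
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Point:
    x: float
    y: float


@dataclass
class Segment:
    start: Point
    end: Point
    label: str = ""


def to_plain(obj):
    """Convert dataclass instances (recursively) into dicts and lists."""
    if is_dataclass(obj) and not isinstance(obj, type):
        # Reflection: discover the fields at runtime instead of listing them by hand.
        return {f.name: to_plain(getattr(obj, f.name)) for f in fields(obj)}
    if isinstance(obj, (list, tuple)):
        return [to_plain(item) for item in obj]
    return obj


def to_json(obj):
    return json.dumps(to_plain(obj), sort_keys=True)
```

        Adding a new dataclass needs no serializer changes; the runtime lookups that make this convenient are part of what the interpreter pays for.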

        Granted, modern compiled languages like Rust/Go/Zig are much better at things like providing nice error messages and helpful backtraces, but you’re paying a small cost for keeping a backtrace around in the first place. Should that be thrown out in favor of more speed? Depending on the context, yes! But a lot of code is just glue code that benefits more from useful error reporting than faster runtime.

        For me, the choice in language usually comes down to how quickly I can get a working program with limited bugs built. For many things (up to and including interactive GUIs) this ends up being Python, largely because of the incredible library support, but I might choose Rust instead if I was concerned about multithreading correctness, or Go if I wanted strong green-thread support (Python’s async is kinda meh). If I happen to pick a “fast” language, that’s a nice bonus, but it’s rarely a significant factor in that decision making process. I can just call out to a fast language for the slow parts.

        That’s not to say I wouldn’t have mechanical sympathy and try to keep data structures flat and simple from the get go, but no matter which language I pick, I’d still expect to go back with a profiler and do some performance tuning later once I have a better sense of a real-world workload.

        1. 4

          To add to what you say: Until you’ve exhausted the space of algorithmic improvements, they’re going to trump any microoptimisation that you try. Storing your data in a contiguous array may be more efficient (for search, anyway - wait until you need to insert something in the middle), but no matter how fast you make your linear scan over a million entries, if you can reframe your algorithm so that you only need to look at five of them to answer your query then a fairly simple data structure built out of Python dictionaries will outperform your hand-optimised OpenCL code scanning the entire array.
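
          A toy sketch of that reframing, with hypothetical record and function names: the scan touches every record per query, while the dictionary index only touches the matching ones after a one-time build.

```python
from collections import defaultdict


def find_by_scan(records, city):
    """Linear scan: looks at every record for every query."""
    return [r for r in records if r["city"] == city]


def build_index(records):
    """One-time pass that groups records by city."""
    index = defaultdict(list)
    for r in records:
        index[r["city"]].append(r)
    return index


def find_by_index(index, city):
    """Dictionary lookup: only touches the records that match."""
    return index.get(city, [])
```

          Whether the index pays off depends on how many queries amortize the cost of building it.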

          The kind of microoptimisation that the article’s talking about makes sense once you’ve exhausted algorithmic improvements, need to squeeze the last bit of performance out of the system, and are certain that the requirements aren’t going to change for a while. The last bit is really important because it doesn’t matter how fast your program runs if it doesn’t solve the problem that the user actually has. grep, which the article uses as an example, is a great demonstration here. Implementations of grep have been carefully optimised but they suffer from the fact that requirements changed over time. Grep used to just search ASCII text files for strings. Then it needed to do regular expression matching. Then it needed to support unicode and do unicode canonicalisation. The bottlenecks when doing a unicode regex match over a UTF-8 file are completely different to the ones doing fixed-string matching over an ASCII text file. If you’d carefully optimised a grep implementation for fixed-string matching on ASCII, you’d really struggle to make it fast doing unicode regex matches over arbitrary unicode encodings.

          1. 1

            The kind of microoptimisation that the article’s talking about makes sense once you’ve exhausted algorithmic improvements, need to squeeze the last bit of performance out of the system, and are certain that the requirements aren’t going to change for a while.

            To be fair, I think the article also speaks of the kind of algorithmic improvements that you mention.

        2. 3

          Maybe it’s no coincidence that Django and Rails both seem to aim at 100 concurrent requests, though. Both use a lot of language magic (runtime reflection/metaprogramming/metaclasses), afaik. You start with a slow dynamic language, and pile up more work to do at runtime (in this same slow language). In this sense, I’d argue that the design is slow in many different ways, including architecturally.

          Complex string manipulation code is simple because the GC tracks references for you

          No modern language has a problem with that (deliberately ignoring C). Refcounted/GC’d strings are table stakes.

          I personally dislike Go’s design a lot, but it’s clearly designed in a way that performance will be much better than python with enough dynamic features to get you reflection-based deserialization.

      5. 1

        All the times I had an urge to fire up a profiler the problem was either an inefficient algorithm (worse big-O) or repeated database fetches (inefficient cache usage). Never have I found that performance was bad because of slow abstractions. Of course, this might be because the software I work with (Python web services) has a lot of experience behind it in crafting good, fast abstractions. Of course, you can find new people writing Python who don’t use them, which results in bad performance, but that is quickly unlearned. What is important, if you want to write performant Python code, is to use as little “pure Python” as possible. Python is a great glue language, and it works best when it is used that way.
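
        The repeated-fetch case, sketched with functools.lru_cache. FakeDatabase is a made-up stand-in for a real database client, and the render functions are hypothetical.

```python
from functools import lru_cache


class FakeDatabase:
    """Hypothetical stand-in for a database client that counts queries."""

    def __init__(self, rows):
        self.rows = rows
        self.queries = 0

    def fetch_user(self, user_id):
        self.queries += 1
        return self.rows[user_id]


def render_names_naive(db, user_ids):
    # One query per id, even when the same id repeats.
    return [db.fetch_user(uid) for uid in user_ids]


def render_names_cached(db, user_ids):
    # Memoize for the duration of this call: one query per distinct id.
    cached_fetch = lru_cache(maxsize=None)(db.fetch_user)
    return [cached_fetch(uid) for uid in user_ids]
```

        In a real service the cache lifetime and invalidation matter a lot more than in this sketch.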

        1. 1

          Never have I found that performance was bad because of slow abstractions.

          I have. There was the time when fgets() was the culprit, and another time when checking the limit of a string of hex digits was the culprit. The most surprising result I’ve had from profiling is a poorly written or poorly compiled library.

          Looking back on my experiences, I would have to say I’ve been surprised by a profile result about half the time.

      6. 1

        As a pedantic out here, I wanted to say that I appreciate you :)

    2. 6

      The quote from SICP stuck out at me:

      “programs must be written for people to read, and only incidentally for machines to execute.”

      I think we may have gone too far in this direction, and are perhaps at the illuminated manuscript (javascript?) stage of things.

      Hot take: given how little of the iceberg of our dependencies and libraries we see, I would posit that most people don’t read programs in any meaningful holistic way at all.

      1. 5

        I would have called Literate Programming the “Illuminated Manuscript” of our art.

      2. 2

        I’m struggling to understand this reply. We’ve made programs so readable that nobody bothers reading them?

        1. 3

          I think we’ve lost our empathy for the machines, and have hit a degree of performative coding where we perhaps are writing programs for other programmers rather than for machines.

          Which is fine, until you consider that the programmers won’t be executing the programs and the Earth is drowning in e-waste and waste heat.

    3. 6

      There’s something very interesting here I’m finding hard to reconcile both in this article and the discussion around it. I’m looking at how we use most management frameworks and they usually try to converge on binary criteria for acceptance in the “definition of done”:

      • Is it documented?
      • Does it work?
      • Did it pass QA?
      • Did it pass acceptance / style tests?

      “Is it fast or efficient?” is not easily captured or measured, especially before the fact. When it comes to metrics, one would believe that money spent would be a good, albeit last-ditch, indicator of how efficient a particular piece of code is, but I’ve found that this lack of expertise also permeates business management. I’m still shocked as I move across the industry that most businesses have little to no idea of how much it costs to run their product. They know (most of the time) when they stop making a profit, but rarely have a full picture of why they are profitable.

      I’m not sure it’s reasonable to target efficiency at scale, but it can certainly be used as an edge by those who can capture it. I think that understanding efficiency requires a holistic view, which I don’t think a modern assembly-line work environment actually encourages.

      In the end, “smart” compilers seem like the most efficient way of addressing inefficiency, but education would certainly drive the industry in a better direction.

    4. 4

      Nice read and I agree with your main point! However, I feel that the code example you give actually argues against the point you’re trying to make with it.

      We often hear that programmers can’t beat an optimizing compiler, but in this case we did.

      This is not entirely true. While the by_row function obviously performs better than the by_col variant, neither of them gives the compiler much room to optimize, precisely because they concern themselves with implementation details.

      A sufficiently abstracted variant of this function that does allow the compiler to make some choices, would be

      fn by_it(v: &[f32]) -> f64 {
          v.iter().fold(0.0, |acc, x| acc + (*x) as f64)
      }

      Which, in this case, translates to the “optimal” handwritten variant.

      So, while I agree that we should give some priority to executability and not just readability of code, I’m not sure the solution would be to hard-code more implementation details of a system that we often don’t know the details of. Instead, a more abstract representation would allow the compiler, with knowledge of the target system, to actually make somewhat optimal choices.

    5. 3

      This is one of those things you can’t solve by writing a text, no matter how smart. Both SICP and the author are right, but it’s a question of when to optimize and how much. And it can only be learned over time.

    6. 2

      I’m very happy to see this discussion popping up. We are so far gone on resource wastefulness. Beyond any reasonable limit. I see Kubernetes clusters with dozens of nodes and combined terabytes of RAM running software that does essentially the same as a PHP script 15 years ago running on a shared hosting account.

      I look at Chrome, which essentially has the same core functionality as it had at its inception but requires an order of magnitude more resources to be usable.

      We have to stop this. If not because it’s nonsense, perhaps to spare the planet a gigantic abuse of resources.

      Lobsters. What can we do? How do we fight this tendency? Should we start a club/culture/meme/religion/whatever for resource-friendly software? How/where do we start?

    7. 2

      The list of fast things that won should include Friendster, which died because it was slow.

    8. 2

      Programmer time is more expensive than CPU cycles. Whining about it isn’t going to change anything, and spending more of the expensive thing to buy the cheap thing is silly.

      1. 15

        The article makes a good counterpoint:

        People migrate to faster programs because faster programs allow users to do more. Look at examples from the past: the original Python-based bittorrent client was quickly overtaken by the much faster uTorrent; Subversion lost its status as the premier VCS to Git in large part because every operation was so much faster in Git; the improved grep utility, ack, is written in Perl and waning in popularity to the faster silversurfer and ripgrep; the Electron-based editor Atom has been all but replaced by VSCode, also Electron-based, but which is faster; Chrome became the king of browsers largely because it was much faster than Firefox and Internet Explorer. The fastest option eventually wins. Would your project survive if a competitor came along and was ten times faster?

        1. 7

          That fragment is not great in my opinion. Svn-git change is about the whole architecture not about implementation speed. A lot of speedup in that case comes from not going to the server for information. Early git was mainly shell and perl too so it doesn’t quite mesh with the python example before. Calling out Python for BitTorrent is not a great example either - it’s an io-heavy app rather than processing heavy.

          Vscode has way more improvements over atom and available man-hours. If it was about performance, sublime or some other graphical editor would take over from them.

          I get the idea and I see what the author is aiming for, but those examples don’t support the post.

          1. 3

            I was an enthusiastic user of BitTorrent when it was released. uTorrent was absolutely snappier and lighter than other clients, specifically the official Python GUI. It blew the competition out of the water because it was superior in its pragmatism. Perhaps Python vs. C is an oversimplification. The point would still hold even in the presence of two programs written in the same language.

            The same applies for git. It feels snappy and reliable. Subversion and CVS, besides being slow and clunky, would gift you a corrupted repo every other Friday afternoon. Git pulverised this nonsense brutally quick.

            The point is about higher quality software built with better focus, making reasonable use of resources, resulting in superior experience for the user. Not so much about a language being better than others.

          2. 2

            BitTorrent might seem IO heavy these days; ironically this is because it has been optimised to death; but you are revising history if you think that it’s not CPU/Memory intensive and doing it in python would be crushingly slow.

            The point at the end is a good one though, you must agree:

            Would your project survive if a competitor came along and was ten times faster?

            1. 1

              I was talking about the actual process not the specific implementation. You can make BitTorrent cpu-bound in any language with inefficient implementation. But the problem itself is IO bound, so any runtime should also be able to get there. (Modulo the runtime overhead)

        2. 2

          This paragraph popped out at me as historically biased and lacking in citations or evidence. With a bit more context, the examples are hollow:

          • The fastest torrent clients are built on libtorrent (the one powering rtorrent), but rtorrent is not a very common tool
          • Fossil is faster than git
          • grep itself is more popular than any of its newer competitors; it’s the only one shipped as a standard utility
          • Atom? VSCode? vim and emacs are still quite popular! Moreover, the neovim fork is not more popular than classic vim, despite speed improvements
          • There was a period of time when WebKit was fastest, and browsers like uzbl were faster than either Chrome or Firefox at rendering, but never got popular

          I understand the author’s feelings, but they failed to substantiate their argument at this spot.

        3. 2

          This is true, but most programming is done for other employees, either of your company or another if you’re in commercial business software. These employees can’t shop around or (in most cases) switch, and your application only needs to be significantly better than whatever they’re doing now, in the eyes of the person writing the cheques.

          I don’t like it, but I can’t see it changing much until all our tools and processes get shaken up.

      2. 11

        But we shouldn’t ignore the users’ time. If the web app they use all day long takes 2-3 seconds to load every page, that piles up quickly.

        1. 7

          While this is obviously a nuanced issue, personally I think this is the key insight in any of it, but the whole “optimise for developer happiness/productivity, RAM is cheap, buy more RAM (etc)” line totally ignores it. Let alone the “rockstar developer” spiel. Serving users’ purposes is what software is for. A very large number of developers lose track of this because of an understandable focus on their own frustrations, and tools that make them more productive are obviously valuable, as well as meaning they have a less shitty time, which is meaningful and valuable. But building a development ideology around that doesn’t make this go away. It just makes software worse for users.

          1. 7

            Occasionally I ask end-users in stores, doctor’s offices, etc what they think of the software they’re using, and 99% of the time they say “it’s too slow and crashes too much.”

            1. 2

              Yes, and they’re right to do so. But spending more programming time using our current toolset is unlikely to change that, as the pressures that selected for features and delivery time over artefact quality haven’t gone anywhere. We need to fix our tools.

          2. 5

            In an early draft, I cut out a paragraph about what I am starting to call “trickle-down devenomics”; this idea that if we optimize for the developers, users will have better software. Just like trickle-down economics, it’s just snake oil.

            1. 1

              Alternately, you could make it not political.

              Developers use tools and see beauty differently from normal people. Musicians see music differently, architects see buildings differently, and interior designers see rooms differently. That’s OK, but it means you need software people to talk to non-software people to figure out what they actually need.

      3. 3

        Removed because I forgot to reload and multiple others gave the same argument I did in the meantime already.

      4. 3

        I don’t buy this argument. In some (many?) cases, sure. But once you’re operating at any reasonable scale you’re spending a lot of money on compute resources. At that stage even a modest performance increase can save a lot of money. But if you closed the door on those improvements at the beginning by not thinking about performance at all, then you’re kinda out of luck.

        Not to mention the environmental cost of excessive computing resources.

        It’s not fair to characterize the author as “whining about” performance issues. They made a reasonable and nuanced argument.

      5. 3

        Yes. This is true so long as you are the only option. Once there is a faster option, the faster option wins.


        Not for victories in CPU time. The only thing more scarce and expensive than programmer time is…. User Time. Minimize user time and pin cpu usage at 100% and nobody will care until it causes user discomfort or loss of user time elsewhere.

        Companies with slow intranets cause employees to become annoyed, and cause people to leave at some rate greater than zero.

        A server costs a few thousand dollars on the high end. A smaller program costs a few tens of thousands to build and maintain and operate. That program can cost more than hundreds of thousands in management and engineer and sales and marketing and HR and quality and training and compliance salaries to use it over its life.

    9. 1

      I think the article misses a huge point why free and open source software exists. It’s a tradeoff between solving a problem you as the developer have and being motivated enough to actually solve it. Having fun and keeping active is a bonus.

      There are dozens of apps I’ve written in my life that should have been done in C or C++ or Java or Rust because of speed and memory and whatever. But either it was the wrong tool, or the wrong time, or I simply wanted to work with a different language. That’s why the thing exists in the first place, if I had been pressured into optimal performance it would not.

    10. 1

      We only needed to change the order of the loops, but the Rust compiler which uses the LLVM backend, did not do it. Maybe the famed sufficiently smart compiler would, but for the foreseeable future, arranging data in a way that it can be processed efficiently will not be the compiler’s job — it’ll be ours.

      Loop interchange isn’t an exotic optimization; there were compilers performing it in the 1980s. LLVM has a loop interchange pass that should do this, I’m not sure why it wasn’t triggered.
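
      For readers who haven’t seen the transformation, here is a sketch of loop interchange over a row-major flat list. The function names are hypothetical (loosely echoing the article’s by_row and by_col), and in CPython the list holds pointers to objects, so the locality effect is far weaker than in the Rust example; this only shows the shape of the change.

```python
def make_grid(rows, cols):
    # Row-major flat list: element (r, c) lives at index r * cols + c.
    return [float((r * cols + c) % 7) for r in range(rows) for c in range(cols)]


def sum_by_col(grid, rows, cols):
    total = 0.0
    for c in range(cols):          # outer loop over columns
        for r in range(rows):
            total += grid[r * cols + c]   # jumps `cols` elements each step
    return total


def sum_by_row(grid, rows, cols):
    total = 0.0
    for r in range(rows):          # loops swapped: outer loop over rows
        for c in range(cols):
            total += grid[r * cols + c]   # visits consecutive indices
    return total
```

      Both functions compute the same sum; only the order in which memory is visited changes.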

      1. 2

        That’s exactly the point. We can’t rely on compilers to do it even when we think they should. But designing the system correctly from the start frees us from having to do the guesswork of hitting the compiler’s optimization passes.

      2. 1

        I suspect that it’s because some optimisations are hard/brittle to “pattern-match” due to the low-level nature of LLVM’s IR (e.g. some front-end might elide important type/shape information when compiling to the IR, or some optimisations are simply too costly to track without blowing the compile time). The goal of MLIR is to provide tools that allow creating composable IRs where each IR might contain specific information, so that optimisations are easy to detect and apply.

        Luckily @david_chisnall often comments here and might shed light on this.

    11. 1

      Waiting for the inevitable dethronement of Slack and Discord by Ripcord. I jest, I wish faster tech won a majority of the time, but I’ve seen enough to know that’s just not the case.

    12. 1

      Mobile Safari can’t make a secure connection - server issues? Cert incompatibility? - so: Wayback Machine mirror. (the “cached” link by the story hadn’t gotten a copy yet.)