1. 4

  2. 3

    Should you have enough patience, link-time optimization might come to the rescue. (It’s usually really expensive…

    In my experience it’s not too expensive, and I always enable it in optimized builds. My current main project is C++ and C that compiles to about 2 MB on x86-64, and Clang LTO adds maybe 30 seconds to the clean build time (on my 2019 MacBook Pro). It’s been a while since I measured, but IIRC it reduces code size by about 10% (we build with -Os) and noticeably improves performance.

    Some of the specific benefits I’ve seen, by tracing and disassembly, are:

    • Major functions with only one call site are inlined. There are more of these than you’d think. I can break a complex function into pieces for clarity and know it won’t affect performance, even if some of the pieces end up in other source files.
    • Simple methods like getters/setters get inlined even if their implementation isn’t in the header.
    • The optimizer gets a lot better at telling whether non-inlined functions are pure/const.
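    A minimal sketch of the cross-file inlining point (file names are made up; -flto is Clang’s standard LTO flag):

    ```shell
    # Hypothetical two-file demo: the getter lives in one translation
    # unit, the caller in another. Built with -flto, the linker-stage
    # optimizer can inline get_value() even though its body is not
    # visible from main.c.
    cat > counter.c <<'EOF'
    static int value = 41;
    int get_value(void) { return value; }
    EOF
    cat > main.c <<'EOF'
    int get_value(void);
    int main(void) { return get_value() + 1; }
    EOF
    clang -Os -flto -c counter.c main.c       # emits LLVM bitcode, not machine code
    clang -Os -flto counter.o main.o -o demo  # cross-TU optimization happens here
    ./demo || echo "exit status: $?"          # exit status: 42
    ```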

    Clang does have an incremental LTO mode that speeds it up. I don’t know how to enable it at the CLI level, but in Xcode it’s just a checkbox.

    1. 3

      LLVM has two modes for LTO, fat and thin. Fat LTO (which is sometimes just called LTO, because it was added at a time when it was the only LTO mode) is very simple in implementation. Each front-end invocation generates an LLVM bitcode file rather than object code. The linker combines all of these and then runs the LLVM back end. There are two big disadvantages with this approach:

      • The LLVM back end is single threaded, so after doing a -j64 build to get the bitcode, you then sit with only one core doing anything during the ‘link’ stage (which is really a compile stage).
      • Any change to any part of the source code requires a complete recompilation.

      The newer mode, ThinLTO, is a bit more clever. Each front-end invocation does some optimisation, but remember that an optimisation is a combination of two things: an analysis and a transform. The results of all of the analyses are published along with the partially optimised bitcode. The linker then reads these, merges the analysis results, and compiles each compilation unit in a separate thread (so you still get parallel compilation). Each optimisation in this stage can see the analysis summaries for other functions (for example, does the called function capture its arguments?) and can also pull in copies of functions from other compilation units if the summary information suggests that they might make good inlining candidates.

      With ThinLTO, some caching is also possible. If a compilation unit has not changed, and neither has any of the summary information it used, then you don’t need to recompile it. This makes ThinLTO almost as fast for incremental rebuilds as non-LTO in a lot of cases. If you change a function that’s inlined in a load of compilation units (or that a load of compilation units optimise based on some property that you’ve changed), then it will cause all of those compilation units to be recompiled. The caches can also consume quite a lot of disk space.
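      A sketch of the flags involved (file names are made up; -flto=thin is the real Clang flag, and the cache options are linker-specific):

      ```shell
      # Hypothetical files; -flto=thin enables ThinLTO in Clang.
      cat > util.c <<'EOF'
      int answer(void) { return 42; }
      EOF
      cat > main.c <<'EOF'
      int answer(void);
      int main(void) { return answer(); }
      EOF
      clang -O2 -flto=thin -c util.c main.c   # bitcode plus analysis summaries
      clang -flto=thin util.o main.o -o app   # parallel ThinLTO backend runs here
      # The incremental cache is configured on the linker; the directory
      # names below are examples:
      #   lld:  clang -flto=thin -fuse-ld=lld -Wl,--thinlto-cache-dir=lto.cache ...
      #   ld64: clang -flto=thin -Wl,-cache_path_lto,lto.cache ...
      ```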

    2. 1

      As someone who codes and used to inline speed skate, I love the photo choice.