1. 37

    1. 5

      Does Rust have an equivalent of -Os, i.e. optimize for size? In my experience this helps a lot with over-inlining. When I was at Apple in the earlier days of Mac OS X, the guideline was to use optimize-for-size by default, except in specific places that demonstrably benefited from optimize-for-speed. Smaller binaries are often faster anyway, due to better caching (and less swapping, back then when a lot of Macs were underprovisioned with RAM.)

      1. 7

Yup, there are -C opt-level=s and -C opt-level=z
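For reference, a minimal sketch of how this is usually set in a Cargo project (the string values map to rustc's -C opt-level=s and -C opt-level=z):

```toml
# Cargo.toml: optimize release builds for size rather than speed
[profile.release]
opt-level = "s"   # or "z" for more aggressive size reduction
```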

      2. 2

        i-cache exhaustion and the related performance problems are global properties. If I’m reading this article correctly, he’s trying to determine something in a microbenchmark that shows up only in macrobenchmarks. You see quite sharp performance cliffs between things that fit in L1 i-cache and things that don’t. You then see other ones when code doesn’t fit in L2 or when you start hitting TLB misses.

        In a microbenchmark, your code will basically always be in the i-cache for both cases, and the non-inlined performance will not suffer from branch-predictor aliasing: you're always hitting in the branch predictor and it's always giving the correct prediction. In both cases, you'll see some small changes from different code layouts (but you may also see similar changes from different compiler versions), but you won't see that you've slightly increased i-cache pressure.

        Worse, when you do start getting i-cache misses, they're very likely to be in a completely different part of the code. Your code may be sufficiently hot that it stays in the i-cache, but now something else has fallen out. This is why people warn about i-cache usage from inlining: it's really easy to keep adding just a little bit of inlining everywhere until you're getting cache misses on the hot paths. Then your whole system performance sucks, but it's really hard to point to any single cause.

        1. 1

          Do you know a good benchmark that attributes slowdown from inlining specifically to i-cache misses, as opposed to branch-predictor aliasing, increased register pressure, or some other effect?

          1. 1

            I don’t, because it’s incredibly hard to write benchmarks for things that are emergent properties of large-scale complexity. In code that I’ve personally profiled, I’ve seen a big drop in performance after a small change and used pmc to see a big jump in i-cache misses.

            I’ve also seen big perf drops from rename register exhaustion, which is a really annoying one because it’s very microarchitecture dependent. An xor %rax, %rax in a hot loop got a 50% speedup in one case: the CPU could unroll the loop, but it was speculatively executing everything and so it didn’t know if the value in %rax was live on the way out of the loop, so it had to keep a copy of it in a rename register. The xor instruction let the CPU use a canonical zero-value rename register as the storage for %rax and so moved the bottleneck somewhere else (not sure if it was execution units or memory bandwidth, but at that point it was well into the fast-enough-don’t-care-anymore category).

            If you really want to see i-cache pressure, talk to some of the Facebook folks that work on HHVM. Their system ends up producing around 300 MiB of instructions in the steady-state case (or did a few years ago - they were actively working on reducing it last time I saw). They have some fantastic data showing the performance cliffs when they exhaust L1 i-cache, L2 cache, L3 cache, and TLB entries for the hot code.

        2. 1

          That’s interesting. I wonder if it would be helpful to mark initialize as cold. Presumably in that case the compiler would try to avoid slowing down the fast path by doing things such as hoisting work above the first check. I don’t know if the compiler would be smart enough to defer the saving of the registers though.

          1. 2

            In terms of the benchmark, cold+inline(always) doesn’t change anything. In the real code, yeah, the function is marked as #[cold] and it does improve some benchmarks.
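A minimal sketch of the pattern being discussed (names like Lazy and expensive_setup are made up for illustration): the hot accessor is inlined, while the rarely-taken initialization path is marked #[cold] so the optimizer keeps it out of line and avoids penalizing callers.

```rust
// Hypothetical lazily-initialized value; the cold path stays out of
// the code that gets inlined into callers.
struct Lazy {
    value: Option<u64>,
}

impl Lazy {
    #[inline(always)]
    fn get(&mut self) -> u64 {
        match self.value {
            Some(v) => v,              // hot path: a load and a branch
            None => self.initialize(), // cold path: out-of-line call
        }
    }

    // #[cold] tells the optimizer this path is rarely taken, so work
    // (like saving registers) should not be hoisted into the fast path.
    #[cold]
    #[inline(never)]
    fn initialize(&mut self) -> u64 {
        let v = expensive_setup();
        self.value = Some(v);
        v
    }
}

fn expensive_setup() -> u64 {
    42 // stand-in for real initialization work
}

fn main() {
    let mut l = Lazy { value: None };
    assert_eq!(l.get(), 42); // first call takes the cold path
    assert_eq!(l.get(), 42); // later calls stay on the hot path
}
```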

        3. 2

          That’s probably the main downside of AOT-compiled languages: you have to decide whether to inline during compilation rather than at run time, and you can’t inline after dynamic linking. And function calls are relatively expensive.

          I’m curious: are there examples of JIT compilers that can also monomorphize at run time? Probably JITs for dynamic languages do almost that, but are there JITs for static languages that do monomorphization?

          1. 3

            JVM probably does it.

            Dart and TypeScript pretend to have types at “compile” time, and then run in a dynamically-typed VM with hidden classes optimization.

            1. 3

              Note that inlining itself is, in some sense, an AOT-specific concept. In a JIT, you don’t need to care about function boundaries at all; you can do a tracing JIT.

              The TL;DR is that you observe a program at runtime and identify a runtime loop: a sequence of instructions that is repeatedly executed. You then compile this loop as a whole. The loop can span multiple source functions/runtime function calls. In each function, the loop includes only the hot-path parts.

              So, a powerful JIT can transparently tear through arbitrary many layers of dynamic dispatch, the code is fully monomorphized in terms of instruction sequence.

              What a JIT can’t do transparently, without the help from source level semantics, is optimize the data layout. If a Point is heap allocated, and a bunch of Points is stored in a HashMap, the JIT can’t magically specialize the map to store the points inline. Layout of data in memory has to be fixed, as it must be compatible with non-optimized code. The exception here is that, when an object doesn’t escape, JIT might first stack-allocate it, and then apply scalar replacement of aggregates.
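The data-layout point can be made concrete in Rust, where the choice is visible in the types (a sketch; Point and the map are made-up names): a static language commits to the inline layout at compile time, whereas a JIT starting from boxed objects cannot retroactively flatten them, because unoptimized code may still hold pointers into the boxes.

```rust
use std::collections::HashMap;

// Layout chosen at compile time: Point is stored inline in the map's
// buckets, with no per-point heap allocation.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let mut inline_points: HashMap<u32, Point> = HashMap::new();
    inline_points.insert(1, Point { x: 3, y: 4 });

    // The boxed equivalent: one extra allocation and pointer chase per
    // point. This is roughly where a dynamic-language JIT starts, and
    // it can't magically turn this into the inline layout above.
    let mut boxed_points: HashMap<u32, Box<Point>> = HashMap::new();
    boxed_points.insert(1, Box::new(Point { x: 3, y: 4 }));

    assert_eq!(inline_points[&1], *boxed_points[&1]);
}
```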

              1. 1

                The exception here is that, when an object doesn’t escape, JIT might first stack-allocate it, and then apply scalar replacement of aggregates.

                JITs can be a bit more clever with escape analysis: they don’t have to prove that an object never escapes in order to deconstruct its parts, they just have to make sure that any deconstruction of an object is never visible to the outside world. In other words, one can deconstruct an object temporarily provided it’s reconstructed at exit points from the JIT compiled code.

                1. 1

                  For dynamic language JIT compilers—e.g. LuaJIT—the compiler has to insert type check guards into the compiled code, right? And other invariant checks. How much do these guards typically cost?

                  I can imagine how a large runtime loop (100s of instructions) could place guards only at the entry point, leaving the bulk of the compiled section guard-free. I can also imagine eliding guards if the compiler can somehow prove a variable can only ever be one type. But for dynamic languages like Lua it could be too hard to perform meaningful global analysis in that way. If you have any insight I’d appreciate it, I’m just speculating.

                  1. 2

                    I am not really an expert, but here’s my understanding.

                    Language dynamism is orthogonal to AOT/JIT and inlining. JITs for dynamic languages do need more deoptimization guards. The guards themselves are pretty cheap: they are trivial predicated branches.

                    As usual, what kills you is not the code, it’s data layout in memory. In a static language, an object is a bunch of fields packed tightly in memory; in a dynamic language, a general object is some kind of hash map. Optimizing those hash-map lookups into direct accesses via offset is where the major performance gains/losses are.
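A rough sketch of why a guard is "a trivial predicated branch" (the names add_compiled and deopt_add are hypothetical, and a real JIT emits this in machine code rather than source): the fast path assumes the operands are integers, and the guard is a cheap, well-predicted tag check that bails out to a generic fallback.

```rust
// Hypothetical runtime value of a dynamic language.
enum Value {
    Int(i64),
    Str(String),
}

// "Compiled" fast path: specialized for Int + Int, guarded by a tag
// check that almost always predicts correctly.
fn add_compiled(a: &Value, b: &Value) -> i64 {
    if let (Value::Int(x), Value::Int(y)) = (a, b) {
        x + y // specialized fast path, no dispatch
    } else {
        deopt_add(a, b) // guard failed: deoptimize to the generic path
    }
}

// Generic fallback, standing in for the interpreter's slow path.
#[cold]
fn deopt_add(a: &Value, b: &Value) -> i64 {
    let to_i = |v: &Value| match v {
        Value::Int(i) => *i,
        Value::Str(s) => s.parse().unwrap_or(0),
    };
    to_i(a) + to_i(b)
}

fn main() {
    assert_eq!(add_compiled(&Value::Int(2), &Value::Int(3)), 5);
    assert_eq!(add_compiled(&Value::Int(2), &Value::Str("3".into())), 5);
}
```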

                2. 2

                  I believe C#’s CLR does this, it acts like Java at compile time but then monomorphizes generics at run time.

                  1. 1

                    .NET generics use monomorphization for value types and a shared instantiation for reference types. .NET Generics Under the Hood shows some of the implementation details.

                  2. 2

                    To alleviate this, I think Clang has a feature (profile-guided optimization) where you profile a program and use that information to guide the optimisation when you compile it again. This is used by Apple when building Clang itself.

                  3. 1

                    I know Rust hasn’t gotten around to ABI stability yet, but when it does, inline functions exposed from a shared library are problematic. Since the function gets compiled into the dependent code, changing it in the library and swapping in the newer library (without rebuilding the dependent code) still leaves obsolete instances of the inline in the dependent code, which can easily cause awful and hard-to-diagnose bugs. (Obsolete struct member offsets, struct sizes, vtable indices…)

                    For comparison, Swift, which did recently gain ABI stability in 5.1, has some special annotations and rules covering module-public inline functions.
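The failure mode described above can be sketched in Rust (hypothetical library code; the function and constant are made up): an #[inline]-annotated public function's body may be copied into every dependent crate at compile time, so swapping in a newer library does not update those copies.

```rust
// Hypothetical shared-library code. Because the body is available for
// inlining, dependent crates may bake it in at their compile time. If
// the library later changes OFFSET and is swapped in without
// rebuilding dependents, stale inlined copies still return the old
// value -- the "obsolete struct offsets/sizes" class of bug.
#[inline]
pub fn header_size() -> usize {
    const OFFSET: usize = 16; // baked into callers when inlined
    OFFSET
}

fn main() {
    assert_eq!(header_size(), 16);
}
```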

                    1. 4

                      The main problem with library boundaries is not inlined methods but heavy use of polymorphism (without dyn) in most Rust code, because polymorphism is easily accessible and static-dispatch by default. C++ has this issue too (there are even “header-only libraries”), despite virtual methods having dynamic dispatch only. Swift probably inherited Objective C’s tradition of heavy use of dynamic dispatch.

                      Some libraries intentionally limit use of static-dispatch polymorphism, for example Bevy game framework stated it as one of its distinguishing features (however the main concern there is compilation speed, not library updates).
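The static-vs-dynamic dispatch distinction being discussed looks like this in Rust (a sketch with made-up function names): the generic form is monomorphized into every caller, which is what leaks implementation across library boundaries; the dyn form compiles to one body dispatched through a vtable.

```rust
use std::fmt::Display;

// Static dispatch: monomorphized per concrete T. Each instantiation is
// a separate compiled copy, often inlined into the caller -- great for
// speed, bad for stable library boundaries.
fn show_generic<T: Display>(x: T) -> String {
    format!("{x}")
}

// Dynamic dispatch: one compiled body, calls go through a vtable at
// run time. This is the form that survives a dynamic-library boundary.
fn show_dyn(x: &dyn Display) -> String {
    format!("{x}")
}

fn main() {
    assert_eq!(show_generic(7), "7");
    assert_eq!(show_dyn(&7), "7");
}
```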

                      1. 8

                        Swift probably inherited Objective C’s tradition of heavy use of dynamic dispatch.

                          Not really: Swift uses compiler heroics to blur the boundary between static and dynamic approaches to polymorphism. Things are passed around without boxing but still allow for separate compilation and ABI stability. Highly recommend reading up on how it works.

                        1. 3

                          Across ABI boundary it’s still dynamic dispatch. It’s “sized” only because their equivalent of trait objects has methods for querying size and copying.

                          1. 2

                            Hm, I think it’s more of a “pick your guarantees” situation:

                            • for public ABI, there are attributes to control the tradeoff between dynamism and resilience to ABI changes
                            • for internal ABIs (when you compile things separately, but in the same compilation session) the compiler is allowed, but not required, to transparently specialize calls across compilation unit boundaries.
                      2. 2

                        An interesting case study here is Zig’s self-hosted compiler. By merging the compiler and linker, it already allows for partial recompilation inside one compilation unit, including inlined calls.