1. 63

  2. 10

    I write a fair bit of Rust and have mostly ignored binary size as a thing to optimise for so far, but I’d like to learn more. Assuming the extra size ends up as executable code and not just stuff in the data segment, I’m curious: what are the drawbacks of a bigger binary/more generated code? One possible reason that comes to mind is less efficient use of CPU caches if code is more “spread out”. Are there RAM consumption consequences of a larger binary?

    1. 15

      There’s also the wasm case, which is going to become increasingly important.

      The most interesting single feature of the feedback from this post is the extremely wide variance in how much people care about this. It’s possible I personally care more just because my experience goes back to 8 bit computers where making things fit in 64k was important (also a pretty long stretch programming 8086 and 80286). Other people are like, “if it’s 100M executable but is otherwise high quality code, I’m happy.” That’s helping validate my decision to include the localization functionality.

      1. 5

        My experience with Rust binary size has led me to assume that it’ll be a non-starter in WASM, which is a pity. It’s going to be hard to compete with JS for startup time.

      2. 5

        There’s a finite-size instruction cache. You don’t have to fit the whole program in it, but hot-looping tasks should.

        At least for inlining, the compiler is supposed to understand the tradeoff between code size and performance.

        The biggest bloat potential comes from large generic functions used with many different parameters. For each type you get a new copy of the function, so you may end up with 8 versions of the same function. cargo-bloat is useful for finding these.

        Libstd has this pattern:

        fn generic(arg: impl AsRef<Foo>) {
            non_generic(arg.as_ref());
        }
        
        fn non_generic(arg: &Foo) {…}
        

        this way even if you call generic with 10 different types, you still get 1 non_generic copy, and the rest will probably inline down to nothing.
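
        A minimal runnable sketch of that shim pattern, assuming Path as the concrete target type. The names extension_len and extension_len_inner are made up for illustration and are not taken from libstd:

        ```rust
        use std::path::{Path, PathBuf};

        // Thin generic shim: monomorphized once per caller type, but it only
        // converts the argument and forwards it, so each copy is tiny.
        fn extension_len(path: impl AsRef<Path>) -> usize {
            extension_len_inner(path.as_ref())
        }

        // The real work lives here and is compiled exactly once.
        // (Hypothetical helper, named for this example only.)
        fn extension_len_inner(path: &Path) -> usize {
            path.extension().map_or(0, |ext| ext.len())
        }

        fn main() {
            // Three different argument types, one copy of the inner function.
            println!("{}", extension_len("notes.txt"));
            println!("{}", extension_len(String::from("archive.tar.gz")));
            println!("{}", extension_len(PathBuf::from("README")));
        }
        ```

        Whether the shims really inline away depends on the optimizer, so it is worth checking with cargo-bloat rather than assuming.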

        1. 5

          I often heard the critique of C++ that classes are “slow” because of dynamic dispatch. Not sure if it’s a real problem. Rust does monomorphization by default and it’s critiqued for bloated binaries. It’s not an easy tradeoff. Also, it looks like dynamic dispatch with dyn trait objects is considered somewhat “non-idiomatic”.

          I think at-runtime features such as trait objects and Rc should be more widely used, at least in application code.
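
          To make the tradeoff concrete, here is a small sketch of the same function written both ways. The Shape trait and the two structs are invented for the example:

          ```rust
          trait Shape {
              fn area(&self) -> f64;
          }

          struct Square(f64);
          struct Rect(f64, f64);

          impl Shape for Square {
              fn area(&self) -> f64 {
                  self.0 * self.0
              }
          }

          impl Shape for Rect {
              fn area(&self) -> f64 {
                  self.0 * self.1
              }
          }

          // Static dispatch: the compiler emits one copy per concrete `S` used.
          fn total_area_static<S: Shape>(shapes: &[S]) -> f64 {
              shapes.iter().map(|s| s.area()).sum()
          }

          // Dynamic dispatch: a single copy; each `area` call goes through a vtable.
          fn total_area_dyn(shapes: &[Box<dyn Shape>]) -> f64 {
              shapes.iter().map(|s| s.area()).sum()
          }

          fn main() {
              let squares = [Square(2.0), Square(3.0)];
              println!("{}", total_area_static(&squares));

              // Trait objects also allow mixing concrete types in one collection.
              let mixed: Vec<Box<dyn Shape>> =
                  vec![Box::new(Square(2.0)), Box::new(Rect(2.0, 5.0))];
              println!("{}", total_area_dyn(&mixed));
          }
          ```

          The dyn version costs an indirect call per element and a heap allocation per Box, but it adds only one function body to the binary no matter how many shape types exist.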

          1. 6

            Exactly. Rust has the reverse problem. There’s a hyperfocus on never doing dynamic dispatch and I feel it’s underused. I have yet to see a performance problem directly related to trait objects outside of a narrow set of programs, mainly synthetic benchmarks.

            Also, some libraries tend to be hypergeneric, which only allows you to roll out the code when assembling the main program, hampering early compilation in crates.

          2. 1

            The biggest bloat potential comes from large generic functions used with many different parameters.

            In this case is there a drawback aside from the larger binary? I.e. is there a runtime performance impact of the larger binary?

            1. 5

              If every single program on a system requires hundreds of megabytes, things just become unwieldy. Cutting waste is good; pointlessly large programs waste bandwidth everywhere: disk, RAM, and internet.

              Ever wonder why Windows Update takes minutes instead of seconds? I often do…

              1. 4

                Cool, I can get behind that. Just trying to work out if the primary motivation is disk use or something else.

                1. 3

                  Disk, but your CPU’s instruction cache is also a scarce resource. You want a hot loop’s instructions to fit entirely in it, so each iteration of the loop doesn’t involve re-pulling the loop instructions from a slower level of memory.

                  Excessive bloating from a lot of calls to different versions of monomorphized functions and inlining could potentially mean that that kind of important code won’t fit in the cache.

              2. 1

                I haven’t looked for or seen measurements showing (or not showing) this, but one assumes that this could result in more frequent instruction cache misses if you are commonly using the different monomorphized versions of the generic function.

          3. 6

            A lot of this is the lack of dynamic libraries. Dynamic libraries were made to solve this very problem, back in the early 90’s or whatever when suddenly you wanted to run multiple programs which each included a 500kb GUI library on a workstation with 4 MB of RAM. Rust’s “static by default” solves a lot of problems that dynamic libraries then cause, like “how do you put generic data structures in a dynamic library” (answer: you don’t). We have gigs of memory so usually I don’t care. It’s a problem in two use cases: Embedded, and wasm. Those are worth pursuing though.

            rustc does have an “optimize for size” option, it’s just not very well advertised. To pick a random, complex example I tried the ggez bunnymark example on my desktop:

            • Debug: 60 fps at 100-200 bunnies, binary is 95 MB (!). Bit excessive, I gotta say. Stripped, it’s 15 MB.
            • Release: 60 fps at 12,000+ bunnies, binary is 7.5 MB.
            • Release with opt-level="s": 60 fps at 10,000 bunnies, binary is 7.3 MB
            • Release with opt-level="z": 60 fps at 5500 bunnies, binary is 4 MB

            Not world-changing, really, but it is better. cargo-bloat is also an awesome tool, I should use it more.

            1. 5

              I installed Rust Analyzer and while it works well it took roughly forever and takes 2.3G of space. Sure makes Eclipse seem svelte.

              1. 3

                To clarify, 2.3G is the size of the target dir for the rust-analyzer after building from source? The size of the binary should be much smaller, but is still gigantic (200M), mainly because we compile with debuginfo.

                I actually did a couple of debloating PRs recently, but it’s still not exactly lean :(

                1. 3

                  Yeah, that’s how much storage was taken when I ran cargo install-ra.

                  The binary is a giant 200M and it takes 450M of ram (1.5G of address space) when running over my quite small toy project.

                  Meanwhile VIM is taking 10M of ram (more than twice as much as my first Linux workstation had!) and the whole install (split across packages vim, vim-runtime and vim-common) is about 30M.

                  gcc and g++ (version 6.3 which is what’s on my Debian stable) are each under 1M.

                  1. 1

                    gcc and g++ (version 6.3 which is what’s on my Debian stable) are each under 1M.

                    Is gcc really that lean, or are these executables launchers for other ones, or dynamically-linked with the rest of the compiler? My gcc executable is just 18KB in size, but gcc-9 install dir takes 290MB.

                    1. 5

                      GCC calls multiple executables; the main one is cc1 and squirrelled away in /usr/lib somewhere. It is 26 MB on my machine (gcc 8), plus a couple megs of shared libraries.

                      Clang is 30 MB, but includes the 60 MB libLLVM-7.so. Few people complain about C++ making bloated binaries though.

              2. 4

                Performance also enables new behaviour and new possibilities. BubRoss gives some very good examples—here are a few others.

                • Git made branching and merging so quick that, rather than development being trunk-based as was common in the Subversion days, developers create dozens of branches for new features, bug fixes, refactors, etc. Git’s speed changed the way we use VCS.
                • Gmail’s search is so fast that, even with gigabytes of emails, it’s faster and more convenient to search rather than organize emails in very structured folders.
                • Go’s compiler is fast enough that debug builds without optimizations are not a thing; you execute go run and in less than a second your program is running.
                • SQLite is small enough and fast enough that I’ve seen people use it as an ETL tool: load data in SQLite, use it to transform the data, then dump it out in CSV.

                As BubRoss said, if a program is slow and there are alternatives, people will use the alternatives. And if the fast alternative program enables new behaviours, then the old program will likely die out.

                1. 3

                  I agree with the point but half those examples are really bad. Google Maps beat Mapquest not because it was less bloated but because you could move the map in real-time without reloading the whole webpage. Chrome didn’t succeed where Firefox failed by just being faster than IE but also by buying bundling deals with manufacturers. Google Search didn’t win by being faster (most search engines were pretty fast back then) but by delivering relevant results.

                  When things aren’t equal users do sometimes choose bloated features over speed. For example iTerm performs much worse than Terminal but every developer uses it. VSCode is wildly popular despite much faster alternatives like Sublime Text. Etc.

                  1. 2

                    I’ve been countering the myth, too. I just think this person did a better job by using a lot of market-grabbing examples that won in this way.

                    1. 3

                      Agree vehemently. Most of my experience with people arguing about perf is them misquoting Knuth (“Optimization is the root of all evil in programming.”).

                      In reality, the process for building software should be:

                      • make it correct
                      • make it simple
                      • make it fast

                      Preferably follow that order, but you should perform every step.

                      1. 3

                        “Premature optimization is the root of all evil” is such a misunderstood quote. The full context really highlights what it’s about:

                        Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

                        The noncritical parts section is important: we shouldn’t spend time (initially) making fast the parts of the program that are seldom called, but we should spend the time necessary to make the core functionality fast. The small efficiencies bit is also important: we can let small efficiencies go by, but we probably shouldn’t let big inefficiencies go by.

                        1. 2

                          I get a lot of goodwill at my current job by doing that third step for our user-facing software. Most of what I do is trivial profiling and optimizations. A lot of goodwill.

                          1. 1

                            I’ll add the Richard Gabriel principle to get it 90% of the way to correct… works well enough… before correct. Maybe that, simple, and fast before correct.

                            1. 2

                              Ya - “correct” is a pretty overloaded word. Sometimes you can simplify, removing unnecessary bits, which allows you to get to “correct” faster.

                        2. 2

                          Personally I think categorizing crates would require a centralized authority to govern the labels. Instead, why not just add a dependency complexity analysis to cargo itself and have the report print out by default when somebody adds a new dep to a project?

                          1. 2

                            One recent case we saw a similar tradeoff was the observation that the unicase dep adds 50k to the binary size for pulldown-cmark.

                            This is certainly not the failure of library authors.

                            Providing such data, alongside timezone information and a few other “databases”, is the responsibility of the operating system.

                            Operating system vendors failing to ship with these things out of the box are the ones to blame.