1. 142
  1. 69

    So, I love Rust, and all of the nice things they say about Rust are true. Having said that, I’m now going to completely ignore the Rust angle and focus on something else that occurred to me.

    To summarize Discord’s problem: after extensively optimizing a Go service to produce minimal garbage, they found that the garbage collector still triggered every two minutes (which appears to be a hard minimum frequency) regardless of how little garbage was produced. Worse, each GC sweep was expensive because it had to walk the entire heap of live objects.

    The interesting point to me is that this use case (latency-sensitive service with extremely low rates of garbage production) was pathological for a tracing GC, and that optimizing it to produce less garbage made it even more so. Tracing collectors operate on every live object and ignore dead objects, so a heap full of live objects and very few dead ones is a bad fit for a tracing collector. They solved their problem by switching to a reference counting system (well, “reference counting” where everything has exactly one owner and so you don’t actually need to count). Reference counting ignores live objects and operates on dead ones, so of course it would be a better fit for this service. If Go had a collector based on reference counting they probably could have gotten much of the same benefit without rewriting.
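
    A minimal sketch of what that single-owner model buys you in practice (a generic example, not Discord’s code): the allocation is freed the moment its owner goes out of scope, so nothing ever has to walk the live heap.

      struct Session {
          buffer: Vec<u8>, // hypothetical stand-in for per-connection state
      }

      fn handle_request() {
          // `session` is the sole owner of its heap allocation.
          let session = Session { buffer: vec![0u8; 1024] };
          println!("handling with a {}-byte buffer", session.buffer.len());
      } // dropped here: freed immediately and deterministically, with no
        // tracing pass over the live heap and no count to maintain

      fn main() {
          handle_request();
      }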

    This reminded me of “A Unified Theory of Garbage Collection” by Bacon et al., but it hadn’t occurred to me before how optimizing the app to produce less garbage could make the GC’s job harder in some ways. It’s still better to reduce garbage production than not to, but it may not give as much benefit as one might expect because of this.

    1. 3

      They solved their problem by switching to a reference counting system (well, “reference counting” where everything has exactly one owner and so you don’t actually need to count). Reference counting ignores live objects and operates on dead ones, so of course it would be a better fit for this service.

      Aside from your wider point, it’s a little more subtle than that, because of abstractions like Rc, which give you a reference-counted handle to a value, meaning multiple references. There’s also Arc, an atomically reference-counted pointer for use across threads. The first simple Rust program I wrote, I was guided to using Rc, so it’s not even uncommon. Without seeing their code, I’m willing to bet there are plenty of such cases in it.
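
      A minimal sketch of the two, for anyone following along (a generic example, not from their code): Rc keeps a non-atomic count for sharing within one thread, Arc an atomic one for sharing across threads.

        use std::rc::Rc;
        use std::sync::Arc;
        use std::thread;

        fn main() {
            // Rc: shared ownership within a single thread; cloning bumps a
            // non-atomic counter, so it is cheap but can't cross threads.
            let config = Rc::new(String::from("single-threaded config"));
            let view = Rc::clone(&config);
            println!("{} (count = {})", view, Rc::strong_count(&config));

            // Arc: the same idea with an atomic counter, so clones can be
            // sent to other threads.
            let shared = Arc::new(vec![1, 2, 3]);
            let worker = {
                let shared = Arc::clone(&shared);
                thread::spawn(move || shared.iter().sum::<i32>())
            };
            println!("sum = {}", worker.join().unwrap());
        }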

      1. 6

        The first simple Rust program I wrote, I was guided to using Rc, so it’s not even uncommon.

        Do you mind sharing what you were trying to do? I’ve been writing Rust for a long time now, and I can count on one hand the number of times I’ve needed Rc. I’ve used Arc a fair number of times though. Still, I’d probably call both pretty uncommon. But there are certainly types of programs where they may be more common.

        1. 4

          I’m currently making a game engine in Rust (rewriting my old Idris stuff) and I use it all the time, from day one. Some of it may be due to the problem at hand necessitating it, but some of it is surely my lack of experience in Rust. I think some of the problems might be solved with a more refined use of lifetimes… but I’ve been burned by structs+lifetimes before, so I’d rather opt for something I have a better grasp of, even if it’s a less elegant solution.

          For example, my game has a Timeline object, which is basically the central source of truth about important game data (stuff that has to be saved). But it’s not a regular field, it’s an Rc, because I need to share it with Scene objects (which actually run the game logic). I could make a complex messaging system to telegraph the state changes between Server and multiple Scenes but again… I don’t really wanna.
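
          Roughly this shape, if it helps picture it (the types here are hypothetical, since I haven’t shown my actual code): the Timeline lives behind an Rc<RefCell<…>> and every Scene holds a clone of the handle.

            use std::cell::RefCell;
            use std::rc::Rc;

            // Hypothetical shapes, just to illustrate the pattern.
            struct Timeline {
                score: u32, // stand-in for "important game data"
            }

            struct Scene {
                timeline: Rc<RefCell<Timeline>>, // shared, mutable source of truth
            }

            impl Scene {
                fn tick(&self) {
                    // Runtime-checked mutation instead of threading a lifetime
                    // through every struct.
                    self.timeline.borrow_mut().score += 1;
                }
            }

            fn main() {
                let timeline = Rc::new(RefCell::new(Timeline { score: 0 }));
                let scenes: Vec<Scene> = (0..3)
                    .map(|_| Scene { timeline: Rc::clone(&timeline) })
                    .collect();
                for scene in &scenes {
                    scene.tick();
                }
                println!("score = {}", timeline.borrow().score);
            }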

          1. 2

            Yeah I’ve never made a game engine, so it’s hard for me to know whether Rc is truly beneficial there. In any case, I’m mostly just trying to push back against the notion that reference counting is common in Rust code. (And specifically, Rc.) I was just very curious about the “first simple Rust program” that someone wrote where they were guided towards using Rc.

            This is important because if reference counting were very common, then that would diminish the usefulness of the borrow checker. e.g., “What good is the borrow checker if you wind up needing to use reference counting so much?” Well, you don’t wind up needing to use reference counting a lot. There are of course many cases where reference counting is very useful, but that doesn’t mean it’s common among the entire body of Rust code.
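
            For contrast, the default pattern that covers the bulk of Rust code is plain ownership and borrowing, with no counting anywhere (a generic sketch, not from any code discussed here):

              struct Inventory {
                  items: Vec<String>,
              }

              // Read access borrows immutably; no shared ownership needed.
              fn render(inv: &Inventory) -> String {
                  inv.items.join(", ")
              }

              // Mutation borrows mutably, still without any Rc/Arc.
              fn add(inv: &mut Inventory, item: &str) {
                  inv.items.push(item.to_string());
              }

              fn main() {
                  let mut inv = Inventory { items: Vec::new() };
                  add(&mut inv, "sword");
                  add(&mut inv, "shield");
                  println!("{}", render(&inv));
              }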

          2. 1

            Just as an off-hand example from my experience: you basically can’t get anything done with GTK and Rust without Rc. Cf. https://github.com/bitemyapp/boxcar-willie/blob/master/src/main.rs#L108

            I wrote boxcar-willie with assistance from the gtk-rs people.

            Some common web-app stuff will force you into that too.

            There are other situations and libraries that force it, but these are the ones that come to mind from my own background. GUI apps and web apps already touch >80% of programmers.

            1. 2

              What part of web apps uses Rc in Rust? There is more nuance to this. A better metric might be “of all the types you define in your project, what proportion of them use reference counting?” If you have one reference-counted type among dozens of non-reference-counted types, then I’d say it’s pretty uncommon. For example: most web apps have a database handle, and that database handle uses an Arc to be efficiently shared between multiple threads simultaneously. Since database handles are pretty common in web apps, would you then conclude that “reference counting is common” in Rust? I don’t think I would, because it doesn’t pervade and infect everything else in your code. There’s still going to be a lot of other stuff that doesn’t use reference counting at all.
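
              Concretely, the Arc in that scenario stays at the edges; something like this sketch (DbPool is made up, and real frameworks and drivers differ, many of them already wrapping their handles in an Arc internally):

                use std::sync::Arc;
                use std::thread;

                // Hypothetical stand-in for a connection pool / database handle.
                struct DbPool {
                    url: String,
                }

                impl DbPool {
                    fn query(&self, sql: &str) -> String {
                        format!("[{}] {}", self.url, sql)
                    }
                }

                fn main() {
                    let pool = Arc::new(DbPool { url: "postgres://localhost".into() });

                    // One Arc clone per worker; everything else those workers
                    // allocate is still plain owned data with no reference counting.
                    let workers: Vec<_> = (0..4)
                        .map(|i| {
                            let pool = Arc::clone(&pool);
                            thread::spawn(move || pool.query(&format!("SELECT {i}")))
                        })
                        .collect();

                    for worker in workers {
                        println!("{}", worker.join().unwrap());
                    }
                }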

              The GTK case is indeed well known, and was on my mind when writing the above comments. But it’s not clear to me whether this is a GTK problem or whether it generalizes to “GUI apps.”

              1. 1

                Well usually it’d be an Arc, particularly in cases where the framework doesn’t provide a way to share data between request handlers.

                I was just proffering where I’d run into it. I’m not trying to make some kind of polemical point. I rather like using Rust.

                that database handle uses an Arc to be efficiently shared between multiple threads simultaneously. Since database handles are pretty common in web apps, would you then conclude that “reference counting is common” in Rust?

                I’m speaking to people’s subjective experience of it and how they’re going to react to your characterization of it as rare. We’re not taking a pointer head-count here. If you get someone comfortable with programming but new to Rust and have them spin up a few typical projects, they’re going to say, “but I kept running into situations where I needed ${X}”, and it doesn’t feel rare because it occurred at least a couple of times per project. I’m personally very comfortable and happy with the performance implications of a handful of reference-counted pointers and everything else being automatically allocated/de-allocated on the stack or heap. That being said, if you use wording like the above, you’re going to keep eliciting this reaction unless you qualify the statement.

                Edited follow-up thought: I think part of what’s needed here perhaps is an education push about Rc/Arc, their frequency in Rust programs, when and why it’s okay, and how it isn’t going to ruin the performance of your program if a few are floating around.

                1. 2

                  My initial reaction was to the use of Rc. If they had said Arc, I probably would not have responded at all.

                  1. 1

                    I apologize for communicating and interpreting imprecisely. I mentally glob them together.

                    I think GTK is in fact the only time I’ve really used Rc. Everything else has been Arc I’m pretty sure!

        2. 1

          The interesting point to me is that this use case (latency-sensitive service with extremely low rates of garbage production) was pathological for a tracing GC, and that optimizing it to produce less garbage made it even more so. Tracing collectors operate on every live object and ignore dead objects, so a heap full of live objects and very few dead ones is a bad fit for a tracing collector.

          I wouldn’t draw a general conclusion from the behavior of the Go garbage collector.

          Optimizing a system to produce less garbage is a standard optimization technique for the JVM and .NET.
          It is effective on these platforms because they both use generational garbage collectors.
          Long-lived or large objects are traced rarely or not at all.
          The LMAX Disruptor, for example, allocates all memory at startup.

          This technique isn’t effective for Go because the Go garbage collector forces a collection every 2 minutes.

          Go uses a non-moving, non-generational GC.
          Here are some of the tradeoffs of this design:

          • Non-moving - objects are never moved in memory and the heap is not compacted.
            Interoperability with C is easier, but heap fragmentation is a potential risk.
            Memory usage can be lower than with a copying generational GC, since no space is reserved for evacuating objects.
          • Non-generational - generational garbage collectors can scan only recently allocated objects while ignoring older and larger ones. Most objects die young, so this is often beneficial.
          • Pause times may be lower than with a generational GC.
          • Lower implementation complexity.

          Neither design is right or wrong, but I would be leery of using Go for a high-performance system.

        3. 41

          This is not a fair comparison. Go 1.9.2 was released over two years ago, and in that time they have fixed a lot of the GC stutter issues. Comparing Rust nightly to a two-year-old compiler is unfair.

          1. 17

            I am very confused as to why they don’t mention why they didn’t upgrade. The Go compiler is intentionally (and will probably always be) backwards compatible. At Google we dogfood the new compiler throughout the fleet before releasing it to the public. I have never seen a rollback. The GC has consistently gotten faster from version to version.

            I’m not saying what Discord did was wrong (I simply don’t know enough about the underpinnings of GC) but that they didn’t address such obvious low hanging fruit is strange.

            1. 37

              From /u/DiscordJesse on reddit:

              We tried upgrading a few times. 1.8, 1.9, and 1.10. None of it helped. We made this change in May 2019. Just getting around to the blog post now since we’ve been busy.

              https://www.reddit.com/r/programming/comments/eyuebc/why_discord_is_switching_from_go_to_rust/fgjsjxd/

              1. 27

                It’s worth noting that go1.12 was released before May 2019, and the release notes include lines like this:

                Go 1.12 significantly improves the performance of sweeping when a large fraction of the heap remains live. This reduces allocation latency immediately following a garbage collection.

                I believe that directly addresses what they were observing.

                1. 14

                  Seems like the release didn’t come in time for their needs, considering they probably started the work on the Rust port after trying 1.10.

                  1. 5

                    Sure, but the Rust version wasn’t complete or deployed for multiple more months, implying they had that long to upgrade to go1.12, let alone go1.11 or the betas and release candidates. I can count on one finger the number of times upgrading a version of Go caused any problems for me, and that one time it was because I had a bunch of my own cgo bugs to fix. So it’s hard for me to imagine a compiler upgrade being anything harder than a normal deploy of a change, of which they claim they did many (tuning, etc.). Deciding not to do that because you were already underway on a Rust port is just a sunk cost fallacy, and it’s not like it’s much work or unreasonable to expect people with production Go services to keep up with the two major releases they do per year.

                    That said, I’m operating on incomplete information. I’d like to give the benefit of the doubt so I expect they made a good decision here. They seem happy with the outcome and the end result is better for them on a number of different metrics, so that’s great.

                    It’s just unfortunate that we’ll probably never know if go1.12 would have solved their latency issues for not only them, but perhaps many others. A program reproducibly showing failures (and latency spikes like that are a failure for Go’s GC) is a valuable thing.

                    1. 8

                      It’s likely that they weren’t aware of the fixes, and if they were, the rewrite might have solved other pain points and simplified the code in other ways. The issues they faced with the GC would then just be a catalyst for the rewrite. Not the cause itself.

                  2. 5

                    So in 2016, Pusher switched from Haskell to Go because Haskell had poor GC latency due to a large working set.

                    The fundamental problem was that GHC’s pause times were proportional to the size of the working set (that is, the number of objects in memory). In our case, we have many objects in memory, which led to pause times of hundreds of milliseconds. This is a problem with any GC that blocks the program while it completes a collection.

                    Enter 2020, Discord switches from Go to Rust because of poor GC latency due to a large working set.

                    We kept digging and learned the spikes were huge not because of a massive amount of ready-to-free memory, but because the garbage collector needed to scan the entire LRU cache in order to determine if the memory was truly free from references.

                    So, I guess it’s only a matter of time before Pusher rewrite their service in Rust?

                    1. 2

                      And with the recent improvements to garbage collection in GHC, back to Haskell after that?

                      http://www.well-typed.com/blog/2019/10/nonmoving-gc-merge/

            2. 9

              Fascinating read! While I’m glad they figured out a way to get what they wanted from their system, this reminds me of a 2017 article from Segment (https://segment.com/blog/allocation-efficiency-in-high-performance-go-services/). Where GC pauses forced Discord to re-evaluate, for Segment it was a large volume of heap allocations. The two are related nevertheless.

              However, as part of an escape analysis study, they found pointer semantics they didn’t actually need, and also switched to a standard library API that takes a byte slice rather than allocating and returning one. Ultimately these are GC-avoidance tricks: deliberately keeping the heap as small as possible in steady state.

              I’d love to see a similar analysis by Discord, although it might be a fool’s errand in the long run, especially if their service is changing rapidly, which would force them to repeat the exercise frequently.

              1. 7

                In some languages, you can turn the GC off in the fast path or tell it when to collect garbage. That might have solved this problem. Does Go lack that feature?

                1. 3

                  You can’t disable it, I think, but you can tell Go to run a GC (with runtime.GC()). If you don’t create a lot of garbage, and find an opportune time to collect garbage not too infrequently, you shouldn’t see unwanted GCs.

                  However, in a server situation, when do you have the opportunity to pause to collect garbage? It’s not like you can just say, “collect if the user hasn’t interacted for some seconds”; the service is probably always busy.

                  1. 5

                    Literally the first result for “golang disable gc” on either DDG or Google indicates that turning the GC off is as simple as setting an environment variable, and that it can be manually invoked.

                    I’d wager writing a post based on a guess took longer than finding out would have.

                    1. 3

                      Oh, that’s very interesting. Running with GOGC=off, manually monitoring the process’ RSS and running a GC only when it gets too high would probably have almost solved Discord’s problems then.

                      Still, writing services in Rust (or even C++) just to not have to bother with memory management minutiae seems like a good idea for services with high performance requirements.

                      1. 3

                        Rust and C++ don’t absolve you from dealing with memory management minutiae; they obligate it. This might make sense if your application has some critical mass of code that requires careful, explicit memory management, but I would put that threshold somewhere above 90%. The more compelling reason to use Rust or C++ is that you need a mature compiler with lots of optimizations, not the memory management considerations. Of course, other requirements (e.g., safety, an obscure target architecture, etc.) might also prompt you to choose Rust over Go.

                    2. 3

                      In a server situation, there’s a few things you might do:

                      1. No GC at all in the hottest path. Do that one manually.

                      2. Memory pools or reference counting if they have less performance impact.

                      3. Activate GC after the hot path of the service call.

                      4. If there’s ups and downs, do the GC’s on the downs. A load balancer or monitor might even tell it to.

                      5. If load balancing w/ multiple nodes, create enough downtime periodically to force a GC.

                      6. If garbage builds slowly, then don’t activate the GC at all. Use a cluster setup, eventually fail-over to another node, and restart the one full of garbage. That flushes out lots of problems for reliability, too.

                      1. 3

                        That last one actually sounds eminently sensible if your application really has that characteristic.

                        Or combine 5 and 6: For each node, when the garbage gets over a threshold, set its health to ‘unhealthy’, wait until connections from the load balancer drain down, run a GC, set health back to healthy.

                        The other thing which comes to mind is they have a massive LRU cache full of teeny counters: perhaps this would be better as an array of bytes instead of an array of garbage-collected number thingies. Or maybe I’m just thinking like a C programmer still :-P

                  2. 7

                    After digging through the Go source code, we learned that Go will force a garbage collection run every 2 minutes at minimum. In other words, if garbage collection has not run for 2 minutes, regardless of heap growth, go will still force a garbage collection. We figured we could tune the garbage collector to happen more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector GC Percent on the fly. Unfortunately, no matter how we configured the GC percent nothing changed. How could that be? It turns out, it was because we were not allocating memory quickly enough for it to force garbage collection to happen more often.

                    As someone that’s not very familiar with GC design, this seems like an absurd hack. That this 2-minute hardcoded limitation is not even configurable comes across as amateurish even. I have no experience with Go – do people simply live with this and not talk about it?

                    1. 12

                      As someone who used to work on the Go team (check my hats… on the Cloud SDK, not on language/compiler), I would say that:

                      1. It is a mistake to believe that anything related to the garbage collector is a hack. The people I met who worked on it were far smarter than I and often had conversations that went so far over my head I may as well have walked out the room for all I could contribute. They have been working on it a very long time (see the improvements in speed version over version). If it works a particular way, it is by design, not by hack. If it didn’t meet the design needs of Discord’s use case, then maybe that is something that could be worked on (or maybe a later version of Go would have actually fixed it anyway).
                      2. Not providing knobs for most things is a Go design decision, as mentioned by @sanxiyn. This is true for the whole language. I have generally found that Go’s design is akin to “here is a knife that’s just about sharp enough to cut your dinner, but you’ll find it fairly difficult to cut yourself”. When I worked with Java, fiddling with garbage collection was just as likely (if not more likely) to make things worse than to make them better. Additionally, the more knobs you provide across the language, the harder it is to make things better automagically. I often tell people to write simpler Go that’s algorithmically a little slower rather than complex Go that’s algorithmically a little faster, because the compiler can probably optimize the simpler code. I would guess this also pertains to GC, but I don’t know anything about the underpinnings.
                      1. 7

                        One of the explicit design goals of Go’s GC is not to have configurable parameters. Their self-imposed limit is two. See https://blog.golang.org/ismmkeynote.

                        Frankly I think it is a strange design goal, but it’s not amateurism. It’s a pretty good implementation if you accept the same design goals. It’s just that the design goals are plain weird.

                        1. 14

                          I have no experience with Go – do people simply live with this and not talk about it?

                          My general impression is that tonnes of stuff about Go is basically “things from the 70s that Rob Pike likes”. Couple that with a closed language design team…

                          1. 2

                            It is configurable, though. You can set an environment variable to disable the GC and then run it manually, or you can just compile your own Go with a bigger minimum interval.

                            Either would be a lot less work than rewriting a whole server in Rust, but maybe a rewrite was a good idea anyway for other reasons.

                            1. 3

                              or you can just compile your own Go with a bigger minimum interval.

                              I’m not sure “rewrite code to change constants then recompile” counts as “configurable”, nowadays.

                          2. 4

                            Maybe an existing or future version of Go would have fixed or mitigated these problems, like some folks are saying here, but the fact that a straightforward Rust implementation performed as well (better in some crucial respects) as the optimised Go implementation seems noteworthy. If you can get the same or better performance, but from simpler code (which doesn’t require you to do non-obvious things just to appease the runtime) that seems like a significant win to me.

                            1. 5

                              People should start considering Nim seriously when they face problems like this one. It offers the performance and memory-handling flexibility without being as difficult as Rust. They’ve gone past 1.0 quite recently.

                              1. 4

                                I’m a big fan of Nim as well, but given that Nim is garbage collected, this seems like a questionable suggestion unless you know it doesn’t suffer the same problems with large working sets as Go does/did. Do you have reason to believe Nim’s garbage collector would behave better than Go’s here?

                                1. 4

                                  Mostly I believe this because Nim allows you to choose the GC at compile time: https://nim-lang.org/docs/gc.html and because of the things I’ve seen about its latencies, e.g. https://gist.github.com/dom96/77b32e36b62377b2e7cadf09575b8883

                                  Also the new “ARC” memory handler seems quite promising too, and something that could give quite easy improvements where memory pressure is a thing. https://forum.nim-lang.org/t/5734

                                  It’s possible of course that none of this is at all relevant to Discord’s case, but Nim seems so promising on paper and so unknown generally that I think there’s some value to bringing it up in cases like this one.

                              2. 3

                                Changing to a BTreeMap instead of a HashMap in the LRU cache to optimize memory usage.

                                Why would a BTreeMap use less memory in Rust?

                                1. 3

                                  Hash maps usually keep a lot of empty slots for performance (to hold the load factor down), while afaik a B-tree just allocates the nodes it needs, right?

                                  1. 1

                                    I think it’s more about access patterns than just size. In a hash map the elements are essentially randomly jumbled about, whereas in a BTreeMap they can be sorted by time. The oldest elements will be closer together in memory and cheap to find (a short walk down the leftmost edge of the tree), which is good news for an LRU cache.
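
                                    Something like this sketch, assuming the keys are “last used” ticks (a toy example, not Discord’s cache): eviction is just a walk to the smallest key.

                                      use std::collections::BTreeMap;

                                      fn main() {
                                          // Keyed by a monotonically increasing "last used" tick, so the
                                          // least recently used entry always sits at the smallest key.
                                          let mut lru: BTreeMap<u64, &str> = BTreeMap::new();
                                          lru.insert(1, "alice");
                                          lru.insert(2, "bob");
                                          lru.insert(3, "carol");

                                          // Evict the oldest entry. (On recent Rust, BTreeMap::pop_first
                                          // does this in one call.)
                                          let oldest = lru.keys().next().copied();
                                          if let Some(key) = oldest {
                                              let evicted = lru.remove(&key);
                                              println!("evicted {:?}", evicted);
                                          }
                                      }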

                                    1. 2

                                      They didn’t say BTreeMap is faster, they said BTreeMap uses less memory. In fact, I very much doubt it was faster.

                                      1. 0

                                        I have no trouble believing it was faster for the workload. After all, that was the point of the article.

                                  2. 4

                                    I wonder if the Discord engineers tried profiling via Go’s native tooling too. I’m not saying it would have solved their problems but in my experience you are able to drill down further into individual hot spots. Granted, Go chooses to only give you a couple of knobs to tune garbage collection. It was an insightful article nonetheless!

                                    1. 4

                                      It’s amazing how often rewrites for fun happen in our industry. If this were any other industry than programming, people would be shocked.

                                      1. 30

                                        Reading the article, it does look like they hit some performance issue, due to how the Go GC works, and they tried Rust because it was already seeing adoption at Discord. I would not call that “just for fun”.

                                        On the other hand, a lot of things change (improve?) in our industry because we hack stuff just for fun. :)

                                        1. 7

                                          This happens in other industries too. YouTube is full of metalworkers/MechEs, EEs, chemists, woodworkers, and so on hacking their tools, twiddling with silly ideas just for fun, breaking stuff, trying out old ideas in new ways or vice versa, and so on. The only thing special about software is that you can do it sitting in front of a computer instead of needing a whole workshop or lab.

                                          1. 4

                                            It’s even codified: the car industry has time frames in which a whole car model is essentially completely replaced.

                                          2. 3

                                            Even though rather expensive, doing a rewrite is usually not as bad in our industry as it would be in many others. Imagine replacing a bridge with a new one just because you got slightly better steel now than last year.

                                            1. 11

                                              Imagine building a new bridge because there was more traffic than would fit on the bridge you already have, and deciding you might as well use the new steel while you’re at it.

                                            2. 1

                                              What’s the equivalent of rewrites in other industries?

                                              1. 5

                                                Remodeling the kitchen.

                                                1. 1

                                                  My own perspective: if software engineering is like writing then it’s a rewrite or a reimagining. If it’s math then it’s something like string theory or where a trail leads nowhere so you go back and start over from a new fundamental. I think software is mostly writing with a bit of the math in my analogies. I don’t think it’s like physical construction but sometimes I use those analogies.

                                              2. 2

                                                Having spent much of the past couple of days staring at rust compiler output, this is a reminder of how terrible codegen can be and still be “faster” than the alternative.

                                                1. 2

                                                  Are you looking at the LLVM IR generated by the rust compiler or the machine code generated by LLVM? I’d expect the IR to look bad but the machine code to look pretty good (assuming reasonable optimization levels); if the machine code looks terrible after LLVM with optimizations enabled, that’s very interesting.

                                                  Do you happen to have some examples of terrible generated code?

                                                  1. 3

                                                    I was looking at x86_64 assembly and then at LLVM IR to try to understand why the asm wasn’t great.

                                                    In our code there’s a path that comes down to copying bytes from an iterator into a Vec. The code in question is complex, but basically it seems that writing into a mut Vec element by element never triggers LLVM’s memcpy detector.

                                                    So this (where inp: &Vec<u8>) is fast:

                                                      let outp: Vec<u8> = inp.iter().copied().collect();
                                                    

                                                    and this is not:

                                                        let mut outp: Vec<u8> = Vec::with_capacity(inp.len());
                                                        outp.resize(inp.len(), 0);
                                                    
                                                        for i in 0..inp.len() {
                                                            outp[i] = inp[i]
                                                        }
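
                                                    For this particular shape, the slice-copy APIs sidestep the issue, since they’re implemented with ptr::copy_nonoverlapping rather than relying on LLVM recognizing the loop (a sketch against the same inp: &Vec<u8>; it should lower to a straight memcpy, though I haven’t checked the asm of these exact lines):

                                                      // Either of these avoids the element-by-element indexing loop:
                                                      let outp: Vec<u8> = inp.to_vec();

                                                      let mut outp2: Vec<u8> = Vec::with_capacity(inp.len());
                                                      outp2.extend_from_slice(inp);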
                                                    
                                                2. 0

                                                  TL;DR: “uh-oh, a latency-sensitive service shouldn’t be written in a garbage-collected language”.

                                                  1. 15

                                                    Perhaps depends on the language though - Erlang is used in what I assume to be latency-sensitive applications like telephone switches, and it’s GC’d.

                                                    1. 1

                                                      Erlang is a special case with per-process tiny heaps that are collected independently of one another. I’m not aware of another system that uses such a model.

                                                      There are fully concurrent GCs in proprietary and experimental JVMs (Azul’s C4 and Shenandoah) that exhibit remarkably low pause times even with large heaps. They are very much the exception rather than the rule and, to my knowledge, come with a throughput hit compared to standard VMs or manual memory management.

                                                    2. 3

                                                      Oh, that explains why no one writes such services in Java or JavaScript or C# or PHP or Perl or Python or Ruby or Erlang or LISP or Haskell … only C and C++. And Pascal.