1. 30

  2. 9

    This is very light on the technical details, here’s the source article: https://ieeexplore.ieee.org/document/8695831

    We then demonstrate an end-to-end RTL prototype, integrated into a RocketChip RISC-V System-on-Chip (SoC) executing full Java benchmarks within JikesRVM running under Linux on FPGAs. Our prototype performs the mark phase of a tracing GC at 4.2× the performance of an in-order CPU, at just 18.5% the area.

    Full article: https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-hwgc.pdf

    1. 5

      A big part of the performance cost of GC is simply that it makes it feasible to be lazy about memory. This laziness creeps into language designs: Java assumes boxed objects everywhere, even when they don’t need to be, and the GC will clean things up whenever it gets around to it. C# is better but still has this problem. Functional languages tend to create piles of simple linked lists everywhere by default instead of looking for more efficient alternatives. And so on. GC makes memory allocation trivial from the programmer’s perspective, and solves a LOT of hard problems from the language designer’s perspective, so you just allocate all the memory you want.

      This is part of the legacy of GC’ed languages, because the first mainstream GC’ed language that claimed high performance was Java, and Java was designed in 1995. There were many things we didn’t know about where software development would go and what it would look like when it got there in 1995. Let alone advances in compiler and VM technology. Now we have massive distributed systems in a polyglot of languages talking to each other all over the world, and its much easier to point at certain aspects of a language or VM design and say “we COULD do that but it seems like a bad idea” or “this is a bit of a hack but it works out okay in practice”. But a whole generation of languages are based more or less on the all-the-boxed-objects-we-could-wish-for model of Java (and its own ancestors, such as Smalltalk) and have the same costs.

      1. 5

        One thing I wondered recently is why the “object inlining” optimization isn’t more popular? It seems like the natural complement to escape analysis, which is basically “inlining” objects on the stack. This seems like it would reduce GC pressure by a lot?

        In Go programs, escape analysis is supposed to help reduce GC pressure. As far as I remember they don’t implement generational GC because the stack is sort of like the “young” generation.

        Java and many similar languages lack value types, so the programmer can’t express inlining in the source code, like you can in C / C++.

        There is a 2012 message here that says that Hotspot doesn’t implement object inlining. I know very little about the JVM but I would expect it there if anywhere, simply because Java apps are known to have a lot of garbage.

        I haven’t dove too deeply into this, but it seems like there are not that many modern references to it.




        Anyway I guess the point is that object inlining would ameliorate the effects of certain programming styles creating lots of garbage.

        1. 4

          The JVM does bump allocation, so objects allocated together tend to sit together. My guess is that this somewhat alleviates the benefits of object inlining since nested objects are often on the same L1 or L2 cache page together.

          1. 2

            Yeah I’d buy that for locality, but there’s also the size of the pointers themselves that can be saved. Suppose you have something like:

            struct Landmark {
              String name;  // pointer and length, 12 bytes rounded up to 16
              Point x;
            struct Point { int x; int y; int z; int w; }  // 16 bytes total

            In C or C++, the Landmark object is 32 bytes total. In Java you would have 16 bytes alone for 2 pointers to String and Point, and then another 12+16 = 28 bytes for the data.

            So that’s nearly 50% overhead, which will have a big effect on cache. I made this example up, but I think real Java code is often written like this. It’s idiomatic to compose such small objects rather than packing them for size.

            For example, an ORM probably generates a lot of classes that could be much more compact with object inlining? I don’t have that much experience with Java but I have a hard time seeing how object inlining wouldn’t be a win, considering all the other crazy things that JVM does. Maybe it’s really hard to implement? Or it maybe has something to do with separate compilation?

            Yeah I think my best guess would be that the compilation model makes it hard to bolt on after the fact. The language would have to be designed with it to begin with. inline functions in C headers have similar build-time annoyances.

            edit: There is also the bookkeeping per object you save, i.e. the mark bits. Even if you have cache locality, having fewer objects and thus fewer mark bits still seems like a win in both time and space.

          2. 1

            I assume because it’s a lot easier to reason about the language when everything is a pointer. Being able to say at compile time “everything is either small or a pointer” makes life easy; OCaml does that. Being able to say “I know based on an object’s type whether it is always a pointer or never a pointer” is almost as easy; C# does that. Rust allows all types to be either and so jumps through a lot of hoops because objects can and often are inlined: you end up with non-Sized types, Copy types, trait objects and the limitations on them, etc. From what little I understand of C++‘s memory model, it grapples with a lot of similar problems in terms of copy constructors and whatever else it has. Life becomes even more exciting when, like Rust or C++ but unlike C# or OCaml, you can have a pointer (or reference or whatever) pointing at a sub-member of a struct. I’m sure it’s been done somewhere, but I’ve never seen a GC’ed language that lets you do that.

            I need to learn how Zig deals with such things.

            1. 1

              What you say applies to having value types in a language, but not the object inlining compiler optimization. With object inlining, nothing is changed about the semantics of the language. Everything happens behind the scenes.

              It’s a way to get the benefits of value types – compact physical layout of C structs, which is both smaller and has better locality – without value types in the language. So it only applies to some languages like Java, where everything is a reference.

              It’s just like escape analysis to put objects on the stack, which JVMs and Go do (and I’m sure many more languages.)

              FWIW I printed out the paper and skimmed it. It’s not trivial to do this analysis, and they don’t mention modules, which I imagine are another complication in a real world. There is some interaction with Java’s clone() I believe. That would make sense if you want to exactly preserve the semantics of the language – it might break programs if you clone the fields inlined by the compiler.

              On the other hand, the experimental results seem really good – eliminating 58% of object allocations in a substantial xpdf program. Double digit percentages of fields can be inlined.

              I agree value types make a language a lot more complicated, both for the user and implementer. I never really thought much of it while writing C or C++, but after using other languages I realize that it’s indeed a significant complication. You’re sort of conflating the physical and the logical, whereas I think the holy grail is to get everything logically right and lay it out later. I have mentioned Scala’s late data layout optimization work here before. That’s related to object inlining and in some of the citing work.


          3. 2

            I think there’s low-hanging-fruit for tooling to help programmers with this, to spot patterns in the code which could potentially be reworked into a more efficient (or easier to optimise) form (essentially a form of linting).

            The problem with the “sufficiently smart compiler” idea is that it must work correctly, reliably and reasonably-predictably in all cases. That makes speculating about optimisations more difficult. Tooling is less constrained: its suggestions can be ignored, and it doesn’t even need to suggest a solution (e.g. “these fields often appear together, maybe they should be combined somehow?”).

          4. 1

            This seems like a blurb that reiterates the myth that garbage collected languages are inherently bad for performance, wrapped around a paper that reinvents hardware-assisted garbage collection, an idea that existed since ’80s.

            1. 16

              The fact that an idea has existed since the ’80s should not be grounds for dismissal.

              From the article, regarding prior work:

              While the idea of a hardware-assisted GC is not new [5]–[8], none of these approaches have been widely adopted. We believe there are three reasons:

              1. Moore’s Law: Most work on hardware-assisted GC was done in the 1990s and 2000s when Moore’s Law meant that next-generation general-purpose processors would typically outperform specialized chips for lan- guages such as Java [9], even on the workloads they were designed for. This gave a substantial edge to non-specialized processors. However, with the end of Moore’s Law, there is now a renewed interest in accelerators for common workloads.
              2. Server Setting: The workloads that would have benefitted the most from hardware-assisted garbage collection were server workloads with large heaps. These workloads typically run in data centers, which are cost-sensitive and, for a long time, were built from commodity components. This approach is changing, with an increasing amount of custom hardware in data centers, including custom silicon (such as Google’s Tensor Processing Unit [10]).
              3. Invasiveness: Most hardware-assisted GC designs were invasive and required re-architecting the memory system [8], [11]. However, modern accelerators (such as those in mobile SoCs) are typically integrated as memory-mapped devices, without invasive changes.
              1. 3

                You’re right, I didn’t mean to dismiss the paper, just the article that makes it sound like a novel idea. On the contrary, I’m glad that HWAGC is getting attention at this age.

              2. 3

                Is it a myth? There seems to be a trend here that the non-garbage-collected languages are faster: https://benchmarksgame-team.pages.debian.net/benchmarksgame/which-programs-are-fast.html

                1. 5

                  It’s got a bit of truth in it, in that most garbage collectors commonly used are mediocre at best and our modern processors are optimised towards dynamic allocation.

                  jwz has a good summary of the GC ordeal.

                  1. 3

                    I don’t understand your citation. I have read that article, but what does it have to do with dynamic allocation?

                    Although I generally like jwz’s writing, I’m not convinced by the second citation either, because it’s over 20 years old and basically saying that at that time the only good GCs were in Lisp implementations. Something that surveys the current state of the art would be more convincing.

                    Personally I think GC’s are invaluable, but you can always do better by applying some application-specific knowledge to the problem of deallocation. There are also many common applications and common styles of programming that create large amounts of (arguably unnecessary) garbage.

                  2. 8

                    Garbage collected environments can—and often do—have higher overall throughout than manually managed memory environments. It’s the classic batch processing trade off, if you do a lot of work at once it’s more efficient, but it won’t have the best latency. In memory management terms, that means memory won’t be freed right away. So GCs need some memory overhead to continue allocating before they perform the next batch free.

                    This is one of the reasons iPhones circa 2014 ran smoothly with 1GB RAM, but flagship Androids had 2-3GB. Nowadays both have 4GB, as the RAM needed to process input from high resolution cameras dwarfs the memory concerns of most applications on either phone.

                    Note the latency from batching frees (i.e. garbage collection) doesn’t refer to GC pause time. So called “stop the world” collection phases exist only in mediocre GC systems, like Java’s, because they’re relatively easy to implement. ZGC looks promising, 10+ year old GC technology is finally making it to the JVM!

                    C/C++ also make it easier to control memory layout and locality, which massively improves performance in some cases. But as jwz points out in the post linked by @erkin, GC languages could provide similar tools to control allocations. No mainstream GC language does, probably since the industry already turns to C/C++ for that by default. Unity comes close, they made a compiler for a C# subset they call HPC#, which allows allocation and vectorization control. And I think D has some features for doing this kind of thing, but I wouldn’t call D mainstream.

                    1. 5

                      Chicken Scheme, uses stack for the ‘minor’ garbage collection.

                      https://www.more-magic.net/posts/internals-gc.html “… This minor GC is where the novelty lies: objects are allocated on the stack. … “

                      Stack memory, by itself is typically a contiguous block (an well optimized part of virtual memory). I think in general, the trend in GC is to have ‘categories’ of memory allocation models, where the optimizations is done per category. The determination of categories is done at compile time, with a possibility of a memory allocation to graduate (or move) from one type/category to another (not using ‘categories’ in algebraic sense here).

                      1. 1

                        I’m not sure I buy that argument, because GC inherently does more work than manual memory management. It has to walk the entire heap (or parts of it for generational GC), and that’s expensive. You don’t need to do that when manually managing memory. Huge heaps are still an “unsolved problem” in GC world, but they aren’t in manual world.

                        I’m a fan of GC, but I think the problem is that you can only compare the two strategies with identical/similar big applications, because that’s where GC really shines. But such pairs of applications don’t exist AFAICT.

                        1. 4

                          It has to walk the entire heap

                          No. There are many more interesting strategies than simple generational heaps.

                          You don’t need to do that when manually managing memory

                          Instead you just call malloc and free all the time, functions that lock, can contend with other threads, and needs to manipulate data structures. Compare to alloc in a GC world: atomic increment of arena pointer.

                          Huge heaps are still an “unsolved problem” in GC world

                          Azul’s GC manages hundreds of GBs without any global pauses. Huge enough for you? ZGC uses similar strategies, copying many of those ideas that have been around since 2008 (to my knowledge, likely earlier).

                          Unimplemented in common open source languages != unsolved.

                          But such pairs of applications don’t exist AFAICT

                          Compare IntelliJ and Qt Creator. Only it’s not a fair comparison, since IntelliJ runs on Oracle / OpenJDK Java, and that JVM still doesn’t have any good garbage collectors (ZGC is still in development / who knows if it will ever actually ship).

                          P.S. I love all the work you’re doing on Oil, and your blog posts are fantastic!

                          1. 4

                            Eh your reply reads very defensively, like I’m attacking GC. In fact most of my posts on this thread are arguing for GC for lots of applications, and I plan to use it for Oil. (Thanks for the compliment on the blog posts. I’ve been on a hiatus trying to polish it up for general use!)

                            I’m just pointing out that GC has to do work proportional to every objects on the heap. What are the other strategies that don’t need to do this? Whether the GC is generational, incremental, or both, you still have to periodically do some work for every single heap object. I think the article points out that incremental GC increases the overhead in exchange for reducing pauses, and I saw a recent retrospective on the new Lua GC where they ran into that same issue in practice.

                            (In Go, I think it’s a bit worse than that, because Go allows pointers to the middle of objects. Although Go also has escape analysis, so it’s probably a good engineering tradeoff, like many things Go.)

                            I also think you’re comparing the most naive manual memory management with the most sophisticated GC. Allocators in the 90’s weren’t very aware of threads but now they certainly are (tcmalloc, jemalloc, etc.)

                            Many (most?) large C applications use some kind of arena allocation for at least some of their data structures. For example, the Python interpreter does for it’s parse tree. It’s not just malloc/free everywhere, which is indeed slow.

                            Anyway, my point is that the batch processing argument tells some of the story, but is missing big parts of it. It’s not convincing to me in general, despite my own preference for GC.

                            Unsurprisingly, the comparison is very complex and very application-specific. It’s hard to make general arguments without respect to an application, but if you want to, you can also make them against GC because it fundamentally does more work!

                            1. 5

                              I don’t think you’re attacking GC, but I think you’re wrong about performance. Comparing jemalloc to G1GC, and similar generational collectors in common dynamic languages, you’re spot on. Generational collection shines in web application servers, which allocate loads of young garbage and throw it all away at the end of a request. But for long lived applications like IntelliJ it’s not so hot. Last I checked (which was a while ago) IntelliJ’s launcher forced the concurrent mark and sweep GC. Stuff like Cassandra and Spark are really poorly served by either of those GCs, since neither prevents long pauses when the heap is really large.

                              Batching pretty much does cover the argument. Assume you have a page of memory with dozens of small object allocations. Which is faster, individually calling free on 90% of them, or doing a GC scan that wastes work on 10% of them? As long as you can densely pack allocations, GC does very well.

                              Arenas are certainly a big win, but in many ways a page in the young generation is like an arena. Yes, manual memory management will always win if you perform all of your allocations in arenas, freeing them in groups as the application allows. Commercial databases do so as much as possible. gRPC supports per-request arenas for request and response protos, a clever answer to the young generation advantage for short stateless request / response cycles.

                              I might be wrong, but I don’t think you’re considering that GCs can use MMU trickery just as much as anyone else. Suppose instead of periodically scanning the old generation, you occasionally set 2mb hugeages of old objects as no read no write. If it’s accessed, catch the segfault, enable the page again, and restart the access. If it’s not accessed after a while, scan it for GC. Having access statistics gives you a lot of extra information.

                              Now imagine that instead of the random musings of some internet commenters, you have really clever people working full time on doing this sort of thing. There’s a whole world of complex strategies enabled by taking advantage of the memory virtualization meant to emulate a PDP-11 for naive programs. Normal incremental collection has more overhead because the GC has to redo more work. ZGC avoids lots of this because it maps 4 separate extents of virtual memory over the same single contiguous extent of physical RAM, and the GC swaps out pointers it’s looking at for different virtual addresses to the same physical address. Trap handlers then patch up access to objects the GC is moving.

                              The whole “GC is linear wrt total heap size” conventional wisdom is a myth, perpetuated by mainstream GCs not doing any better.

                              [GC] fundamentally does more work!

                              It’s still a win for a GC if executing more instructions results in fewer memory accesses. CPUs are wicked fast, memory is slow. Back when I worked on a commercial database engine, loads of our code paths treated heap memory like most people treat disk.

                              1. 1

                                The whole “GC is linear wrt total heap size” conventional wisdom is a myth, perpetuated by mainstream GCs not doing any better.

                                Which GCs avoid doing work linear in the # objects on the heap? Are you saying that ZGC and Azul avoid this?

                                This isn’t something I’m familiar with, and would be interested in citations.

                                I still think you’re comparing the most advanced possible GC with the status quo in manual memory management, or more likely lots of old C and C++ code. (There is apparently still work on manual memory management like [1].)

                                The status quo in GC is basically what JVM, .NET and various dynamic language VMs have.

                                But I’m interested to hear more about the advanced GCs. How do they avoid doing work for every object?

                                The permanent generation is something I’ve seen in a few designs, and the idea makes sense, although I don’t know the details. I don’t think it’s free, because whenever you have generations, you have to solve the problem of pointers across them, e.g. with write barriers. Write barriers incur a cost that doesn’t exist in manual management schemes. How much I don’t know, but it’s not zero.

                                I’d rather see concrete use cases and measurements, but if we’re making general non-quantitative arguments, there’s another one against GC.

                                [1] https://news.ycombinator.com/item?id=19182779

                                EDIT: To see another reason why the batch argument doesn’t work, consider Rust. Roughly speaking, the Rust compiler does static analysis and inserts deallocation at the “right” points.

                                It does not “batch” the deallocation or allocation, as far as I know. Are you saying that Rust programs are slower than the equivalent garbage collected programs because of that? Maybe they would be if marking the heap were free? But it’s not free.

                                Anyway, my goal isn’t to get into a long abstract argument. I’d be more interested in hearing about GC strategies to minimize the total overhead of scanning the heap.

                                1. 2

                                  Which GCs avoid doing work linear in the # objects on the heap?

                                  There are lots of ways to skip work. As you said, the Go GC handles pointers to the middle of objects. Using similar principles a GC can handle pointers to a page of objects, and free the whole page together. You also mentioned Go’s escape analysis at compile time. Do the same escape analysis and count the number of escaped objects for an entire region of memory, dump it when it hits zero. Mark regions that contain external references: if a page never had objects with pointer fields, or if all the pointer fields reference within the page, why scan its pointers before releasing the memory?

                                  I still think you’re comparing the most advanced possible GC with the status quo in manual memory management, […] The status quo in GC is basically what JVM, .NET and various dynamic language VMs have.

                                  I’m refuting your claims that “GC inherently does more work than manual memory management” and that large heaps are an “unsolved problem.” Large heaps aren’t unsolved. GC doesn’t “inherently” do more work. And regardless, number of operations doesn’t equal efficiency.

                                  It does not “batch” the deallocation or allocation, as far as I know. Are you saying that Rust programs are slower than the equivalent garbage collected programs because of that?

                                  Of course Rust programs aren’t always slower. But garbaged collected programs aren’t always slower either, that’s my entire point here.

                            2. 3

                              Instead you just call malloc and free all the time, functions that lock, can contend with other threads, and needs to manipulate data structures. Compare to alloc in a GC world: atomic increment of arena pointer.

                              I’m not an expert in GC theory but this looks like a bold statement to me. I think that efficient allocation schemes in multithreaded environment are both doable and rather common, and I think that memory allocation in most GC implementations is far more expensive than a single atomic pointer increment. I totally understand that in some cases, state-of-the-art GC can match the performance of manual memory allocation, but I have yet to see a proof that GC is always better than no GC.

                              1. 4

                                I’m not saying GC is always better, just that it can be, and often is. Plenty of C/C++ code does ordinary per-object malloc and free, especially C++ when virtual classes come into play. For those applications I claim a sufficiently advanced GC would have higher throughput.

                                As I discussed in my other comment, you can optimize manual memory code to only allocate into arenas, and free arenas at exactly the right time. Having domain knowledge about when an arena can be freed will certainly be better than the GC guessing and checking.

                                allocation in most GC implementations is far more expensive than a single atomic pointer increment

                                Correct. But I also claim most mainstream GC implementations are mediocre. If the environment doesn’t support object relocation by the GC, the allocator needs to fill gaps in a fragmented heap. When the GC can compact a fragmented heap and release large chunks of contiguous memory, the allocator barely has to do anything.

                        2. 4

                          The problem is that such benchmarks usually compare hand-tuned, optimized-to-death code examples, which are simply not representative of the way code is written in the wild.

                          If you have unlimited time to tune and optimize, the fewer amenities the language/runtime has, the better, so non-GC-languages will always have an edge in this scenario.

                          BUT: 99.9% of the time, this is not the scenario anyone is operating under.

                          1. 2

                            I don’t think it’s a myth either – the issue is more complex than than that. It depends a lot on the application, etc.

                            But I also don’t think the benchmarksgame is a strong indication either way, because those programs are all 10 or 100 lines long. You can always do better on small programs without garbage collection – i.e. by applying some problem-specific knowledge to the code.

                            Large applications are where garbage collection shines, because I think the number/probability of memory bugs like use-after-free or memory leaks scales nonlinearly with application size. But that’s exactly where it’s hard to do any meaningful comparison, because we don’t have 2 independent implementations of large applications.

                            My suspicion is that GC / manual memory management isn’t really a useful variable to separate good and bad performance. It’s kind of like how the language doesn’t have too much bearing on how good an application is. There are good and bad apps in C, in Java, in Python, in Lisp, etc. The difference between good and bad apps is very large, and whether they use garbage collection probably isn’t the most salient factor. You can be very sloppy about memory with GC, or you can be disciplined.

                            Also, FWIW I have been making a list of C and C++ programs that use some kind of garbage collection / automatic memory management. So far I have

                            If anyone knows of others, I’m interested! I’ve been thinking about garbage collection in Oil’s implementation (both for its own structures and user structures, but more of the former right now.)

                            1. 3

                              You are incorrect about that GCC link. One, it links to a very old version of the docs, ore modern docs are here: https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Garbage-Collection.html Two, it IS a feature for the programs it compiles, using the Objective C runtime.

                              1. 2

                                Hm it looks like I have the wrong link, but GCC does do some garbage collection?


                                I’ve never worked on GCC, but I think I read a comment on Hacker News that said it used GC which led me to Google for it.

                                Yeah I just checked the source and this looks like it’s in the compiler itself and not the runtime library:

                                ~/src/languages/gcc-9.1.0/gcc$ wc -l ggc*
                                  1018 ggc-common.c
                                   322 ggc.h
                                   118 ggc-internal.h
                                    74 ggc-none.c
                                  2647 ggc-page.c
                                   525 ggc-tests.c
                                  4704 total

                                This garbage-collecting allocator allocates objects on one of a set of pages. Each page can allocate objects of a single size only; available sizes are powers of two starting at four bytes. The size of an allocation request is rounded up to the next power of two (`order’), and satisfied from the appropriate page.

                                And it looks like it’s used for a lot of core data structures:

                                tree-phinodes.c:      phi = static_cast <gphi *> (ggc_internal_alloc (size));
                                tree-ssanames.c:      ri = static_cast<range_info_def *> (ggc_internal_alloc (size));
                                tree-ssanames.c:  new_range_info = static_cast<range_info_def *> (ggc_internal_alloc (size));
                                tree-ssa-operands.c:      ptr = (ssa_operand_memory_d *) ggc_internal_alloc

                                This isn’t surprising to me because I have found memory management for hueg tree/graph data structures in compilers to be a big pain in the butt, and just this subdir of gcc is 1M+ lines of code. You really do want a garbage collector for that class of problems. However Clang/LLVM appears to use the C++ “ownership” style without GC. It seems very verbose though.

                              2. 3

                                If anyone knows of others,

                                Some game engines have their own garbage collection. I did not use any myself, but check https://wiki.unrealengine.com/Garbage_Collection_%26_Dynamic_Memory_Allocation

                                probably other C++ game engines offer something in this area too.

                                1. 2

                                  Thanks that’s a great example! I found this link to be pretty good:


                                  I think the commonality is that games have huge graph data structures, just like compilers. On the other hand, scientific programs have more flat arrays. And web servers written in C++ have request-level parallelism / isolation, so memory management can be simpler. So yeah it depends on the app, and for certain apps garbage collection is a huge win, even in C++.

                          2. 1

                            I’m mildly confused by the blurb and by the article: this seems like a rehash of old Lisp ideas. I wish the authors would have done a more direct comparison, rather than simply referring to a survey article. I would agree that the computational environment is different today, and is likely to lead to different outcomes in practice.