The idea to map the heap inside the process address space multiple times instead of zeroing some pointer bits on every access is rather cool.
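For anyone who wants to see the trick in miniature, here is a rough sketch, assuming a Unix-like system and using a temporary file as a stand-in for the heap's backing memory. The helper name `make_views` is made up for illustration, and this is not how the JVM actually sets up its mappings. It only shows that the same bytes can be reachable through more than one virtual address range, which is why a pointer with extra bits set can stay usable without masking.

```python
import mmap
import tempfile

PAGE = mmap.PAGESIZE

def make_views(n_views=2, size=PAGE):
    """Map one backing file n_views times (hypothetical helper).

    Each mmap object is a separate virtual address range, but all of
    them are shared mappings of the same underlying pages.
    """
    backing = tempfile.TemporaryFile()
    backing.truncate(size)  # give the file one page of zeroed storage
    views = [mmap.mmap(backing.fileno(), size) for _ in range(n_views)]
    return backing, views

backing, (view_a, view_b) = make_views()

# Write through one mapping, read through the other.
view_a[0:5] = b"hello"
print(view_b[0:5])
```

A write through either view is visible through the other, because both are shared mappings of the same file.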
Surely I’m not going to be the only one expecting a comparison here with Go’s. I’m not really well versed in GC, but this appears to mirror Go’s quite heavily.
It’s compacting and generational, so that’s a pair of very large differences.
My understanding, and I can’t find a link handy, is that the Go team is on a long term path to change their internals to allow for compacting and generational gc. There was something about the Azul guys advising them a year+ ago iirc.
Edit: I’m not sure what the current status is, haven’t been following, but see this from 2012, look for Gil Tene comments:
This presentation from this July suggests they’re averse to taking almost any regressions now even if they get good GC throughput out of it. rlh tried freeing garbage at thread (goroutine) exit if the memory wasn’t reachable from another thread at any point, which seemed promising to me but didn’t pan out. aclements did some very clever experiments with fast cryptographic hashing of pointers to allow new tradeoffs, but rlh even seemed doubtful about the prospects of that approach in the long term.
Compacting is a yet harder sell because they don’t want a read barrier and objects moving might make life harder for cgo users.
Does seem likely we’ll see more work on more reliably meeting folks’ current expectations, like by fixing situations where it’s hard to stop a thread in a tight loop, and we’ll probably see work on reducing garbage through escape analysis, either directly or by doing better at other stuff like inlining. I said more in my long comment, but I suspect Java and Go have gone on sufficiently different paths they might not come back that close together. I could be wrong; things are interesting that way!
Might be. I’m just going on what I know about the collector’s current state.
Other comments get at it, but the two are very different internally. Java GCs have been generational, meaning they can collect common short-lived garbage without looking at every live pointer in the heap, and compacting, meaning they pack together live data, which helps them achieve quick allocation and locality that can help processor caches work effectively.
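To make the "without looking at every live pointer" part concrete, here is a toy sketch of a young-generation collection. Everything in it is an assumption for illustration: the heap is a dict of object id to referenced ids, and `remembered` stands in for whatever a real collector's write barrier would record about old objects that point into the young generation. It is not modeled on any particular JVM collector.

```python
def minor_collect(heap, young, roots, remembered):
    """Return the ids of young objects that survive a minor collection.

    heap:       id -> list of ids that object references
    young:      set of ids living in the young generation
    roots:      ids referenced from stacks/globals
    remembered: old ids recorded as pointing into the young generation
    """
    # Start from roots that point directly at young objects ...
    worklist = [r for r in roots if r in young]
    # ... plus young objects referenced by remembered old objects.
    for old in remembered:
        worklist.extend(t for t in heap[old] if t in young)

    survivors = set()
    while worklist:
        obj = worklist.pop()
        if obj in survivors:
            continue
        survivors.add(obj)
        # Only follow edges that stay inside the young generation.
        worklist.extend(t for t in heap[obj] if t in young)
    return survivors

heap = {
    "old1": ["old2"], "old2": ["y1"],   # old objects, assumed live
    "y1": ["y2"], "y2": [],             # young, reachable via old2
    "y3": [], "y4": ["y3"],             # young, unreachable
}
young = {"y1", "y2", "y3", "y4"}
survivors = minor_collect(heap, young, roots=["old1"], remembered={"old2"})
print(sorted(survivors))
```

Note that `old1` is never scanned: only the remembered old object and the young objects themselves are visited, which is where the savings come from.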
ZGC is trying to maintain all of that and not pause the app much. Concurrent compacting GCs are hard because you can’t normally atomically update all the pointers to an object at once. To deal with that you need a read barrier or load barrier, something that happens when the app reads a pointer to make sure that it ends up reading the object from the right place. Sometimes (like in Azul C4 I think) this is done with memory-mapping tricks; in ZGC it looks like they do it by checking a few bits in each pointer they read. Anyway, keeping an app running while you move its data out from under it, without slowing it down a lot, is no easier than it sounds. (To the side, generational collectors don’t have to be compacting, but most are. WebKit’s Riptide is an interesting example of the tradeoffs of non-compacting generational.)
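Here is a toy model of the "check a few bits in each pointer you read" idea. The bit layout (`ADDR_BITS`, a single `REMAPPED` bit), the `ToyHeap` class, and the forwarding table are all assumptions made up for this sketch, not ZGC's actual layout or code. Pointers are plain integers, and a load that sees a pointer without the "known good" bit takes a slow path that looks up where the object moved and repairs the field.

```python
ADDR_BITS = 42                 # assumption: low bits hold the address
REMAPPED = 1 << ADDR_BITS      # assumption: one "pointer is up to date" bit
ADDR_MASK = REMAPPED - 1

class ToyHeap:
    def __init__(self):
        self.objects = {}      # address -> payload
        self.forwarding = {}   # old address -> new address
        self.slow_paths = 0    # how often the barrier had to do real work

    def relocate(self, old, new):
        """Move an object, leaving a forwarding entry behind."""
        self.objects[new] = self.objects.pop(old)
        self.forwarding[old] = new

    def load(self, holder, field):
        """Load barrier: runs whenever the app reads a pointer field."""
        ref = holder[field]
        if ref & REMAPPED:                 # fast path: just a bit test
            return ref
        self.slow_paths += 1               # slow path: pointer may be stale
        addr = ref & ADDR_MASK
        addr = self.forwarding.get(addr, addr)
        good = addr | REMAPPED
        holder[field] = good               # heal the field for next time
        return good

    def deref(self, ref):
        # This toy masks the metadata bit off before using the address.
        return self.objects[ref & ADDR_MASK]

heap = ToyHeap()
heap.objects[0x1000] = "payload"
holder = {"field": 0x1000}                 # stale: REMAPPED bit not set
heap.relocate(0x1000, 0x2000)              # the GC moves the object

ref = heap.load(holder, "field")
print(heap.deref(ref))
heap.load(holder, "field")                 # second load hits the fast path
print(heap.slow_paths)
```

The second load is just a bit test because the first one repaired the field in place, so the app only pays the slow path once per stale pointer.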
In Go all collections are full collections (not generational) and no heap compaction happens. So Go’s average GC cycle will do more work than a typical Java collector’s average cycle would in an app that allocates equally heavily and has short-lived garbage. Go is by all accounts good at keeping that work in the background. While not tackling generational, they’ve reduced the GC pauses to more or less synchronization points, under 1ms if all the threads of your app can be paused promptly (and they’re interested in making it possible to pause currently-uncooperative threads).
What Go does have going for it throughput-wise is that the language and tooling make it easier to allocate less, similar to what Coda’s comment said. Java is heavy on references to heap-allocated objects, and it uses indirect calls (virtual method calls) all over the place that make cross-function escape analysis hard (though JVMs still manage to do some, because the JIT can watch the app running and notice that an indirect call’s destination is predictable). Go’s defaults are flipped from that, and existing perf-sensitive Go code is already written with the assumption that allocations are kind of expensive. The presentation ngrilly linked to from one of the Go GC people suggests at a minimum the Go team really doesn’t want to accept any regressions for low-garbage code to get generational-type throughput improvements. I suspect the languages and communities have gone down sufficiently divergent paths about memory and GC that they’re not that likely to come together now, but I could be surprised.
One question that I don’t have a good feeling for is: could Go offer something like what the JVM has, where there are several distinct garbage collectors with different performance characteristics (high throughput vs. low latency)? I know simplicity has been a selling point, but like Coda said, the abundance of options is fine if you have a really solid default.
Doubtful they’ll have the user choose; they talk pretty proudly about not offering many knobs.
One thing Rick Hudson noted in the presentation (worth reading if you’re this deep in) is that if Austin’s clever pointer-hashing-at-GC-time trick works for some programs, the runtime could choose between using it or not based on how well it’s working out on the current workload. (Which it couldn’t easily do if, like, changing GCs meant compiling in different barrier code.) He doesn’t exactly suggest that they’re going to do it, just notes they could.
This is fantastic! Exactly what I was hoping for!
There are decades of research and engineering efforts that put Go’s GC and Hotspot apart.
Go’s GC is a nice introductory project, Hotspot is the real deal.
Go’s GC designers are not newbies either and have decades of experience: https://blog.golang.org/ismmkeynote
Google seems to be the nursing home of many people that had one lucky idea 20 years ago and are content with riding on their fame til retirement, so “famous person X works on it” has not much meaning when associated with Google.
The Train GC was quite interesting at its time, but the “invention” of stack maps is just like the “invention” of UTF-8 … if it hadn’t been “invented” by random person A, it would have been invented by random person B a few weeks/months later.
Taking everything together, I’m rather unconvinced that Go’s GC will even remotely approach G1’s, ZGC’s, or Shenandoah’s level of sophistication any time soon.
For me it is kind of amusing that huge amounts of research and development went into the Hotspot GC but on the other hand there seem to be no sensible defaults because there is often the need to hand tune its parameters.
In Go I don’t have to jump through those hoops, and I’m not advised to, but still get very good performance characteristics, at least comparable to (in my humble opinion even better than) those of a lot of Java applications.
On the contrary, most Java applications don’t need to be tuned and the default GC ergonomics are just fine. For the G1 collector (introduced in 2009 a few months before Go and made the default a year ago), setting the JVM’s heap size is enough for pretty much all workloads except for those which have always been challenging for garbage collected languages—large, dense reference graphs.
The advantages Go has for those workloads are non-scalar value types and excellent tooling for optimizing memory allocation, not a magic garbage collector.
(Also, to clarify — HotSpot is generally used to refer to Oracle’s JIT VM, not its garbage collection architecture.)
Thank you for the clarification.
I had the same impression while reading the article, although I also don’t know that much about GC.
This is really cool work. I couldn’t ignore the nagging feeling, though, that the GC team is in a race against time, as people are forced to move to things like OpenJDK because of licensing issues. It must be frustrating as hell to work on awesome stuff and have your parent corporation undermining you as you go.
ZGC is part of OpenJDK, isn’t it?
I could imagine it becoming commercial-only, but I guess RedHat’s move largely prevented that.
Oracle will rather give away its own new GC than have people migrate to the GC of a different vendor, where they lack the experienced developers to fix issues.
What RH move?
It probably refers to Shenandoah, a GC developed by Red Hat with properties similar to ZGC.
Correct me if I’m wrong, but I believe this is going to land in OpenJDK, not only the Oracle JRE.
Is there a pointer (ah!) to how it all compares to Azul’s C4?
I’m a bit confused by the mention of putting parts of the heap in slower memory. Shouldn’t that be the kernel + swap’s job? Or does the kernel only page out entire process address spaces (which would clearly be too coarse for what the article’s discussing)?
Linux on x86 and amd64 can page out individual 4kiB pages, so the granularity of that is fine.
It’s plausible that they might be able to get much better behaviour by doing it themselves instead of letting the kernel do it. Two things spring to mind:
If they’re managing object presence in user space, they know which objects are in RAM so they can refrain from waking them up when they definitely haven’t changed. Swap is mostly transparent to user processes. You really don’t want to wake up a swapped out object during GC if you can avoid it, but you don’t know which objects are swapped out without calling mincore() for every page, which is not very fast.
Other thing that springs to mind: AFAIK handling page faults is kinda slow, and an x86 running Linux will take something like (a large fraction of) a microsecond each time a fault occurs. AFAIK the fault mechanism in the CPU is quite expensive at the very least (it has to flush some pipelines). So doing your paging-in in userspace with just ordinary instructions that don’t invoke the OS or the slow bits of the CPU may be a big win.