It does read like the combination of the microbenchmarks and the gc tuning doesn’t let the generational hypothesis apply, which I think is where the author ends up.
Fwiw, 16MB in a 300MB heap does seem like a small nursery to me. Iirc, the hotspot JVM allocates 25% of the total heap to the nursery if you don’t otherwise configure it.
Perhaps they didn’t read the Blackburn, Cheng, and McKinley paper they referenced in the 2nd paragraph?
the best GC is never GCing, just let the OS clean up when you quit
if you can’t afford the address space to never GC (aka your program is too long-running) then the next best thing is to make the bump-ptr nursery as large as you can afford, i.e. half the free memory space, so you’re guaranteed to be able to copy everything to the mature generation if it turns out to be all live.
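(“Bump-ptr” allocation here just means advancing a pointer through a contiguous block; a minimal sketch, where the names and the 8-byte alignment are purely illustrative:)

```c
#include <stddef.h>
#include <stdint.h>

/* A nursery is just a contiguous block plus a cursor. */
typedef struct {
    uint8_t *start;
    uint8_t *end;
    uint8_t *bump;    /* next free byte */
} nursery_t;

/* Allocation is a bounds check and a pointer bump: no free lists,
 * no per-object bookkeeping. Returning NULL means the nursery is
 * full, which is the signal to copy survivors to the mature
 * generation (or just to exit, if you can get away with it). */
void *nursery_alloc(nursery_t *n, size_t size) {
    size = (size + 7) & ~(size_t)7;          /* keep 8-byte alignment */
    if ((size_t)(n->end - n->bump) < size)
        return NULL;
    void *p = n->bump;
    n->bump += size;
    return p;
}
```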
That may work for servers, or devices running a single process. It gets more complicated when you’re one of many processes in memory. What does “half the free memory space” even mean? What happens when five or six processes all try to allocate that much and fill it with objects? How much does GC slow down when a bunch of your nursery has been paged out by the time you collect it?
Deciding the initial size of the heap is one of those engineering questions that has no perfect answer. Make it too large and indeed you cause conflict with other users / tasks, make it too small and you can waste a lot of time on GCs while you slowly grow your heap.
On a machine with 16 GB or 128 GB or whatever I’d suggest that starting with, say, a 64 KB heap would be unnecessarily restrictive. 8 GB would be too much. Pick the answer in between that you are comfortable with. Perhaps do some research and find out what heap size most programs settle down to and allocate that, or something close to it (a half or a quarter, say) right from the start.
I’ve been contributing code to and tuning real-world GCs for about 25 years on a wide variety of devices, from embedded systems with a few KB of RAM, to mid-2000s “feature phones” with 400 KB to 2 MB of RAM (at Innaworks), to modern Android and iOS phones, PCs, and servers. I’ve worked, for example, at Samsung R&D tuning and improving GC algorithms for Android Dalvik/ART and the Tizen DotNet JIT/runtime.
Paging anonymous memory to disk and doing in-process scanning garbage collection are indeed basically oil and water.
So I haven’t built a system of this sort, and I’m sure there are better choices than what I can come up with in the comments here, but start by allocating a block of memory.
When that block approaches being half full, do a GC, copying to the free space.
If that didn’t free enough memory, ask the OS for more—not a small amount more, but a lot more (maybe double what you had before). If you’re already using a “large enough” proportion of physical memory, don’t ask for more. Continue processing, but you’ll probably enter a GC death spiral where you’re repeatedly calling GC without doing much work. When that happens, die.
If you did start swapping…curse overcommit? (You can actually track the amount of time spent in GC and die in this scenario).
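In code, the sizing/trigger policy I’m describing might look roughly like this. The constants are invented, and the actual copying is stubbed out, so this is just the grow-or-die logic, not a working collector:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative thresholds -- not from any real collector. */
#define INITIAL_HEAP      (8u << 20)     /* 8 MB to start               */
#define MAX_HEAP          (512u << 20)   /* "large enough" cap          */
#define GC_TIME_FRACTION  0.5            /* death-spiral threshold      */

typedef struct {
    size_t capacity;      /* current block size                  */
    size_t used;          /* bytes allocated since last collect  */
    double gc_seconds;    /* time spent collecting               */
    double run_seconds;   /* total time running                  */
} heap_t;

/* Stub: a real collector would copy live objects into the free half
 * and return how many bytes survived. Here we just pretend half did. */
static size_t collect(heap_t *h) {
    size_t survivors = h->used / 2;
    h->used = survivors;
    return survivors;
}

static void maybe_collect(heap_t *h) {
    if (h->used * 2 < h->capacity)
        return;                          /* less than half full: keep going */

    clock_t t0 = clock();
    size_t survivors = collect(h);
    h->gc_seconds += (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Still at least half full after collecting: grow aggressively
     * (double), unless we've hit the cap; at the cap, check for a
     * death spiral and give up if most of our time is going to GC. */
    if (survivors * 2 >= h->capacity) {
        if (h->capacity < MAX_HEAP) {
            h->capacity *= 2;
        } else if (h->gc_seconds > GC_TIME_FRACTION * h->run_seconds) {
            fprintf(stderr, "GC death spiral: giving up\n");
            exit(1);
        }
    }
}

int main(void) {
    heap_t h = { .capacity = INITIAL_HEAP };
    for (int i = 0; i < 1000; i++) {     /* fake mutator: 1 MB per step */
        h.used += 1u << 20;
        h.run_seconds += 0.001;          /* pretend each step takes 1 ms */
        maybe_collect(&h);
    }
    printf("final heap: %zu MB\n", h.capacity >> 20);
    return 0;
}
```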
I think systems that are willing to overcommit anonymous memory are probably less likely to come configured with disk swap, because it’s less necessary for correct operation in the average case (i.e., when there’s not a run on the bank for pages). Instead you tend to have the OOM killer, a dubious foreclosure mechanism.
In a system without overcommit, an allocation request requires a strict reservation of either physical memory, or somewhere else to be able to promise you’ll be able to store the data so that you’re never caught in a situation that the OOM killer would putatively solve. Classical UNIX things like fork() don’t work very well if you don’t have a bunch of swap to paper over the short term duplication of reservation. Or, you don’t get to make good use of the physical memory you actually have – some of it has to remain fallow for things to work.
macOS has both overcommit and disk swap. There’s no OOM killer (unlike iOS, which does not swap). If paging gets excessive or disk space runs too low, a supervisor process starts suspending processes and pops up an alert asking you to select apps to be killed. It’s happened to me a few times, and it works pretty well at letting you fix things without having to reboot.
It also has the “compressed memory” business, which I find somewhat fascinating. It’s novel, as far as swap targets go, except that it feels like it would be impossible to assess the capacity of it like you can with, say, a disk partition. I guess the fact that all Mac OS X systems are interactive allows you to do things like the pause and prompt you’re describing.
It isn’t novel; Windows and Linux both have memory-compression paging support. On Linux there are both zram and zswap (I’ve only ever used zswap). I even remember seeing on Phoronix that Linux recently got support for using the hardware-accelerated compression engine on Intel’s server chips for zswap. Windows has memory compression too; the easiest way to tell is that Task Manager shows how much memory has been saved by compression.
EDIT: Unless you meant the part about asking the user which apps to kill; I agree that is novel.
The concept goes back at least as far as Connectix’s RAMDoubler utility from the mid-90s.
The article also makes the point that nursery size is important. Can a generational collector self-tune in this regard?
e.g. increase or decrease the nursery’s proportional size after every nursery GC depending on the proportion of nursery objects freed (a high percentage suggests a bigger nursery may be useful, and vice versa). Of course, this is swapping one heuristic for another, but this one might help tune the system to different kinds of workload?
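Roughly the shape of what I mean, as a sketch; the thresholds, step size, and bounds are made up purely for illustration:

```c
#include <stddef.h>

/* Invented tuning constants -- purely illustrative. */
#define FREED_HIGH    0.95   /* nearly everything died: a bigger nursery may pay off */
#define FREED_LOW     0.70   /* a lot survives: shrink the nursery, reduce copying   */
#define STEP          0.05   /* adjust the nursery's share of the heap by 5 points   */
#define MIN_FRACTION  0.05
#define MAX_FRACTION  0.50

/* Called after each nursery collection with the nursery size before the
 * collection and the number of bytes that survived (were copied out).
 * Returns the fraction of the heap to give the nursery next cycle. */
double retune_nursery(double fraction, size_t nursery_size, size_t survived)
{
    double freed = 1.0 - (double)survived / (double)nursery_size;

    if (freed > FREED_HIGH && fraction < MAX_FRACTION)
        fraction += STEP;            /* high death rate: grow the nursery */
    else if (freed < FREED_LOW && fraction > MIN_FRACTION)
        fraction -= STEP;            /* low death rate: shrink it         */

    return fraction;
}
```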
Some collectors can self-tune, yes. The G1GC for Hotspot has a large number of independent regions, and after each young-gen collection, it can adjust the number of them to try and hit pause and throughput goals (which are the standard way of tuning that collector).
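(In HotSpot terms, those are the goals you set with flags like -XX:MaxGCPauseMillis and -XX:GCTimeRatio.)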
Great article! The scientific method at work. It’s always good to question and revalidate old results, especially since hardware evolution (caches, branch prediction…) can change the “laws of nature”, or at least the physical constants, out from under us.
It sounds to me like the choice of benchmarks could be an issue. I’ve been rereading various GC papers recently; they frequently seem to use compilers, like javac, as benchmarks, which seems reasonable.
I’d have thought compilers were pretty unrepresentative for GC. They create a big AST as they parse, which exists as long as the program is valid. Then they walk it and generate code, some of which may be short-lived, but something like javac doesn’t really do any optimisation (it used to, and then HotSpot spent the first part of the run having to undo javac’s optimisations) and is little more than a streaming write of the AST. Most programs have a much broader set of lifetimes than compilers.
Hans Boehm has a nice anecdote about adding the BDW GC to GCC. In his first version, it made GCC quite a lot faster. It turned out that it was equivalent to just never calling free: the compiler was so short-lived that the threshold for triggering GC was never reached. LLVM learned from this and mostly uses arena allocators, and has an option (on by default in release builds, I believe) for clang not to ever free things. Clangd needs to reclaim memory because it’s a long-lived thing, but most objects in clang are live for most of the duration that the tool runs, so it’s faster to simply let exit be your GC.
For what it’s worth, ZGC and Shenandoah both experienced a large throughput boost after moving to a generational GC (they were already low-latency, i.e. bounded pause time), on real-world benchmarks as well.
The amount of work copying collectors do in a collect cycle is proportional to the LIVE data only (unlike mark-and-sweep, which is proportional to the size of the heap).
With a copying collector, as long as the percentage of live data is low relative to the overall heap size, the amortized cost will be a lot lower than mark-and-sweep or reference counting. And in practice, the pause shouldn’t be noticeable for most workloads, with large heaps.
Lots of little collections of a nursery aren’t going to amortize as well, because the average percentage of live data at the time of collect will be much higher.
I vaguely remember a paper from Andrew Appel’s group at Princeton with empirical data backing this up in practice.
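As a back-of-the-envelope illustration of the amortization argument: in a two-semispace heap where each semispace holds H bytes and L bytes are live at collection time, a collection costs roughly c·L (copying the live data) and yields H − L bytes of fresh allocation room, so the cost per byte reclaimed is about c·L/(H − L). At 10% live that’s around c/9; at 50% live it’s c, roughly a 9x difference driven by nothing but the live fraction.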
My general advice:
If you know something is going to be long-lived, it makes more sense to stick it in its own heap / arena and pin it until you’re done with it.
Otherwise, avoid dividing up your heaps; make big heaps and collect only when necessary.
The confounding factor here is cache. It’s really fast to run a GC sweep in the cache. It’s much slower if the set of live objects doesn’t fit in the cache because GC is a random graph walk, which will require things to be brought in from main memory and will not work well with prefetchers. At the same time, it will blow away the cache entirely and so make things slower for a bit after GC.
This was the premise behind Sun Labs’ Project Maxwell (mostly unpublished, sadly). They observed that a lot of short-lived objects had lifetimes that ended before they were evicted from cache. If you do a hardware young-generation sweep in the cache, you can delete a very large proportion of objects before they leave the cache. This improves GC performance and also performance of mutator threads (cache utilisation improves). It’s a shame internal politics between the Niagara and Rock teams killed that project.
When copying, you are doing work to evict data when the percentage of live data is still relatively high. If you wait until absolutely necessary, many pages will never be touched at all.
When scanning an allocation for pointers, the linked allocations also tend to be highly local. So the cache misses aren’t particularly a factor relative to NOT paging in most of the pages used.
Using a few microbenchmarks to make general assumptions about the efficiency of GC algorithms is something one should be extremely sceptical of. It all depends on allocation and usage patterns, which is very application-specific. There is no perfect fit.
I’d love to see more info or lore on this topic, their results baffle me too.