Replicating my take from hackernews:
Chris makes some points which are interesting in their own right, particularly the one comparing the amount of research which has gone into tracing/copying gc vs rc. But I think there is an unmentioned bias which is worth pointing out. Tracing gc has found massive success in enterprise java applications, which:
Need to scale in terms of code size and contributor count, and therefore need the modularity afforded by tracing gc
Need to scale in terms of heap size, and therefore run into the pathological fragmentation issues that come from non-compacting gc
Need the higher throughput afforded by tracing gc
Generally run on massive servers that have the cores, memory size, memory speed, power budget, and heat dissipation required to support high-quality tracing gc
Contrariwise, swift is a product of apple, a laptop and phone company, and is used to make laptop and phone apps which:
Are unlikely to ever get really huge (people balk at large downloads)
May run with limited memory, and need to share that memory with other apps
Run on battery-powered hardware with little or no active cooling, and so need to attempt to conserve power
No silver bullet, as they say. I implement APL, which absolutely cannot do without reference counting, except perhaps in some novel configurations. I think that for the more usual case of a pointer-chasing language, tracing gc is almost certainly the right choice.
Lastly, however, I would be remiss not to bring up bacon et al, ‘A Unified Theory of Garbage Collection’, which notes that tracing gc and reference counting are not really opposed to one another, and that a sufficiently advanced implementation of either will approach the other. Lattner mentions barriers (though it is important to note that barriers are much cheaper than reference count twiddling); on the other side of the coin, consider deferred reference counting, one-bit reference counts, …
The bottom line being that memory management (or more generally allocation of finite resources; see e.g. register allocation) is hard, in all cases; a sufficiently robust and general solution will end up being rather complicated no matter what it does; and, again, there is no silver bullet.
I think the first time I met Chris was at FOSDEM 2011, when I had almost exactly this conversation with him. I was giving a main track talk, he was giving a keynote and so we were staying in the same hotel and had breakfast together. I complained a lot about GC in Objective-C. I had two main objections:
First, the type system didn’t differentiate between memory that could and couldn’t store Objective-C pointers. NSAllocateCollectable had a flag indicating whether the GC’d buffer could contain pointers to GC’d objects but the return type was void* in both cases, as was the return type of malloc. If you cast a void* buffer to id*, you could store pointers in it without any compiler warnings. If the underlying storage for the buffer were on the stack, in a global, the return from NSAllocateCollectable with NSScannedOption, or a field (possibly an array) in an object, then everything worked fine. If it were anything else, then sooner or later the GC would deallocate the object and you’d have a dangling pointer. This meant that Objective-C with GC had more memory safety bugs than with manual retain / release.
Second, there was no good incremental adoption story. You had to port your Objective-C to work in GC’d mode; you could also compile it in a hybrid mode that supported both GC and manual memory management, but this significantly increased your testing burden and meant that you couldn’t actually use the GC’d version until all of your dependencies had been upgraded. Destruction and finalisation were completely different code paths and you needed to implement both.
To address the second (on the train to a prior FOSDEM), I’d hacked together something based on the Bacon paper that used the GC barriers that the compiler inserted in GC’d mode but used them to do retain / release operations and then added a cycle detector on top of this. The cycle detector used Objective-C reflection metadata and a fallback path that allowed an object to explicitly walk data structures that were not exposed in this way (e.g. malloc’d buffers). This let you use retain/release and GC’d object files together and get the cycle detection spanning both. The code had some concurrency bugs but it served as a reasonable proof of concept. During collection of a garbage cycle, it deferred the actual deletion until all destructors had run.
After listening to my rant, Chris smiled and told me I was going to be very happy that summer. The fact that ARC was announced at WWDC on my birthday is, I’m sure, a coincidence, but I like to think it was Apple’s birthday present to me.
I still like the determinism of reference counting but it has one very big problem: it does not play nicely with modern cache coherency protocols. For scalable memory use, you want every object to be in one of two states:
Mutable, used by a single core at a time (exclusive state in that core’s cache, invalid everywhere else)
Immutable, used by as many cores as you like (shared state in all caches)
Reference counting for mutable objects in this scheme is completely fine. For immutable objects, it is bad because every reference count update is a read-modify-write operation. This is solvable, but only if your type system understands and guarantees global immutability.
In Verona, when you freeze a region (which creates an immutable object graph from the objects in that region and all child regions), we walk the graph to do SCC construction and generate a reference count for each strongly connected component (SCC). This lets us use reference counting even for cyclic data structures by taking advantage of the fact that the immutability guarantee means that the shape of the object graph can’t change, so any cycle shares a refcount between all objects. Having one reference count for each SCC makes this problem worse in the first instance because taking a reference to any object potentially introduces false sharing with the refcount of another object in the same SCC. The immutability guarantee should offset this by allowing us to significantly reduce refcount manipulations. There was a lot of work on refcount elision in the ’80s that became obsolete in the ’90s when multithreading became popular and all of the invariants stopped holding in the presence of concurrent mutation. This all becomes possible again in the immutable case because there’s no concurrent mutation. This means that if we hold a reference to object A, then we don’t need to do refcount manipulations for any transient references that we hold to any object reachable from A. We need to incref / decref when assigning to a field in the heap, but we can (if we need to) optimise that as well by keeping a remembered set per region so that we have a local refcount for each SCC that we’re holding in a particular region and only increment / decrement the global reference count on transitions between 0 and 1 for the local refcount.
The Verona type system also guarantees that no mutable object is shared between threads and so we can do non-atomic refcount manipulation, which simplifies a bunch of things. Reference counting requires a tradeoff between refcount size and cost of handling overflow. You can use a 64-bit refcount and guarantee that you never see overflow (okay, if you incref once per cycle on a 2GHz machine for 2000 years, you’ll see an overflow, but if your program has 2000 years of uptime then you’re winning), but then you’re burning 8 bytes per object. That’s fine for Objective-C, where objects are generally large, but not so fine for languages that encourage small objects with 1-2 fields. You can steal the low bits from the descriptor (/ vtable / isa) pointer, but then you have to handle the overflow case. This is really annoying for atomic refcounting because now you have to do a load/cmpxchg pair and branch on overflow. It’s much simpler for non-atomic cases because you can ignore the case of two increfs coming in at the same time, but it still requires the fallback code.
One of the starting points for Verona was the acknowledgement that there is no one-size-fits-all solution to memory management. We want you to be able to choose a memory management strategy for each region, independently. If you know that you have linear ownership then we can avoid any dynamic bookkeeping. If you know that you have no cycles (or you’re happy for cycles to leak until the region is destroyed), then we can do refcounting. If you want complex object graphs, we can do tracing (but the fact that regions are single-owner means that we don’t need to do concurrent GC to avoid pauses, we can do local collection on the region at any point when its owner doesn’t have anything else to do). If you have a load of short-lived objects then we can do arena allocation and deallocate them all at once with no sub-region memory management policy at all.
I still like the determinism of reference counting
‘RC as deterministic’ is a bad meme. For resources other than memory (files, mutexes…), scope-bound or linear handling is more predictable and suffices for simple cases; complex cases (say, a caching server) will degenerate to manual management no matter what you do (e.g. in the case of a caching server where you’ve kept around a reference to a file that you want to close, you would rather close it and then crash and burn when somebody tries to read through the other reference, rather than leak it, so if you write something like that in a gc or rc language, you will bypass those mechanisms). For memory, the possibility of cycles and rc cascade means you sacrifice modularity for that predictability, and pay in leaks or pauses when you make a mistake. Whereas gc won’t pause if you use a good concurrent or incremental implementation, and eliminates one class of leak; both of which make it easier to reason about the behaviour of programs using code you didn’t write.
I’m curious: what are your thoughts on ‘A Framework for Gradual Memory Management’?
One of the starting points for Verona was the acknowledgement that there is no one-size-fits-all solution to memory management
Sorry, ‘deterministic’ has a lot of overloaded meanings and you’re right that I’m implying a lot more than I actually want to claim. In the context of reference counting, I specifically mean that your worst-case peak memory consumption is deterministic. This is important for a lot of use cases. On a mobile device, it’s the number that causes your app to be killed. For a server / cloud deployment, it’s often the bounding function for how much you’re paying. With a global tracing implementation, your peak memory consumption depends on the concurrent interaction between your mutator threads and the collector. The collector is not under your control.
Any GC implementation is a tradeoff in three dimensions:
Memory overhead
Mutator throughput
Mutator latency
You can optimise for any of these. Stop-the-world collectors optimise for mutator throughput. They impose no penalty on mutators while they’re running (no barriers), but they introduce latency spikes and large memory overhead (memory consumption grows until the pause). Fully concurrent collectors need to synchronise mutator and collector thread states and so impose a mutator throughput penalty from barriers (as does reference counting, from the refcount manipulations) and from having GC threads running (RC can be thought of as having a GC thread that preempts the mutator thread and runs on every decref-to-zero) but don’t impose a latency penalty. Most GC implementations (including ObjC / Swift’s ARC) are tuned for one or two places in this tradeoff space. Our hypothesis for Verona is that, as programs grow in size and complexity, there is no point in this tradeoff space that is a good fit for all of the program.
Thank you for your fascinating and thorough comments!
I thought I’d been keeping up to date with Verona’s docs, but apparently not, as a lot of this is new to me! I was particularly surprised that Verona was exploring the idea of making regions immutable.
Is there anything public that I can read on this? I’d love to compare it to our approach for Vale at [0], where we can immutably open a mutex, or implicitly freeze all pre-existing memory when entering a pure function.
Also, where is the Verona team based? If it’s in the US, I’d love to stop by and buy a round for them all and talk regions!
[0] https://verdagon.dev/blog/zero-cost-refs-regions
I thought I’d been keeping up to date with Verona’s docs, but apparently not, as a lot of this is new to me! I was particularly surprised that Verona was exploring the idea of making regions immutable.
Freezing regions to create immutable graphs has always been part of our abstract machine. It’s something that you can do only if your type system has viewpoint adaptation, to allow you to express the idea that an object is immutable because the object whose field you read to access it was also immutable.
Is there anything public that I can read on this? I’d love to compare it to our approach for Vale at [0], where we can immutably open a mutex, or implicitly freeze all pre-existing memory when entering a pure function.
Not a huge amount, we’ve open sourced it to collaborate with a few universities but we’re not great at documentation yet.
Also, where is the Verona team based? If it’s in the US, I’d love to stop by and buy a round for them all and talk regions!
We’re in Cambridge (UK).
Having one reference count for each SCC makes this problem worse in the first instance because taking a reference to any object potentially introduces false sharing with the refcount of another object in the same SCC.
Sorry, what does “in the first instance” refer to here?
First-order effect. It enables later optimisations but without those it makes the false sharing worse. As second-order effects, we believe, the combination lets us avoid any tracing after the freeze and makes it easy to elide refcounts via simple local dominance analysis but as a first-order effect it increases false sharing on refcount manipulations.
Yeah, the sort of Occam’s Razor explanation of Apple going this direction is that your initial reference counting implementation is likely to have decent latency characteristics, which is handy if you want 60fps UI updates on a phone, and then you go back and work to improve throughput.
With GC you can get good multithreaded throughput earlier, but there’s a long road to combining that with almost-pauseless operation.
And other factors: trying to be compatible with C complicates moving collectors, and like you mention GC really benefits from spare RAM, which was not really available on early smartphones.
To me, an area that seems interesting to explore is that Rust has shown that humans can handle annotating their code with a lot of info on lifetimes and what is or is not shared. We’ve seen how that works out with Rc/Arc as the main escape hatch for lifetimes that are hard to capture statically (an almost inevitable choice for Rust since it’s trying to work in a systems context). It doesn’t seem like it’s been worked out quite as far on the GC side.
And other factors: trying to be compatible with C complicates moving collectors
Having implemented a moving collector for C, I would go a step further: you cannot have a moving collector without breaking large amounts of real-world C code or imposing very large performance penalties. In C, it is very common to cast a pointer to an integer and then use that as a key in a hash table or tree. Trees are kind-of okay, because you just need to keep a stable ordering, so a moving collector that compacts within an arena works. Hash tables are not because address equality is also used for identity. Identity is where the problems start.
In Java, for example, there is a hashCode method that returns the address of an object the first time that it’s called. Subsequent calls will return the address that the object had on the first call. If the object is relocated then the old address stays in the header. This is fine in Java because hashCode is not guaranteed to be unique, it is only guaranteed to have a low probability of collision. It’s also possible in Java because arrays are objects with an indexing operator. In C, you cannot grow an object during relocation because it would break array address computation, so you’d need to unconditionally allocate space for the stashed address (and, because arrays don’t really exist in the C type system, you can’t use the index of an object in an array to find the address of the start of the array and amortise this cost across the array). Worse, you can take the address of fields in C, so the problem compounds if you’re using field-address-equality comparisons to define identity and an object is relocated.
To support a moving collector in C, you’d need (at the very least) to add padding to every object and make pointers fat pointers so that interior pointers can find the start of the object. That would be a completely new ABI that broke assumptions that a lot of C code has.
you cannot have a moving collector without breaking large amounts of real-world C code or imposing very large performance penalties
There is another option: virtual memory tricks. (Interleave sparse pages with disjoint coverage.) I came up with this scheme on a lark, and was shocked to find an actual academic paper had been written on the subject. I don’t have a pointer handy, but I presume it was written by emery berger.
In Java, for example, there is a hashCode method that returns the address of an object the first time that it’s called. Subsequent calls will return the address that the object had on the first call. If the object is relocated then the old address stays in the header
There is an interesting alternative to this approach, which I believe is used by SBCL: when objects are relocated, tell hash tables containing them to rehash themselves. This is nice because you can still use an object’s address as a hash. But SBCL only has a copying nursery; its oldspace is immobile; I expect that for the more advanced java gcs, the overhead of rehashing tables in oldspace ends up not being worth it.
That would be a completely new ABI that broke assumptions that a lot of C code has.
I find it rather unfortunate the extent to which c programs rely on vendor-specific behaviour. It is of course valuable that mcu vendors can provide bizarro compilers, but this is of fairly little significance to most application programmers, and somewhat limits the use of the standard for the latter. Because it is not practical to change the entire body of existing c code, I would like to see some standardisation (whether optional, unofficial, whatever) of these language features which are exploited by nearly every real c program. Appendix J does this in principle, but not in practice.
I came up with this scheme on a lark, and was shocked to find an actual academic paper had been written on the subject. I don’t have a pointer handy, but I presume it was written by emery berger.
Emery Berger did some fairly recent work in this space, but the original paper proposing this was actually written by Vikram Adve, Chris Lattner’s supervisor.
There is an interesting alternative to this approach, which I believe is used by SBCL: when objects are relocated, tell hash tables containing them to rehash themselves.
That works if you can track everything that depends on the address. This isn’t the case for C, where you can cast a pointer to a long and then launder that integer value through an unbounded amount of data flow and type punning.
Because it is not practical to change the entire body of existing c code, I would like to see some standardisation (whether optional, unofficial, whatever) of these language features which are exploited by nearly every real c program
So many words are dedicated to explaining how users have control of the machine. But at scale, even at a few thousand lines of code, users must constantly trade away machine control in favor of uniform simplifying abstractions.
The main problem with reference-counting, and the reason it’s not found in JIT toolkits, is that counting references is not compositional. Garbage-collected algorithms become compositional because objects are uniformly managed beyond the user’s control. Combine this with the phenomenon that garbage-collection nurseries can sometimes effectively avoid allocating permanent object storage, and we have not just a uniform abstraction which simplifies memory management, but a complete memory-safety solution.
(Yes, I know that algebraic effects can be used to manage references with composition, but Swift does not do that, and Lattner does not discuss it.)
At runtime, JIT-compiled code has to check its invariants. For example, every object about to be referenced should be alive. This is easy, even trivial, for runtimes with GC; the ability to reference an object implies liveness. But with RC, each object has to be examined, sometimes multiple times per compilation unit, because RC can reap an object despite native code referring to it.
Replicating my take from hackernews:
Chris makes some points which are interesting in their own right, particularly the one comparing the amount of research which has gone into tracing/copying gc vs rc. But I think there is an unmentioned bias which is worth pointing out. Tracing gc has found massive success in enterprise java applications, which:
Need to scale in terms of code size and contributor count, and therefore need the modularity afforded by tracing gc
Need to scale in terms of heap size, and therefore run into the pathological fragmentation issues that come from non-compacting gc
Need the higher throughput afforded by tracing gc
Generally run on massive servers that have the cores, memory size, memory speed, power budget, and heat dissipation required to support high-quality tracing gc
Contrariwise, swift is a product of apple, a laptop and phone company, and is used to make laptop and phone apps which:
Are unlikely to ever get really huge (people balk at large downloads)
May run with limited memory, and need to share that memory with other apps
Run on battery-powered hardware with little or no active cooling, and so need to attempt to conserve power
No silver bullet, as they say. I implement APL, which absolutely cannot do without reference counting, except perhaps in some novel configurations. I think that for the more usual case of a pointer-chasing language, tracing gc is almost certainly the right choice.
Lastly, however, I would be remiss not to bring up bacon et al, ‘A Unified Theory of Garbage Collection’, which notes that tracing gc and reference counting are not really opposed to one another, and that a sufficiently advanced implementation of either will approach the other. Lattner mentions barriers (though it is important to note that barriers are much cheaper than reference count twiddling); on the other side of the coin, consider deferred reference counting, one-bit reference counts, …
The bottom line being that memory management (or more generally allocation of finite resources; see e.g. register allocation) is hard, in all cases; a sufficiently robust and general solution will end up being rather complicated no matter what it does; and, again, there is no silver bullet.
I think the first time I met Chris was at FOSDEM 2011, when I had almost exactly this conversation with him. I was giving a main track talk, he was giving a keynote and so we were staying in the same hotel and had breakfast together. I complained a lot about GC in Objective-C. I had two main objections:
First, the type system didn’t differentiate between memory that could and couldn’t store Objective-C pointers.
NSAllocateCollectable
had a flag indicating whether the GC’d buffer could contain pointers to GC’d objects but the return type wasvoid*
in both cases, as was the return type ofmalloc
. If you cast avoid*
buffer toid*
, you could store pointers in it without any compiler warnings. If the underlying storage for the buffer were on the stack, in a global, the return fromNSAllocateCollectable
withNSScannedOption
, or a field (possibly an array) in an object, then everything worked fine. If it were anything else, then sooner or later the GC would deallocate the object and you’d have a dangling pointer. This meant that Objective-C with GC had more memory safety bugs than with manual retain / release.Second, there was no good incremental adoption story. You had to port your Objective-C to work with GC’d mode and then you could also do manual memory management and compile it in a mode where it ran with both modes, but this significantly increased your testing burden and meant that you couldn’t actually use the GC’d version until all of your dependencies had been upgraded. Destruction and finalisation were completely different code paths and you needed to implement both.
To address the second (on the train to a prior FOSDEM), I’d hacked together something based on the Bacon paper that used the GC barriers that the compiler inserted in GC’d mode but used them to do retain / release operations and then added a cycle detector on top of this. The cycle detector used Objective-C reflection metadata and a fallback path that allowed an object to explicitly walk data structures that were not exposed in this way (e.g.
malloc
‘d buffers). This let you use retain/release and GC’d object files together and get the cycle detection spanning both. The code had some concurrency bugs but it served as a reasonable proof of concept. During collection of a garbage cycle, it deferred the actual deletion until all destructors had run.After listening to my rant, Chris smiled and told me I was going to be very happy that summer. The fact that ARC was announced at WWDC on my birthday is, I’m sure, a coincidence, but I like to think it was Apple’s birthday present to me.
I still like the determinism of reference counting but it has one very big problem: it does not play nicely with modern cache coherency protocols. For scalable memory use, you want every object to be in one of two states:
Reference counting for mutable objects in this scheme is completely fine. For immutable objects, it is bad because every reference count update is a read-modify-write operation. This is solvable, but only if your type system understands and guarantees global immutability.
In Verona, when you freeze a region (which creates an immutable object graph from the objects in that region and all child regions), we walk the graph to do SCC construction and generate reference counts for each set of strongly connected components (SCCs). This lets us use reference counting even for cyclic data structures by taking advantage of the fact that the immutability guarantee means that the shape of the object graph can’t change, so any cycle shares a refcount between all objects. Having one reference count for each SCC makes this problem worse in the first instance because taking a reference to any object potentially introduces false sharing with the refcount of another object in the same SCC. The immutability guarantee should offset this significantly by allowing us to significantly reduce refcount manipulations. There was a lot of work on refcount elision in the ’80s that was obsolete in the ‘90s when multithreading became popular and all of the invariants stopped holding in the presence of concurrent mutation. This all becomes possible again in the immutable case because there’s no concurrent mutation. This means that if we hold a reference to object A, then we don’t need to do refcount manipulations for any transient references that we hold to any object reachable from A. We need to incref / decref when assigning to a field in the heap, but we can (if we need to) optimise that as well by keeping a remembered set per region so that we have a local refcount for each SCC that we’re holding in a particular region and only increment / decrement the global reference count on transitions between 0 and 1 for the local refcount.
The Verona type system also guarantees that no mutable object is shared between threads and so we can do non-atomic refcount manipulation, which simplifies a bunch of things. Reference counting requires a tradeoff between refcount size and cost of handling overflow. You can use a 64-bit refcount and guarantee that you never see overflow (okay, if you incref once per cycle on a 2GHz machine for 2000 years, you’ll see an overflow, but if your program has 2000 years of uptime then you’re winning), but then you’re burning 8 bytes per object. That’s fine for Objective-C, where objects are generally large, but not so fine for languages that encourage small objects with 1-2 fields. You can steal the low bits from the descriptor (/ vtable / isa) pointer, but then you have to handle the overflow case. This is really annoying for atomic refcounting because now you have to do a load/cmpxchg pair and branch on overflow. It’s much simpler for non-atomic cases because you can ignore the case of two increfs coming in at the same time, but it still requires the fallback code.
One of the starting points for Verona was the acknowledgement that there is no on-size-fits-all solution to memory management. We want you to be able to choose a memory management strategy for each region, independently. If you know that you have linear ownership then we can avoid any dynamic book keeping. If you know that you have no cycles (or you’re happy for cycles to leak until the region is destroyed), then we can do refcounting. If you want complex object graphs, we can do tracing (but the fact that regions are single-owner means that we don’t need to do concurrent GC to avoid pauses, we can do local collection on the region at any point when its owner doesn’t have anything else to do). If you have a load of short-lived objects then we can do arena allocation and deallocate them all at once with no sub-region memory management policy at all.
‘RC as deterministic’ is a bad meme. For resources other than memory (files, mutexes…), scope-bound or linear handling is more predictable and suffices for simple cases; complex cases (say, caching server) will degenerate to manual management no matter what you do (e.g. in the case of a caching server where you’ve kept around a reference to a file that you want to close, you would rather close it and then crash and burn when somebody tries to read through the other reference, rather than leak it, so if you write something like that in a gc or rc language, you will bypass those mechanisms). For memory, the possibility of cycles and rc cascade means you sacrifice modularity for that predictability, and pay in leaks or pauses when you make a mistake. Whereas gc won’t pause if you use a good concurrent or incremental implementation, and eliminates one class of leak; both of which make it easier to reason about the behaviour of programs using code you didn’t write.
I’m curious: what are your thoughts on ‘A Framework for Gradual Memory Management’?
Sorry, ‘deterministic’ has a lot of overloaded meanings and you’re right that I’m implying a lot more than I actually want to claim. In the context of reference counting, I specifically mean that your worst-case peak memory consumption is deterministic. This is important for a lot of use cases. On a mobile device, it’s the number that causes your app to be killed. For a server / cloud deployment, it’s often the bounding function for how much you’re paying. With a global tracing implementation, your peak memory consumption depends on the concurrent interaction between your mutator threads and the collector. The collector is not under your control.
Any GC implementation is a tradeoff in three dimensions:
You can optimise for any of these. Stop-the-world collectors optimise for mutator throughput. They impose no penalty on mutators while they’re running (no barriers), but they introduce latency spikes and large memory overhead (memory consumption grows until the pause). Fully concurrent collectors need to synchronise mutator and collector thread states and so impose a mutator throughput penalty from barriers (as does reference counting, form the refcount manipulations) and from having GC threads running (RC can be thought of as having a GC thread that preempts the mutator thread and runs on every decref-to-zero) but don’t impose a latency penalty. Most GC implementations (including ObjC / Swift’s ARC) are tuned for one or two places in this tradeoff space. Our hypothesis for Verona is that, as programs grow in size and complexity, there is no point in this tradeoff space that is a good fit for all of the program.
Thank you for your fascinating and thorough comments!
I thought I’d been keeping up to date with Verona’s docs, but apparently not, as a lot of this is new to me! I was particularly surprised that Verona was exploring the idea of making regions immutable.
Is there anything public that I can read on this? I’d love to compare it to our approach for Vale at [0], where we can immutably open a mutex, or implicitly freeze all pre-existing memory when entering a pure function.
Also, where is the Verona team based? If it’s in the US, I’d love to stop by and buy a round for them all and talk regions!
https://verdagon.dev/blog/zero-cost-refs-regions
Freezing regions to create immutable graphs has always been part of our abstract machine. It’s something that you can do only if your type system has viewpoint adaptation, to allow you to express the idea that an object is immutable because the object whose field you read to access it was also immutable.
Not a huge amount, we’ve open sourced it to collaborate with a few universities but we’re not great at documentation yet.
We’re in Cambridge (UK).
Sorry, what does “in the first instance” refer to here?
First-order effect: it enables later optimisations, but without those it makes the false sharing worse. As second-order effects, we believe, the combination lets us avoid any tracing after the freeze and makes it easy to elide refcounts via simple local dominance analysis; but as a first-order effect, it increases false sharing on refcount manipulations.
Yeah, the sort of Occam’s Razor explanation of Apple going this direction is that your initial reference counting implementation is likely to have decent latency characteristics, which is handy if you want 60fps UI updates on a phone, and then you go back and work to improve throughput.
With GC you can get good multithreaded throughput earlier, but there’s a long road to combining that with almost-pauseless operation.
And other factors: trying to be compatible with C complicates moving collectors, and like you mention GC really benefits from spare RAM, which was not really available on early smartphones.
To me, an area that seems interesting to explore is that Rust has shown that humans can handle annotating their code with a lot of info on lifetimes and what is or is not shared. We’ve seen how that works out with Rc/Arc as the main escape hatch for lifetimes that are hard to capture statically (an almost inevitable choice for Rust since it’s trying to work in a systems context). It doesn’t seem like it’s been worked out quite as far on the GC side.
Having implemented a moving collector for C, I would go a step further: you cannot have a moving collector without breaking large amounts of real-world C code or imposing very large performance penalties. In C, it is very common to cast a pointer to an integer and then use that as a key in a hash table or tree. Trees are kind of okay, because you just need to keep a stable ordering, so a moving collector that compacts within an arena works. Hash tables are not, because address equality is also used for identity. Identity is where the problems start.
In Java, for example, there is a hashCode method that returns the address of an object the first time that it’s called. Subsequent calls will return the address that the object had on the first call: if the object is relocated, the old address is stashed in the header. This is fine in Java because hashCode is not guaranteed to be unique, only to have a low probability of collision. It’s also possible in Java because arrays are objects with an indexing operator. In C, you cannot grow an object during relocation because it would break array address computation, so you’d need to unconditionally allocate space for the stashed address (and, because arrays don’t really exist in the C type system, you can’t use the index of an object in an array to find the address of the start of the array and amortise this cost across arrays). Worse, you can take the address of fields in C, so this problem gets a whole lot worse if you’re using field-address-equality comparisons to define identity and an object is relocated.
To support a moving collector in C, you’d need (at the very least) to add padding to every object and make pointers fat pointers so that interior pointers can find the start of the object. That would be a completely new ABI that broke assumptions a lot of C code makes.
There is another option: virtual memory tricks. (Interleave sparse pages with disjoint coverage.) I came up with this scheme on a lark, and was shocked to find an actual academic paper had been written on the subject. I don’t have a pointer handy, but I presume it was written by Emery Berger.
There is an interesting alternative to this approach, which I believe is used by SBCL: when objects are relocated, tell hash tables containing them to rehash themselves. This is nice because you can still use an object’s address as a hash. But SBCL only has a copying nursery; its oldspace is immobile; I expect that for the more advanced java gcs, the overhead of rehashing tables in oldspace ends up not being worth it.
I find it rather unfortunate the extent to which c programs rely on vendor-specific behaviour. It is of course valuable that mcu vendors can provide bizarro compilers, but this is of fairly little significance to most application programmers, and somewhat limits the use of the standard for the latter. Because it is not practical to change the entire body of existing c code, I would like to see some standardisation (whether optional, unofficial, whatever) of these language features which are exploited by nearly every real c program. Appendix J does this in principle, but not in practice.
Emery Berger did some fairly recent work in this space, but the original paper proposing this was actually written by Vikram Adve, Chris Lattner’s supervisor.
That works if you can track everything that depends on the address. This isn’t the case for C, where you can cast a pointer to a long and then lose track of that integer value under an unbounded amount of data flow and type punning.
Take a look at our PLDI paper from a few years back.
(Random thought: maybe the huge caches on apple cpus are in part a way of compensating for the locality loss incurred by the lack of a moving gc?)
So many words are dedicated to explaining how users have control of the machine. But at scale, even at a few thousand lines of code, users must constantly trade away machine control in favor of uniform simplifying abstractions.
The main problem with reference-counting, and the reason it’s not found in JIT toolkits, is that counting references is not compositional. Garbage-collected algorithms become compositional because objects are uniformly managed beyond the user’s control. Combine this with the phenomenon that garbage-collection nurseries can sometimes effectively avoid allocating permanent object storage, and we have not just a uniform abstraction which simplifies memory management, but a complete memory-safety solution.
(Yes, I know that algebraic effects can be used to manage references with composition, but Swift does not do that, and Lattner does not discuss it.)
I don’t disagree that rc is less modular than tracing gc, but what’s the connection to JIT?
At runtime, JIT-compiled code has to check its invariants. For example, every object about to be referenced should be alive. This is easy, even trivial, for runtimes with GC; the ability to reference an object implies liveness. But with RC, each object has to be examined, sometimes multiple times per compilation unit, because RC can reap an object despite native code referring to it.
Sounds a lot like barriers checking if an object has been moved!
on this topic: https://arxiv.org/abs/1908.05647