1. 29
  1. 44

    This article contains a lot of errors, I’m going to just start with the first one:

    No, it isn’t. It’s an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all

    This is true only in the uncontended case. The problem with reference counting is that it turns reads into writes. On a modern CPU, an atomic in the local cache is very cheap but an atomic in a remote cache line can be expensive and this cost goes up with larger core counts. An immutable (or, not mutated) object can be in every core’s cache simultaneously in shared state but its reference count will ping-pong between all of the cores that are incrementing and decrementing it. If the refcount is in the object header, this will invalidate part of the object in other cores, causing cache misses even after the object had been acquired. If the refcount is elsewhere then you now suffer from locality problems. You can avoid this with deferred reference counting schemes, but then you lose the deterministic deallocation of reference counting.

    Both reference counting and tracing have poor performance in certain situations and it’s important to understand them if you choose one or the other. It’s also important to read the unified theory of GC paper that was recently posted here to understand the degree to which most modern automatic memory management schemes are a hybrid of the two.

    1. 10

      Supposedly the Apple Silicon ARM systems have some secret sauce that makes atomic operations extra fast — someone noticed this in a microbenchmark soon after the M1 came out. Given how reliant Apple systems are on ref-counting, this makes sense.

      1. 8

        I have no knowledge of how Apple does this, but as someone who works on hardware-language co-design, this is what I’d do:

        Apple stores the refcount (on the fast path) in some of the spare bits in the isa pointer in Objective-C. This means that it will be in the same cache line as part of the object. In addition, the semantics of Objective-C and Swift mean that a decref (release) will not be followed by any access to the object (other than, potentially, to deallocate it completely) and an incref (retain) will always be on an object that is not being concurrently freed (if it is, it’s UB and it’s fine to crash). This leads to a possible cache-coherency protocol that can make this fast:

        You want to perform ARMv8.3 atomic compare and exchange on the remote cache line. Rather than doing a broadcast invalidate, move to exclusive, and cmpxchg, you instead do send a cmpxchg message to the owning core and require that core’s cache to do the exchange and return the result.

        For this to work, you need to have an extra shared-but-owned state in your state machine. When a cache line in shared state is copied to another core, you ensure that one is in shared state and one is in shared-but-owned state. The exact best way to do this depends on your NoC architecture. If you have guaranteed bounded latency for cache-coherency messages then you can do some optimisations here.

        Now, when you do the atomic compare and swap, you send a broadcast message[1] to do the update. The lower-layer inclusive cache snoops on this and, if the line isn’t in any L1s, handles it directly, including doing a fetch from memory. If it is handled in the L2/L3 then, as an optimisation, you could also send the line in shared-but-owned state to the L1 of the requesting core: if it’s done a refcount manipulation and hadn’t accessed the object then there’s a good chance that it will again soon.

        The weak memory model on Arm lets you do a lot of other memory accesses in parallel with this. In particular, non-atomic reads of the same cache line are not ordered with respect to the atomic write in the absence of another explicit barrier and so you’re free to keep the stale version of the data in the shared-state caches until they receive the broadcast message to update it.

        It’s possible that you can do an atomic increment for retain, rather than a compare-and-swap, depending on which bits they use (I think they use the high bits?), which means that a retain then has a (predicted-not-taken) branch-on-overflow following it and you just speculatively execute everything after it and roll back in the very rare case of overflow (at which point you’re acquiring a lock and storing the refcount in a separate table anyway).

        Deallocate would also be a slow path, but if you do a little bit of value prediction to feed your branch predictor then you can probably improve this. If you have the released object in your cache line, then you predict the branch to the deallocator based on what a decrement on the local copy’s refcount would be. If two threads drop a refcount from 2 simultaneously, then one will be mispredicted, but otherwise this will always predict correctly. If you race a decref with an incref (UB anyway) then you’ll also see some mispredictions sometimes but that’s UB in the language anyway.

        [1] If you have a cache directory or similar then this can actually be a point-to-point message.

        1. 1

          I don’t quite follow. If some object is owned by another core, presumably that core is using the object, likely twiddling its reference count. So by the time your cas request arrives, there’s a good chance the reference count will have changed already, and you could fail repeatedly.

          1. 1

            I think you can use atomic increment/decrement here, which can’t fail. Not sure why the example was cas.

            1. 2

              It’s difficult to use atomic inc/dec here because the reference count bits are stored in only some of the bits of an object and you have to handle overflow / underflow in the presence of concurrent mutation. You might be able to use the almost-top bits in a pointer, so that overflow doesn’t affect the other bits, but then you have some fun corner cases.

              For retain it isn’t too bad. When you hit the overflow path, then you can store a slow-path refcount out of band and use the in-pointer value as the low bits.

              This adds some complexity to release though, because now you need to handle the case that the refcount dropped to 0 but the pointer doesn’t store the whole refcount. Ideally, you’d just call dealloc as soon as this happens (after checking the bit that indicates that there are no weak references) but now you need to acquire a lock to check the external reference count.

              From reading a random blog, it looks as if Apple stores the refcount in the top bits and uses the bit below to indicate that there’s an external reference count. I don’t know how they set that bit atomically with respect to other operations. It’s easy if you use a CAS, because you can just say ‘am I adding to 0xff, if so acquire a lock, increment the external refcount, and set the refcount to 0’, but with an atomic increment you can hit the sequence of:

              1. Thread 1 does a retain and overflows the extra retain count to 0.
              2. Thread 2 does a retain and increments the extra retain count to 1.
              3. Thread 3 does a release and decrements the extra retain count to 0.
              4. Thread 2 does a release and underflow the extra retain count, triggering dealloc.
              5. Thread 1 sets the bit in the pointer to indicate that there’s an overflow retain count.

              I might be missing something here, but I don’t know how you can avoid this kind of problem without a CAS on at least one path.

            2. 1

              If some object is owned by another core, presumably that core is using the object, likely twiddling its reference count

              Hopefully not. The cost of one round-trip between cores is on the order of tens of cycles. If you’re doing only this much work in between retaining and releasing an object then you may be better off using a value type and copying it.

          2. 1

            Yeah, it is really amazing. I have done some tests and atomic refcounts on my M1 totally blow my X86 machine out of the water. On top of that, my M1 tends to crush my X86 machine on anything memory intensive due to how much less delay it’s main memory has. Specs wise, the x86 machine should be way faster than the M1.

            1. 1

              Main memory has a similar latency to standard x86 parts. Probably what you’re observing is a combination of great memory bandwidth and really big ROB which can do a better job of hiding the latency.

              1. 1

                No, it’s not memory bandwidth/speed. It’s specifically faster at uncontended atomic operations because the apple hardware does something fancy - presumably cpu word soup involving “look aside”, “bus”, and other words/phrases I am aware of but know no actual useful specifics ()

                The gist of it is that an uncontended atomic operation is very fast, and in the contended case it regresses to somewhere in the realm that you would traditionally expect - I would guess it’s something that intel could do, but hasn’t had reason to, whereas apple’s OS APIs are built around objects with atomic refcount so presumably it is more valuable for Macs specifically than other platforms? (https://twitter.com/Catfish_Man/status/1326238434235568128?s=20)

                1. 2

                  Oh, for atomic ops, sure. But the parent said:

                  On top of that, my M1 tends to crush my X86 machine on anything memory intensive due to how much less delay it’s main memory has

                  which is what I was responding to.


                  I would guess it’s something that intel could do, but hasn’t had reason to

                  No need to guess! There is a direct analogue in x86 land.

                  x86 mandates tso architecturally, the result of which is that all stores are ‘atomic’ (=strongly ordered), the result of which is that uncontended ‘atomic’ stores are free on intel, since they are just regular stores. On other architectures, it’s frequently the case that regular stores are weakly ordered, so strongly ordered stores are expensive, even when uncontended.

                  1. 1

                    which is what I was responding to. ah, sorry I misunderstood :-O

                    x86 mandates tso architecturally oh boy do I know this

                    the result of which is that all stores are ‘atomic’ (=strongly ordered), the result of which is that uncontended ‘atomic’ stores are free on intel

                    That’s not been my experience, but I haven’t really dealt with such on intel hardware more recent than 2019 I think? (certainly that’s the vintage of my current laptop) is this a newer hardware thing? My experience has been for at least the types of atomic operations I was using uncontended operations were significantly slower, and closer to the performance of contended. Could this be r/w/m vs store only?

                    1. 2

                      r/w/m vs store only?

                      Yes, that’s what I meant—I guess I was unclear. Uncontended rmw is fast on m1 because it has to be; uncontended store is fast on x86 because it has to be.

                      1. 1

                        oh boy do I know this

                        Curious what the context is here. Were you in charge of making an x86 implementation? Or had to port an x86 app elsewhere and fix everything that was not specified strongly enough in the source?

                  2. 1

                    I thought main memory was integrated onto the chip instead of being on a separate ram stick which reduced latency. Is that not correct?

                    1. 1

                      That is my understanding - they get to have obscene amounts of bandwidth, and also remove a pile of latency due to not needing to deal with the same capacitance and leakage issues that separate chips, or worse pluggable dimms - but perhaps @Moonchild has more accurate information (based on their detailed comments elsewhere)?

                      1. 1

                        This graph from anandtech seems to show ~110ns main memory latency, which is also about what I would expect on an intel.

                        My understanding, though I am not a hardware person, is that the bulk of the latency of dram comes from the storage medium itself, so moving it closer doesn’t help much.

                  3. 1

                    I believe that’s in the uncontended case - I think in the contended case it’s theoretically impossible to not have terrible performance. But my last time dealing with theory of concurrency multiple cores were not a thing :D

                  4. 7

                    Indeed the poster child for reference counting was Objective-C, in which every class object has a reference count and from which we get ARC. Apple would say it’s more predictably performant than GC, but that didn’t mean it was free. In its successor language Swift, Apple specifically designed to reduce reference count writes—“retain/release traffic” as they call it—by relying much more on value types and copy-on-write reference object backing stores. Today when you write a new type in Swift, you choose whether it will be the reference-counted kind or the copied-value kind. If you are mindful of performance, retain/release traffic should figure into your decision. My favorite discussion of the tradeoffs is Understanding Swift Performance.

                    1. 1

                      Yeah, there are costs to GC, but this article just seemed very wrong to me. I just didn’t feel like doing numbers on current hardware to verify before complaining :)

                      My direct experience with swift is with a multithreaded ray tracer (writing raytracers is how I learn languages), and ref counting overhead was about the single largest cost. To the extent that I was having to spend all my time on optimizing by using the various unsafe internal functions to circumvent the ref counting. Maybe it’s different without such heavy contention, but that was my experience.

                      This kind of jibes with my experience trying to make refcounting in WebKit’s stringimpl atomic, and even just that showed up as a measurable (>1%) performance regression across the whole browser benchmarks, not even micro benchmarks.

                      But in the end I think that the post reflected a failure to recognize the real perf cost because it’s generally hard to see the death by a thousand cuts nature of it, whereas scanning GCs can have bad cases that can trivially show up as a noticeable pause.

                      Alternatively they have experience from the GC in 90/2000s Java and are assuming nothing has improved since.

                      1. 2

                        The Project Maxwell technical report from Sun Labs had the best description of the tradeoff that I’ve read. Any memory management scheme is a five-dimensional trade-off between:

                        • Throughput.
                        • Latency.
                        • Determinism.
                        • Memory overhead.
                        • Programmer effort.

                        If you want to optimise for throughput then a stop-the-world GC often performs best because there’s no latency or programmer effort overhead outside of the stop-the-world phase and the collector doesn’t have to have to worry about concurrent mutation when it runs and so can be very fast. They gave an example of a stop-the-world GC that ran once per day as an extreme example. iOS uses the highest-throughput policy of all: kill the process and restart. This doesn’t need to do any graph walking, it just reclaims and zeroes all of the pages assigned to a process and lets it allocate again from the start.

                        1. 2

                          Yeah, when working on migrating JSC’s GC from being stop-the-world avoiding overall perf regressions caused by pretty much everything involved in not being stop-the-world required a huge amount of effort.

                          1. 1

                            Do you happen to have a link or citation of the Project Maxwell technical report from Sun Labs? Thanks

                            1. 2

                              It used to be on the Sun Labs website, but I think it was removed. I have a copy that someone from there shared with me, but I don’t know if I’m allowed to redistribute it.

                              The project was one of my favourite hardware/software co-design projects and it’s a shame that internal politics between the Rock and Niagara projects prevented it from ever being merged into SPARC. They observed that they have a young GC generation that is full of recently-accessed short-lived objects and a cache that is full of recently-accessed objects, so they did in-cache hardware GC in the cache and provided barriers for objects that escape to be tracked and handled by a software GC.

                              1. 1

                                I found https://dl.acm.org/doi/10.1145/1168054.1168056, which can be accessed via a certain raven-clad website. I don’t know if that is ‘the’ technical report; @david_chisnall ?

                                1. 2

                                  No, looks like an unrelated Project Maxwell. The one that I’m thinking of was a spiritual successor to Manchester’s Mushroom project in the ’80s, trying to do hardware-assisted garbage collection on modern multicore SPARC hardware.

                                  1. 2

                                    How about this? It is from sun, and describes a hardware-assisted gc (including the in-cache collection mechanism you mention), and cites a ‘mushroom project’. But it does not seem to include any description of tradeoffs in memory management schemes.

                                    1. 2

                                      That’s definitely the project, but I think that’s the one paper that they wrote about it. They also had a much longer technical report, which I haven’t seen on the Web anywhere.

                        2. 15

                          This is in that class of writing that I mostly agree with but which contains enough mistakes that I feel compelled to nitpick…

                          This statement, however, is right on:

                          The promise of the GC was that programmers would not have to spend time debugging memory leaks and other such issues, but I witnessed deployments where probably as much developer time was spent tuning an uncontrollable GC

                          I’ve seen a whole hardware platform killed in part by GC: Sun’s JavaStation. A poster child for the early Java hype, it ran an OS that was written in Java down to nearly(?) the kernel level. In particular the whole graphics stack was Java. It also shipped with too little RAM (32MB). The result was it spent a lot of time in GC, and when it did so the entire system locked up. You couldn’t even move the mouse pointer. If you dropped in more RAM it just delayed the inevitable and the lockups lasted longer.

                          Cycles are still detectable at runtime and I think even at compile-time

                          Not at compile time, unless it’s some new advance I haven’t heard of. State of the art ref-count systems still cycle-check at runtime.

                          Arguably, this is the reason we have garbage collectors in the first place, because people got hung up on this issue before there was a better solution. Then they just kept building more garbage collectors.

                          GC predates ref-counting. LISP was the first system that used it, and a CONS cell is just two pointers so adding a ref-count to it would be catastrophic for memory usage. I suspect ref-counting only appeared when a program could afford to “waste” several bytes per object.

                          Not really. With an 8-byte counter you will never overflow. So…you know…just expand up to 8-bytes as needed?

                          That’s a big size penalty to add to all your heap objects, and you’ll pay for it in cache misses. Apple’s Obj-C runtime uses purely external ref-counts stored in a big hash table, which seems bad to me but that part of their code has been exhaustively profiled and cycle-level tuned over the decades. I think Swift reverted to in-object counts but I’m not sure. In my own code I’ve just used 32-bit counts.

                          1. 7
                            1. I used compile time cycle checking in my last compiler to detect cases where I needed to generate extra code to handle cycles in my reference counted language.
                            2. GC does not predate ref-counting, since the first paper to describe GC [McCarthy, Lisp] and the first paper to describe RC [George Collins] were both published in 1960.
                            3. GC wastes substantially more memory than RC. In order to get good GC performance you need plenty of free memory. If you are short of memory then the GC will be running constantly and performance will plummet.
                            1. 2

                              GC does not predate ref-counting, since […] both published in 1960.

                              That’s a generous interpretation since McCarthy’s paper was in the April issue of CACM (v3 i4) and Collins was in the December issue (v3 i12) - fully 8 months prior. Also Collins’ paper seems (to me) to be directly in response to the McCarthy paper - making it a clear antecedent, no?

                              Two methods have been described so far in the literature for the solution of this difficulty, each of which has certain disadvantages or limitations. As a result we have been motivated to devise a new method, described in this paper.

                              [edit: swap ‘previous’ for ‘prior’]

                              1. 2

                                The original comment speculated that RC appeared much later than LISP, when memory sizes were much larger. The 1960 GC and RC papers were both descriptions of prior work. I recall that LISP was first implemented in 1959 at MIT. Collins was working at IBM in New York, and his reference counting implementation was in support of “the development of a program for a modified form of A. Tarski’s decision method for the elementary theory of real closed fields”. Collins cites his own 1957 paper about this same program. So I think it is possible that Collins was first.

                            2. 3

                              GC predates ref-counting

                              Incorrect. lisp used a GC, but they explicitly considered refcounting as some other system used before going the gc root.

                              That’s a big size penalty to add to all your heap objects, and you’ll pay for it in cache misses.

                              What? GC tracing clobbers the content of the cache, and a word sized increase would need a lot of small objects in order to make up for such. You pay a lot of cost due to objects becoming sparse in memory as release of dead objects is deferred. If you have a compacting collector you can sometimes reclaim some of the cache misses the GC induces, but it’s not guaranteed.

                              It’s also wrong to say that cons is only two pointers, it’s generally going to be at least three as every LISP from 1.5 onwards supported more than cons and cdr.

                              The reason the LISP 1 opted to make (and invent?) a mark/sweep garbage collector was not to save memory. It was because their object representation only had 6 bits available which was deemed too small to be used as a refcount. The GC that was used essentially got to assume the entire memory of the machine was available to it, that memory was then turned into a single free list. In a modern implementation this would require you specifying a heap size as creating a linked list of 2^^63 nodes would result in a significant launch time regression. The GC itself was to my recollection what we’d call an exact stop the world mark and sweep collector, with the mark bit stored in each object. Another early lisp used side bit tables and compacted but I can’t recall whether “early” means 1.5 and 2, or the 80s. The important thing here is that by definition all dead objects are “wasted”, but the exact marking means that if the free list is empty when an allocation is required if there are any dead objects they will be reclaimed and the allocation will succeed. That said the original lisp also burned 30% of its runtime on GC.

                              The way a garbage collector reduces the runtime performance cost is by over provisioning the heap by very large amounts. Copying collectors are even more explicit as they reserve twice the memory that they actually make available.

                              In all cases the performance of the GC decays exponentially as the proportion of the heap that is live increases.

                              Apple’s Obj-C runtime uses purely external ref-counts stored in a big hash table,

                              No. In almost all cases the refcount is inline in the object. Where lisp decided that 6bits was insufficient for a ref count and went to a GC that consumed 30% of runtime, objc said this some small number of bits is more than enough in all normal code, and falls back to the hash table if the refcount ever hits the maximum value. Swift interacts transparently with objc so maintains the same refcount ABI, which means there is also a count for weak references as well, but I assume that would always be in a side table because how frequent can that possibly be?

                              Now I’m not going to say the memory used in a refcount is free, it clearly isn’t - that’s why objc uses a side table for large retain counts, however it is absolutely wrong to claim that GC doesn’t add an even larger memory hit.

                              1. 2

                                before going the gc root.

                                I see what you did there

                                1. 1


                                2. 2

                                  Thanks — you clearly know a lot more about LISP than I do. I was unaware of earlier programs using ref-counting.

                                  What? GC tracing clobbers the content of the cache, and a word sized increase would need a lot of small objects in order to make up for such.

                                  I was comparing the overhead vs. using a smaller ref-count field, not vs GC. Apple apparently found that the overhead was bad enough to go to the alternative of external ref-counts.

                                  Swift interacts transparently with objc so maintains the same refcount ABI

                                  Only for classes explicitly declared @objc … by default, Swift classes have their own somewhat different ABI.

                                  1. 2

                                    Thanks — you clearly know a lot more about LISP than I do. I was unaware of earlier programs using ref-counting.

                                    You can find a lot of information about it - including the original papers and operating instructions - online, a highly recommend reading at least the “memos” (which seemed to be an MITism for “brief design doc” rather than a post-it :D). The bigger “how to use LISP” docs are also interesting as they also give some high level implementation info.

                                    They’re also kind of fun to read, because of course this is by definition before there was real data structure thingy (docs? theory? I am failing to recall the word I want here). So the description of a linked is charmingly Baroque (yet my brain furbished me with this word).

                                    I was comparing the overhead vs. using a smaller ref-count field, not vs GC.

                                    Ah, rereading I see how I misinterpreted it. An 8 byte refcount is indeed a tiny bit excessive. WebKit uses 32bit pointers because going smaller is not really useful in the general case - IIRC StringImpl (which is the actual refcounted data backing the String struct you’d normally use) stores a bunch of info in the 4 bytes that also contain the refcount.

                                    Apple apparently found that the overhead was bad enough to go to the alternative of external ref-counts.

                                    In the context of the above, I think what you were saying is “A full word for the refcount was too expensive to have inline”, is that correct? if so I apologize for again misunderstanding.

                                    Apple apparently found that the overhead was bad enough to go to the alternative of external ref-counts.

                                    For the full arbitrary size refcount, which most objects never need, but my comment was in the context of comparing object size w/o a count in a GC environment, which was apparently not what you were comparing

                                    Only for classes explicitly declared @objc You are correct, I was confused by RefCount.h which is both large and references storing the refcount in the isa pointer

                                    It turns out that the swift object class for objc overrides retain/release, and if it can’t prove that something is not an objc object it calls a helper function instead of inlining the refcount logic.

                                    1. 2

                                      A long time ago I read Ivan Sutherland’s 1961 Ph.D thesis about Sketchpad, the first interactive drawing program. It was an amazing system for its time and he used some really clever data structures.

                              2. 14

                                This is a pretty bad take. Mark & sweep garbage collection (GC) and ref-counting (RC) each have their costs and benefits and this article basically ignores the costs of RC and the benefits of GC.

                                I spent years in the early 2000s dealing with the reference cycle leaks and use-after-frees in Gecko’s large ref-counted C++ codebase. It wasn’t my job to improve Gecko’s memory management, but the limitations of ref-counting were forever present when trying to add or improve functionality.

                                I was intrigued to learn of the Oilpan project in Chromium which replaces much of the GC in Chromium’s Blink layout engine. It allowed them to eliminate many potential use-after-free vulnerabilities and reference cycle leaks, but perhaps most intriguingly it allowed for improved performance in real-world cases.

                                A mark & sweep GC will generally have to do more computation than RC, but the key with GC is that you get to choose when to do that work and how long to spend doing it.

                                In a browser layout engine most allocation and computation occurs as a page is being initially loaded, laid out and rendered. Then when the user navigates away the layout engine has to free all of the objects associated with that page.

                                Unfortunately when navigating from page A to page B the a browser using RC will be tearing down the huge tree of objects representing A while its busy building the huge tree of objects representing B. Some browsers avoid this by hanging on to a reference to the previous page’s root objects until after the next page has been presented to the user.

                                Because RC is all done locally you can’t stop a it half way through and restart it later. Simply walking a large tree of objects, decrementing atomics and freeing memory can easily cause a pause long enough to drop frames or glitch audio. Browsers already have their own schedulers so that they don’t for example invoke setTimeout handlers right when they need to render then next frame. With incremental GC Blink can avoid user-visible performance problems like dropping frames that would be unavoidable with RC, even though GC has to do more work.

                                The trade-off for this improved user-visible performance is that more compute is done and more memory is consumed while the previous page’s objects are held in memory. But for many cases that trade-off is worth it.

                                1. 2

                                  RC in C++ using std::shared_ptr is a different proposition than RC in a memory safe language that is designed for safe and seamless RC. In the latter case, the compiler knows about RC and can do a lot of tricks to minimize cost.

                                  High performance GC uses a lot of tricks to achieve performance. If the same amount of work is put into optimizing RC, using an optimizing compiler that emits different kinds of refcount management code in different situations, then you can do far better than what’s possible using a smart pointer class in C++.

                                  In C++ using std::shared_ptr, it’s true that RC is all done locally and you can’t stop it halfway. That isn’t true in a compiler supported and optimized RC runtime. You can use lazy RC, for example. Deferred RC is also nonlocal.

                                  Furthermore, with the approach I’m describing, you can eliminate the need for atomic operations for RC update in multithreaded code. There are several ways to do lock-free RC.

                                  For more information, the Wikipedia article is a start but there’s a lot more information about techniques in papers that you can find with a Google Scholar search.

                                  1. 2

                                    RC in C++ using std::shared_ptr is a different proposition than RC in a memory safe language that is designed for safe and seamless RC. In the latter case, the compiler knows about RC and can do a lot of tricks to minimize cost.

                                    Note that this is an implementation issue and not a language one. This kind of thing is one of the motivations behind the work to add a C/C++ MLIR dialect in between the clang AST and LLVM IR: it would allow the reference counting operations on std::shared_ptr to be carried through in abstract form, the same operations that are done on Objective-C reference counts applied, and then be lowered to the real version later.

                                    The behaviour of everything in the std is effectively part of the language and so compilers are free to optimise assuming that they understand the semantics of these interfaces.

                                    1. 1

                                      That sounds like a win for most programs. Unfortunately, it also suggests that the RC code I wrote using boost::intrusive_ptr won’t benefit from the optimization. Also, what does the C++ code for the standard library look like when these optimizations are available? Are there magic pragmas present in the source code that provide hints to the optimizer? Can code outside of the standard library use these pragmas?

                                      1. 1

                                        The idea is that this will be done on the call side, so the flow will look something like this:

                                        1. In the source code, you’ll write something that uses std::shared_ptr and implicitly contains calls to the copy constructor and move constructors for std::shared_ptr.
                                        2. In the high-level IR, these will be lowered to built-in operations for incref and decref on a shared pointer.
                                        3. During early optimisation (and after some inlining) these will be optimised so that, for example, incref, incref, decref, decref sequences become simple incref, incref sequences and if one shared_ptr has a lifetime that outlives another then the inner one can be completely elided.
                                        4. In lowering to LLVM IR, the incref instructions will be lowered to calls to the copy constructor and decref instructions to calls to the destructor for std::shared_ptr.
                                        5. Late inlining will then inline the real implementations of the standard library implementation.

                                        This means that the optimisation is completely oblivious to how the shared pointer is actually implemented, as long is it follows the semantics. The same sort of thing can be done with standard collections if you do an insert and then a find and dereference the iterator, for example.

                                        1. 1

                                          Does this mean that the compiler has hard coded knowledge about std::shared_ptr specifically, and any other reference counted shared pointer implementation that doesn’t have this exact name cannot benefit from these optimizations?

                                          1. 1

                                            Yes, in the same way that it already has specific knowledge of functions like printf and can turn them into calls to puts if it can evaluate the arguments to constants during optimisation. Compilers may also provide annotations to allow other calls to opt into this kind of thing, as they do with some other standard library functions.

                                    2. 1

                                      What memory safe languages take a different approach to RC? I understand that in ObjC and Swift function local reference increments and decrements can be elided by the compiler, but where is deferred/lazy RC in use? That seems pretty interesting.

                                      1. 1

                                        The “RC pauses” that several commenters are mentioning are just the same pauses seen in C++ and Rust programs when a hierarchical data structure is recursively freed. Not a problem in most cases, but can be significant in video game programming because an expensive call to free() could be undesirable in the once-per-frame loop.

                                        I’m not aware of a language that uses lazy RC, and I’m skeptical about it because it adds cost, especially there are extra memory requirements to keep track of where you left off while releasing a large data structure. In the cases where you need this level of control over latency, you won’t use a high level reference-counted language, you’ll use a low level language like C++ or Rust where you have fine level control over memory operations.

                                        I mentioned deferred RC as a “non local” implementation technique, different from the totally local code emitted by a C++ compiler when you use std::shared_ptr. Deferred RC elides reference increments and decrements based on a global view of what’s happening within a function. It’s probably the same as what you say ObjC and Swift are doing.

                                    3. 2

                                      A mark & sweep GC will generally have to do more computation than RC, but the key with GC is that you get to choose when to do that work and how long to spend doing it.

                                      ahahaha that is my general opinion on the advantages of RC over GC. Or if not choose, then at least predict. XD

                                      1. 7

                                        With RC you have to make all decisions locally whereas with GC you have to make them globally. There are advantages to each, but in a sufficiently complex system having some kind of centralized decision making is advantageous.

                                        It’s vaguely analogous to cooperative vs preemptive multitasking. Cooperative multitasking requires less computation and the interactions between tightly coupled components can be made to perform better, but we find that it’s hard to build complex systems that actually perform as well in practice, so we pay the cost of preemptive multitasking in almost all operating systems.

                                        1. 4

                                          Yep. I’m used to thinking about it in terms of video games where there’s only two times you worry about latency: Either on loading/switching levels, or every single frame. So in that case you usually want to keep the control local, because there’s few out-of-the-box GC’s that give you enough global control.

                                          It’s the exact opposite of the browser situation, you can do whatever you feel like when unloading one scene and loading another, but what you can’t do it is handle it in the background ‘cause there’s really not much time between rendering deadlines.

                                          1. 1

                                            In that kind of case I guess you need to be careful to hold onto all references until the end of the level to avoid inadvertently causing a RC pause. In the GC world you get to be more explicit about when you do the memory management work which might be advantageous in the games context.

                                    4. 10

                                      This is a bad article that lacks any data or evidence of concrete experiences. I’m inclined to mostly dismiss it, but it keeps popping up for some reason.

                                      Simply asserting that you’re not going to provide any data doesn’t change that!

                                      I’ve already stated I’m not going to do benchmarks.

                                      Lots of comments here on it, basically none of which support the article: https://news.ycombinator.com/item?id=32276580

                                      I also don’t think that GC and ref counting are directly comparable, because the interface to native code is quite different. A big issue is that reference counting is more “littered all over your code” than GC is.

                                      With ref counts, you have to think about every assignment and parameter binding. With tracing GC, you have to annotate all your roots (global and stack).

                                      Here is a somewhat flamey comment which makes that more concrete, i.e. having to put a tracing GC on top of bad ref counting in practice:


                                      Case 3.: You don’t “want” the constraints of Case 2, but are in practice forced into them due to a huge, poorly-educated developer base being incapable of writing correct refcounted code or even knowing what a weak pointer is.

                                      To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.

                                      1. 7


                                        I’ll just leave this here.

                                        (From here.)

                                        1. 3

                                          Without a comparison to something else this is not really useful. For example, how long would the same apps take with explicit malloc/free? Why aren’t the GC benchmarks taking more GC time - is RC that efficient? How does it compare to a conservative collector?

                                        2. 4

                                          All the arguments against garbage collection go away when you split your program up into discrete actors with their own heaps, accept that everything is a distributed system these days, has been years, and act accordingly.

                                          [ taps noggin ] Can’t have stop-the-world pauses when you quit pretending there’s a single world.