1. 54

  2. 8

    Great work okpok! I love that you had a question and ventured out to figure out the answer.

    1. 2

      Thank you, I appreciate the kind words!

    2. 6

      I’m definitely not an expert on garbage collection and haven’t thought it through, but the first thought I had was how useful this would be as an “infrequent access” GC’d heap in addition to a normal GC’d heap. So you allocate things in a normal mark-and-sweep heap, and if a thing hasn’t been dereferenced in a certain number of sweeps then it’s worth the overhead of moving it to the Infrequent Access GC’d heap. That way the normal “hot” GC doesn’t have large pauses checking everything, and you only pay the O(log32(n)) cost of dereference for things you know aren’t being accessed frequently.

      Again, I know very little about GC and it’s likely this is already something Generational GCs do. But I can think of many data services I’ve personally written where it seems like that’d be an almost ideal way to handle things (assuming a GC at all). One could imagine a “stow” keyword in a language to explicitly set aside objects you wanna keep around but probably won’t be used for a while.
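
      To sketch the bookkeeping in C (all of these names are made up, just to illustrate the idea): the hot heap’s sweep could bump a per-object counter that every dereference resets, and anything crossing a threshold gets stowed:

      ```c
      #include <stdint.h>

      #define STOW_THRESHOLD 8  /* sweeps without a dereference before stowing */

      typedef struct Obj {
          uint32_t sweeps_untouched;  /* reset on every dereference */
          /* ... payload ... */
      } Obj;

      /* Called on every dereference of the object's handle. */
      static void note_access(Obj *o) {
          o->sweeps_untouched = 0;
      }

      /* Called by the hot heap's sweep for each surviving object; returns
       * nonzero when the object should migrate to the cold ("stowed") heap. */
      static int should_stow(Obj *o) {
          return ++o->sweeps_untouched >= STOW_THRESHOLD;
      }
      ```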

      1. 7

        Thank you for commenting! I was trying to fully remove mark-and-sweep from the main thread, while also providing compaction/defragmentation. I wanted to get down to small O(1) pause times, even at the expense of making dereferences of handles more expensive, because I felt that this would be beneficial for the ergonomics of a language. (Once one accepts that dereferences are a little more expensive, their cost is distributed evenly throughout the program, and GC pauses essentially go away / are free.) I think this is a sensible tradeoff. There may be other useful tradeoffs to make, where perhaps mark-and-sweep still exists to some degree on the main thread, as you’re suggesting, but I imagine present generational garbage collectors may already hit the mark (no pun intended) there.

      2. 3

        What are the benefits with respect to another high-performance GC? O(1) allocation and de-allocation is the norm.

        1. 7

          Thank you for asking. This implementation provides short O(1) pause times, regardless of working set size, and it also compacts. Maybe such things do already exist, but I couldn’t find any when I searched while thinking about this project, so I thought this would be new. I’m happy to compare to another implementation if you could point me to one.

          1. 3

            I see. Given the expensive read barrier, I think there are other ways to achieve similar characteristics with better trade-offs (I can find references tomorrow, but it is a bit late here, sorry). However, your design seems to make a lot of sense for a compacting database (the expensive read barrier can be factored into the cost of accessing external pages, though storage overhead might be a problem).

            Small question: when the GC thread is active, what happens if the main thread fills the region before the GC can complete?

            1. 1

              Thanks, I’ll be happy to read them. Regarding your question, I use the term “region” in the code to refer to a range within the CB/ring-buffer. That region is allocated by the main thread for use by the GC thread, and then the main thread never touches it. If you’re referring to the CB/ring-buffer itself, as in “what happens if the main thread fills the CB before the GC completes?”, then the answer is that the main thread will resize the CB upwards to the next power of 2 and copy all of the contents of the old CB (including the still-being-written GC consolidation region); then, when the main thread observes the GC to have completed, it will copy forward the data from the old CB’s consolidation region to the equivalent range in the new CB. In this way, the GC thread only knows about the CB that was given to it at the start of the GC/consolidation, and the main thread is responsible for the shift to a larger CB (see the sketch below).

              Thanks! –Dan
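
              A rough sketch of that resize, with made-up names (it ignores the ring-buffer’s modular arithmetic and the synchronization details):

              ```c
              #include <stdint.h>
              #include <stdlib.h>
              #include <string.h>

              typedef struct cb {
                  uint8_t *data;
                  size_t   size;               /* always a power of 2 */
                  size_t   consolidate_start;  /* region reserved for the GC thread */
                  size_t   consolidate_len;
              } cb_t;

              /* Main thread: grow to the next power of 2 while the GC thread keeps
               * writing into the OLD cb's consolidation region. */
              static cb_t *cb_grow(const cb_t *old) {
                  cb_t *neu = malloc(sizeof(*neu));
                  neu->size = old->size * 2;
                  neu->data = malloc(neu->size);
                  neu->consolidate_start = old->consolidate_start;
                  neu->consolidate_len   = old->consolidate_len;
                  /* Copy everything, including the still-being-written consolidation
                   * region; that range gets re-copied once the GC is seen to finish. */
                  memcpy(neu->data, old->data, old->size);
                  return neu;
              }

              /* Main thread, after observing GC completion: copy the finished
               * consolidation region forward from the old cb to the new one. */
              static void cb_adopt_consolidation(cb_t *neu, const cb_t *old) {
                  memcpy(neu->data + neu->consolidate_start,
                         old->data + old->consolidate_start,
                         old->consolidate_len);
              }
              ```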

          2. 2

            Yeah, it looks like an exercise to write such a GC. Other pauseless GCs have similar characteristics and are arguably cheaper in many cases. This implementation requires a hash table on the side for lookups, and most likely a read barrier (to rewire references to objects moved from zone A to B to C).

            1. 3

              Thanks for taking a look! Yes it does require a read-barrier for the objects which have shifted locations. I’d be interested if you could link me to the best implementations you know of.

              1. 3

                Reading what you did led me immediately to Azul’s Pauseless GC. Both do tons of work on the read side to ensure the GC thread can finish marking and compacting in a predictable way. In their case, I think they trap pointer reads to outdated pages to update the reference (in the compacting phase).

                Edit: thinking through their marking phase and yours (from my understanding, your zone B is for marking): one thing I don’t quite understand is how you can make sure mutations in zone A have no pointers back to zone B / C objects, such that you can do the marking phase correctly without consulting zone A?
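
                For concreteness, the shape of read barrier I mean is roughly the following (a schematic only; Azul actually does the check by trapping reads of protected pages rather than with an explicit branch):

                ```c
                #include <stddef.h>

                typedef struct Obj Obj;
                struct Obj {
                    Obj *forwardee;  /* non-NULL once the collector has moved us */
                    /* ... payload ... */
                };

                /* Every loaded reference passes through the barrier.  If the object
                 * has been relocated, follow the forwarding pointer and heal the
                 * original slot so later loads through it pay nothing extra. */
                static Obj *read_barrier(Obj **slot) {
                    Obj *o = *slot;
                    if (o != NULL && o->forwardee != NULL) {
                        o = o->forwardee;
                        *slot = o;  /* self-heal */
                    }
                    return o;
                }
                ```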

                1. 2

                  I cheat by not using pointers but handles instead. Each allocation is just given an increasing integer ID, which is “dereferenced” indirectly through the O(log32(N)) objtable. This should be a reasonably shallow data structure, but it is definitely not free, so it adds overhead to dereferences.
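
                  Concretely, a “dereference” amounts to something like this (a sketch; objtable_lookup() here stands in for the structmap-backed objtable):

                  ```c
                  #include <stdint.h>

                  typedef uint64_t obj_id_t;     /* monotonically increasing handle */
                  typedef uint64_t cb_offset_t;  /* offset into the CB/ring-buffer */

                  extern uint8_t    *cb_base;                       /* start of the CB */
                  extern cb_offset_t objtable_lookup(obj_id_t id);  /* the O(log32(N)) walk */

                  /* Dereferencing a handle is an objtable lookup plus an offset add,
                   * rather than a single raw pointer load. */
                  static void *deref(obj_id_t id) {
                      return cb_base + objtable_lookup(id);
                  }
                  ```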

                  1. 1

                    Yeah, by pointers, I meant handles. I guess whenever zone A tries to reference an object already moved to zone B, you copy it out of zone B back to zone A? Otherwise I cannot see how you avoid missing, during the marking phase, some objects referenced from zone A.

                    1. 1

                      Yeah, the vision is that all objects are either “small” (e.g. tuples under some arity [envisioned, but not yet present in this implementation]), in which case they are copied from B or C into A when they need to become mutable; or they are larger (e.g. maps/collections) but are persistent objects, mutated via path-copying of the path-to-mutated-leaf, where the copied path is written out in A and is similarly considered “small”.
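
                      The path-copying case looks roughly like this on a toy persistent binary trie (the real structures are wider; zone_a_copy() stands in for allocating in A):

                      ```c
                      #include <stdint.h>
                      #include <stdlib.h>

                      typedef struct Node {
                          struct Node *child[2];
                          uint64_t     value;
                      } Node;

                      /* Copy one node into zone A (malloc stands in for the A allocator). */
                      static Node *zone_a_copy(const Node *n) {
                          Node *c = malloc(sizeof(*c));
                          *c = *n;
                          return c;
                      }

                      /* Set key -> v by copying only the root-to-leaf path; every untouched
                       * subtree is shared with the old version.  Assumes a complete trie of
                       * the given depth, for brevity. */
                      static Node *assoc(const Node *root, uint64_t key, int depth, uint64_t v) {
                          Node *copy = zone_a_copy(root);
                          if (depth == 0) {
                              copy->value = v;
                          } else {
                              int bit = (int)((key >> (depth - 1)) & 1);
                              copy->child[bit] = assoc(root->child[bit], key, depth - 1, v);
                          }
                          return copy;
                      }
                      ```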

                2. 3

                  If you haven’t seen it already, you might find some of the work on Shenandoah interesting - specifically how they’ve optimized forwarding pointers/read barriers.

                  https://developers.redhat.com/blog/2019/06/27/shenandoah-gc-in-jdk-13-part-1-load-reference-barriers/

                  https://developers.redhat.com/blog/2019/06/28/shenandoah-gc-in-jdk-13-part-2-eliminating-the-forward-pointer-word/

            2. 2

              This looks interesting – a couple questions regarding dereference performance:

              • Whence the base (32) of the log? Is it something inherent to the design, or could it be, say, 16 or 64 instead? (And if so, with what other performance implications?)

              • Given O(log_{relatively-big-number}(n)), I’d expect that in practice the actual performance cost probably ends up dominated by a constant term – at minimal values of n, what’s the expected cost relative to a direct pointer dereference? (With or without whatever read/write barriers would be needed.)

              1. 3

                Thank you!

                First bullet point: This comes from the AMT implementation using a 32-way branching factor. (See STRUCTMAP_LEVEL_BITS in structmap.h.) It could use fewer or more and would still work, but the tradeoff comes from each structmap level taking up 32 (or 16, or 64) 64-bit cb_offset_t elements. So you need to strike a balance: a high branching factor keeps lookup paths shallow, while keeping the structmap levels “small” keeps their initialization cheap. I expect the ideal branching factor to be somewhere between 32 and 128, but maybe it’s actually much higher, or maybe there are better structures than the structmap for the objtable implementation.

                Second bullet point: Each ObjID “dereference” has to pointer-chase down log32(N) structmap levels until it resolves, where N is the highest ObjID allocated (this could be made more efficient, so that N is the number of ObjIDs which are live). So whereas a single raw pointer would be traversed one time, an ObjID lookup traverses an additional O(log32(N)) - 1 times.
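
                In stripped-down form, the walk is something like this (the real structmap_lookup() in structmap.h handles more cases, but this is the shape of it):

                ```c
                #include <stdint.h>

                #define LEVEL_BITS 5                      /* 32-way branching */
                #define LEVEL_MASK ((1u << LEVEL_BITS) - 1)

                typedef uint64_t cb_offset_t;
                extern uint8_t *cb_base;                  /* start of the CB */

                /* Each level is an array of 32 cb_offset_t slots living in the CB;
                 * one memory access per level, plus cheap shifts and masks. */
                static cb_offset_t lookup(cb_offset_t root, uint64_t id, int levels) {
                    cb_offset_t node = root;
                    for (int l = levels - 1; l >= 0; l--) {
                        const cb_offset_t *slots = (const cb_offset_t *)(cb_base + node);
                        node = slots[(id >> (l * LEVEL_BITS)) & LEVEL_MASK];
                    }
                    return node;  /* offset of the object's data in the CB */
                }
                ```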

                1. 1

                  On the second point, I meant more in a where-the-rubber-meets-the-road, non-big-O sense – I may be looking at completely the wrong thing, but is structmap_lookup() effectively the dereference operation? If so, it looks like there’s a potentially-significant amount of extra work in there (various arithmetic and bit manipulation, conditional branches, etc.) – I was wondering more about the overhead incurred by that kind of stuff (or whatever analogous code there is in the actual dereference path if that’s not it).

                  1. 1

                    Yeah, it depends on what you consider significant. As I coded this, I considered the bitwise operations essentially free, but the memory accesses to be expensive. This is a simplification, but I think it’s a good one because I expect the runtime to be dominated by stalls related to memory accesses as it chases the structmap’s internal pointers/offsets through the CB/ring-buffer.

              2. 1

                Something people may also be interested in: TLSF, a constant-time allocator. The paper and a couple of slide decks about it are available online.

                1. 1

                  I loaded it up and will take a look, thank you.