

    Garbage collection: the never-ending source of performance problems.

    Or, for the optimist: garbage collection: never-ending opportunities for optimization.


      It’s not really the garbage collector’s fault in this case. Making the same mistake in C++ or Rust—failing to reuse a vector—would have stressed the allocator instead. Memory use wouldn’t have ballooned like it did here in Go, but it still would have burned lots of CPU.

      Allocating fresh objects in a hot path isn’t wise in any language.
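      For concreteness, here's roughly what that mistake looks like in Go (the names and workload are my own invention, not from the article); the fix is just hoisting the buffer out of the loop:

      ```go
      package main

      import "fmt"

      // sumFresh allocates a new scratch slice on every iteration -- each
      // pass creates garbage that the GC must eventually collect.
      func sumFresh(rows [][]int) int {
          total := 0
          for _, row := range rows {
              scratch := make([]int, 0, len(row)) // fresh allocation per iteration
              for _, v := range row {
                  scratch = append(scratch, v*v)
              }
              for _, v := range scratch {
                  total += v
              }
          }
          return total
      }

      // sumReused hoists the buffer out of the loop and truncates it with
      // scratch[:0], so the backing array is reused across iterations.
      func sumReused(rows [][]int) int {
          total := 0
          var scratch []int
          for _, row := range rows {
              scratch = scratch[:0] // reuse the backing array, just reset length
              for _, v := range row {
                  scratch = append(scratch, v*v)
              }
              for _, v := range scratch {
                  total += v
              }
          }
          return total
      }

      func main() {
          rows := [][]int{{1, 2}, {3, 4}}
          fmt.Println(sumFresh(rows), sumReused(rows)) // both compute 1+4+9+16 = 30
      }
      ```

      Either version is correct; the difference is only in how much garbage each iteration produces.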


        In C++ or Rust, I doubt the allocation time would balloon from 8s to 151s. There still would have been room for improvement through reuse, but making that optimization would not have been as critical.


          Do you have any reasoning to back that up? Because my intuition and experience says the opposite. This is a throughput problem, and garbage collected systems generally have higher throughput.


            Just basing it on his comment that the GC, in particular, was eating up CPU. If it was allocation overhead it presumably would not show up as GC eating up the CPU.

            Also, what do you mean by “garbage collected systems generally have higher throughput”? And by the claim that it’s a throughput problem? It didn’t appear to be memory bandwidth limited. I’m not sure what throughput you’re talking about.


              In a garbage collected system, what handles deallocation?


                The garbage collector, of course, but garbage collection is much more expensive than deallocation in the absence of a GC.


                  And what makes you think a GC is slower?


                    Not necessarily slower, since marking can be done in another thread, but necessarily more expensive, because there’s more work to do. In this case it seems likely to be slower too, since optimizing away the GC reduced the runtime from 155s to 8s. I understand you’re saying this could be mostly due to the allocator, but that seems unlikely to me. I can’t prove it without dissecting it, but GC takes more CPU, and this was apparently CPU bound.

                    The only way I could see it being just as bad without GC is if the bottleneck is zeroing the arrays on allocation, or something. Maybe that’s indeed what’s going on, I don’t know.


                      So, I’ve been hinting that this assumption is wrong:

                      “garbage collection is much more expensive than deallocation in the absence of a GC”

                      I admit I’ve been hinting rather than explaining largely because I’ve been on 2 international flights today and I haven’t opened my laptop.

                      Garbage collection is not more CPU-expensive than a manually managed allocator. Which is usually more expensive: multiple individual transactions, or one large batch transaction? The individual transactions, of course. That’s what makes GC efficient: it batches deallocations into one massive operation. This is good for throughput, but bad for peak latency and peak memory usage.

                      “If it was allocation overhead it presumably would not show up as GC eating up the CPU”

                      In a GC environment, yes it does. In a GC environment, GC is allocation overhead. When you see heavy GC load, that could mean two things:

                      1. you’re allocating a ton of objects that the GC must then deallocate (overhead)
                      2. you’re running low on memory, causing the GC to run too often

                      In a non-GC environment, allocation overhead would instead show up as calls to malloc and free, or their equivalents. All allocations are accounted for by the memory management system, whether it’s manually managed or a GC.

                      For whatever reason, people assume malloc and free are fast. They aren’t. They go through a lot of accounting logic to keep track of what memory is in use. They update data structures, they take locks, they do all sorts of stuff to make sure that EVERY caller of malloc or free sees a consistent snapshot of managed memory. A GC pause avoids all of that overhead.
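                      The accounting point can be made concrete with Go’s testing.AllocsPerRun, which reports heap allocations per call (this benchmark is my own sketch, not from the thread):

                      ```go
                      package main

                      import (
                          "fmt"
                          "testing"
                      )

                      var sink []byte // global sink so the compiler can't optimize the work away

                      // allocsFresh measures allocations per run when a new buffer is made each time.
                      func allocsFresh() float64 {
                          return testing.AllocsPerRun(100, func() {
                              sink = make([]byte, 1024) // fresh 1 KiB object per call -> garbage
                          })
                      }

                      // allocsReused measures allocations per run when one buffer is reused.
                      func allocsReused() float64 {
                          buf := make([]byte, 0, 1024) // allocated once, outside the measured runs
                          return testing.AllocsPerRun(100, func() {
                              buf = buf[:0] // keep the backing array, just reset the length
                              buf = append(buf, 1)
                              sink = buf
                          })
                      }

                      func main() {
                          fmt.Printf("fresh: %v allocs/run, reused: %v allocs/run\n",
                              allocsFresh(), allocsReused())
                      }
                      ```

                      Every one of those fresh allocations is work for the memory management system, whether that system is a GC or malloc/free.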

                      And now I’m literally starting to pass out at my laptop, so if this explanation is incomplete, I’ll have to sort that out in the morning.


                        I think I understand the point you’re trying to get across, but any half-way decent allocator would push the array’s heap memory onto a thread-local free list and then almost immediately pop it for the next loop. That certainly isn’t as free as reusing the array (especially if it has to zero the memory), but it’s much cheaper than tracing every reachable object on the heap and then deallocating anything that isn’t reachable. Another reason the GC is almost certainly much more expensive is that it doesn’t run deterministically every time an array is freed, so it’s very likely to generate (and eventually free) a lot more garbage memory rather than reusing the memory from the last loop.
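                        For what it’s worth, Go’s sync.Pool behaves a lot like that thread-local free list: Put pushes a buffer onto a per-P (roughly per-thread) list, and Get pops one for the next iteration instead of letting it become garbage. A sketch (the pool API is real Go; the workload is made up):

                        ```go
                        package main

                        import (
                            "fmt"
                            "sync"
                        )

                        // bufPool hands out 1 KiB scratch buffers.
                        var bufPool = sync.Pool{
                            New: func() any { return make([]byte, 0, 1024) },
                        }

                        func process(n int) int {
                            buf := bufPool.Get().([]byte)[:0] // pop from the free list (or allocate)
                            for i := 0; i < n; i++ {
                                buf = append(buf, byte(i))
                            }
                            sum := 0
                            for _, b := range buf {
                                sum += int(b)
                            }
                            bufPool.Put(buf[:0]) // push back for the next caller
                            return sum
                        }

                        func main() {
                            fmt.Println(process(4)) // 0+1+2+3 = 6
                        }
                        ```

                        Note that sync.Pool is itself GC-aware — pooled buffers can still be collected between GC cycles — so it approximates a free list rather than guaranteeing reuse.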


                          You’re correct about the thread-local pool, but I nevertheless believe your intuition that it wouldn’t be a critical optimization is wrong. I’ve seen this exact situation before in C++: profiling showed lots of time in vector resize, and the optimization was critical.

                          GC has to scan large sections of the heap, but when most of the heap is garbage it’s tremendously efficient.

                          Your criticism about memory efficiency is spot on though. GC isn’t memory efficient, especially in this situation. It prefers to run when it knows it will be collecting lots of garbage, so by nature GC systems have a lot more memory overhead, and need that overhead to stay CPU efficient. This is why iPhones famously have significantly less memory than Android phones, yet still manage to be smoother and snappier.

                          There are substantial negatives to GC, like memory overhead and peak latency. But throughput isn’t one of them. If your GC is working extremely hard and it isn’t memory constrained, it’s your fault for allocating too many garbage objects, which is a problem in both GC and non-GC systems.


                            I think we’re on the same page. I am perfectly happy to concede that reusing an array can be a critical optimization in C++/Rust. The claim I was trying to make is that such an optimization is almost certainly more critical in the presence of GC.

                            Anyways, it’s been an interesting discussion, so thank you for that.