1. 25

  2. 9

    The blog post glaringly omits defer, which also has a runtime cost compared to a straight function call, but is far cheaper than finalizers. Finalizers are only appropriate when the objects are subject to complex lifetime graphs where you actually need the GC to figure out when they’re dead. For objects with simple lifetimes, just defer C.free_allocated(ptr).
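
    For example, a minimal cgo sketch of that pattern (assuming a preamble that pulls in <stdlib.h>; the post’s free_allocated would slot in the same way as C.free here):

    package main

    /*
    #include <stdlib.h>
    */
    import "C"

    // useCBuffer allocates C memory for the duration of one call and releases it
    // deterministically with defer; no finalizer and no GC involvement.
    func useCBuffer(n int) {
        ptr := C.malloc(C.size_t(n)) // cgo returns this as an unsafe.Pointer
        defer C.free(ptr)            // runs when useCBuffer returns
        // ... use the buffer through ptr ...
    }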

    1. 6

      Finalizers are only appropriate when the objects are subject to complex lifetime graphs where you actually need the GC to figure out when they’re dead.

      Finalizers aren’t guaranteed to be called even when the GC reaps the value, so they don’t actually help in this case, except as a best-effort optimization. Said another way, there’s no reliable way to know whether a value is “dead” or not! Annoying, but true.

      1. 12

        Finalizers ARE guaranteed to be run IF the GC reaps the value. What is not guaranteed is that any given dead value will ever actually be reaped.

        And that is absolutely fine. Go is doing the right thing here.

        If what you want is guaranteed destruction before program exit then what you do (and this is language-independent … I don’t actually use Go) is:

        • add the object to a weak (doesn’t cause the object to not be GCd) collection of objects to be destroyed at exit

        • add a finalizer to the object that removes itself from the at-exit collection, as well as any other actions

        • at exit, call the finalisers for any objects still in the collection

        NOTE if you do this then you have to be very careful because some of the objects may refer to each other, and there is no way in general to BOTH destroy only objects that no other object is referring to AND to make sure all objects are destroyed.
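
        A rough sketch of that registry, in Go since that is the language under discussion (the names are invented; the important detail is that the collection holds only cleanup closures keyed by an id, never the wrapper objects themselves, so it doesn’t keep them alive):

        package main

        import (
            "runtime"
            "sync"
        )

        // Hypothetical at-exit registry: ids map to cleanup closures. A closure
        // must capture only the underlying resource (e.g. a C pointer), not the
        // Go wrapper, or the wrapper could never become unreachable.
        var registry = struct {
            sync.Mutex
            next     int
            cleanups map[int]func()
        }{cleanups: map[int]func(){}}

        // track registers a cleanup for obj and attaches a finalizer that removes
        // the entry and runs the cleanup if the GC reaps obj first.
        func track(obj interface{}, cleanup func()) {
            registry.Lock()
            id := registry.next
            registry.next++
            registry.cleanups[id] = cleanup
            registry.Unlock()

            runtime.SetFinalizer(obj, func(interface{}) {
                registry.Lock()
                c, ok := registry.cleanups[id]
                delete(registry.cleanups, id)
                registry.Unlock()
                if ok {
                    c()
                }
            })
        }

        // runAtExit is called once, just before the program exits, and destroys
        // whatever the GC never got around to reaping.
        func runAtExit() {
            registry.Lock()
            defer registry.Unlock()
            for id, c := range registry.cleanups {
                delete(registry.cleanups, id)
                c()
            }
        }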

        Not running GC at program exit is the right choice, because in fact you would need to run GC repeatedly until a run in which no new objects were reaped.

        In fact it is often beneficial for short-lived programs to never GC at all.

        Try making a library with…

        void free(void *p){}
        
        ... and LD_PRELOAD it into a regular C program that uses malloc/free such as gcc. You will find that on a machine with a decent amount of RAM small compiles will actually go faster.
        

        Try doing the same thing with Boehm GC compiled in transparent malloc/free replacement mode, so that free() is a NOP (as above) and malloc() does GC. Virtually any standard C/C++ program you try it on will 1) still work fine, and 2) run faster.

        1. 5

          Finalizers ARE guaranteed to be run IF the GC reaps the value. What is not guaranteed is that any given dead value will ever actually be reaped.

          I think this is the best summary of what’s happening, thank you, and my apologies for confusing the issue earlier.

          If what you want is guaranteed destruction before program exit then … add the object to a weak (doesn’t cause the object to not be GCd) collection of objects to be destroyed at exit

          AFAIK there is no way to construct a collection with this property in Go, but if you know of a way I’m happy to be educated!

          1. 2

            Finalizers and weak maps are sort of different interfaces for equivalent capabilities. You can make a weak map using finalizers, but it’s a pain in the butt. See https://github.com/golang/go/issues/43615

          2. 2

            Virtually any standard C/C++ program you try it on will 1) still work fine, and 2) run faster.

            Frankly, I would be surprised to see that happen. GC can definitely beat malloc for interesting workloads, but although bdw is an excellent piece of engineering, it is at an inherent disadvantage compared to a proper gc, considering that it lacks precise layouts and cannot move. Historically, malloc implementations were quite bad, whereas bdw’s allocator was rather decent; nowadays, I think the mallocs have largely caught up.

            1. 1

              No doubt libc malloc implementations are a lot better now than when I first tried this experiment in the late 90s. From memory, the last time I did the experiment was 2017 when I was working on an LLVM-based compiler for OpenCL. For compiling typical OpenCL kernels it was definitely better to use a simple bump-pointer malloc(), disable free(), and free all the RAM in one go after compiling the kernel.

              In my experience, most objects that contain pointers are mostly pointers, and finding and parsing the relevant layout information takes longer than just assuming that everything might be a pointer. On anything where the heap is small compared to the address space (e.g. all 64 bit machines) a simple bounds check (size_t)(ptr-heapStart) < heapSize instantly eliminates virtually all non-pointers.
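
              As a sketch of that filter (hypothetical names; the unsigned subtraction wraps around for anything below heapStart, so a single compare rejects it):

              // mightPointIntoHeap reports whether word could be a pointer into
              // [heapStart, heapStart+heapSize). Values below heapStart wrap to a
              // huge unsigned result and fail the comparison.
              func mightPointIntoHeap(word, heapStart, heapSize uintptr) bool {
                  return word-heapStart < heapSize
              }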

              Being able to move / compact objects is an advantage, for sure, but C malloc() libraries can’t do that either, so BDWGC is not at a disadvantage relative to malloc().

              1. 1

                Mallocs typically use freelists that have good temporal locality: you can free an object, and soon after, when you allocate another one, you can reuse the memory from the first, while it is still hot in cache. Good tracing gcs generally can’t do this, but make up for it with good spatial locality, which mallocs can’t typically provide. Bdw provides neither.

                Bumping and never freeing is good if most objects live long, but this is unlikely to be the case for graph mungers such as compilers, so I find your result somewhat surprising. See e.g. shipilev on hotspot’s version of this. Are GPU kernels usually less heavily optimised than typical CPU code? Not my area of expertise here.

                most objects that contain pointers are mostly pointers, and finding and parsing the relevant layout information takes longer than just assuming that everything might be a pointer

                Conservative has its own issues—in particular, you have to identify where objects start, whereas you probably would not otherwise have needed to support interior pointers. And there are middle grounds. Hotspot will partition all the fields of a class into those that are pointers and those that are immediate, placing all the pointers contiguously, so your ‘layout information’ is just the offset where the pointer fields start. (And note you need some layout information anyway to find out how big the object is, so this is not a huge imposition.) Implementations of ocaml and lisp use tagged pointers, where the low bits of each word are a tag, which tells whether the high bits are immediate or a memory address, so this information can be gleaned latently (in lisp’s case, obligatorily, since it is dynamically typed). (This is why ocaml uses 63-bit or 31-bit integers—a language deficiency; they should have transparently upgraded to big integers, like lisp.)
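
                For example, a toy version of that tagging scheme (illustrative only, not any particular runtime’s layout): one tag bit in the low position, 1 meaning the rest of the word is an immediate integer, 0 meaning the word is an aligned pointer.

                func isImmediate(word uintptr) bool     { return word&1 == 1 }
                func immediateValue(word uintptr) int64 { return int64(word) >> 1 } // arithmetic shift restores the sign
                func tagImmediate(v int64) uintptr      { return uintptr(v<<1) | 1 }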

                Also note that tracing tends to be bound by latency/memory parallelism, so there are some cycles to burn parsing layouts; obviously you’d rather not have to store them, but.

                1. 1

                  Mallocs typically use freelists that have good temporal locality: you can free an object, and soon after, when you allocate another one, you can reuse the memory from the first, while it is still hot in cache.

                  Modern mallocs try very hard not to do this because it makes it very easy for attackers to turn a use-after-free bug into a stable exploit. In snmalloc, we treat the free lists as a queue and add some randomisation so that it’s hard for attackers to predict where a freed object will be reallocated. Other modern allocators have similar mitigations (though with variations to fit better with their internal data structures).

        2. 3

          The overhead of most defer calls is significantly better after the implementation of open-coded defers in https://golang.org/cl/202340, but I’m not certain if this particular call benefits.

          1. 1

            Came here to say this. 🍻

            1. 1

              Here are my results from hacking in an allocate-defer benchmark to his code:

              goos: linux
              goarch: amd64
              pkg: example/cgotest
              cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
              BenchmarkAddition-8            	19435881	        52.43 ns/op
              BenchmarkAllocate-8            	10378263	       110.5 ns/op
              BenchmarkAllocateDefer-8       	10262841	       114.2 ns/op
              BenchmarkAllocateFree-8        	20076832	        60.31 ns/op
              BenchmarkAllocateAuto-8        	 1000000	      1094 ns/op
              BenchmarkAllocateAutoDummy-8   	 1000000	      1142 ns/op
              BenchmarkAdditionGo-8          	1000000000	         0.2452 ns/op
              PASS
              ok  	example/cgotest	7.603s
              

              It seems to be way better than finalizers, but defer is still 3-4% slower than calling it manually. Not a huge loss, but it’s noticeable with this small workload.

            2. 5

              He added a comment to the post saying the overhead is due to “a sanitizer”, and says he updated the post, but I don’t see anything about sanitizers in the post.

              I would expect the overhead of setting a finalizer to come from adding an entry to a global hash table in the GC; not trivial but less than what the OP found.

              1. 3

                Since it was still expensive even with the GC turned off, does that imply that the cost is in calling runtime.SetFinalizer? In either case, why does the finalizer version spend more time in runtime.cgocall than the non-finalizer version? I would think the cgocall stuff is essentially the same (the call to C.free)?

                1. 4

                  A dominant part of the cost does seem to be calling runtime.SetFinalizer. I modified his benchmark code to create a new dummy function that called SetFinalizer without actually doing any cgo malloc() and free(), and it had about the same time overhead as the version with finalizers that did make cgo calls. Profiling that version shows that over 50% of the time is spent in SetFinalizer() and things it calls (primarily runtime.addspecial()). About 14% goes to allocating objects (mostly due to it triggering GC) and about 14% goes to background GC; total GC seems to be about 27%, and a lot of that seems to go to running finalizers (probably around 21% of the time), even though these finalizers don’t do anything. A surprisingly large amount of time goes to locking and unlocking things in the process of all of this, amounting to about 20% of the total runtime. This time breakdown is obviously a worst case, since this code does nothing but make simple objects, set finalizers on them, and throw them away.
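
                  Roughly, the dummy function looked like this (a reconstruction rather than the exact code; the finalizer is a no-op and no cgo is involved):

                  package cgotest

                  import (
                      "runtime"
                      "testing"
                  )

                  // BenchmarkSetFinalizerOnly allocates a small Go object and attaches a
                  // no-op finalizer, so the measured cost is SetFinalizer bookkeeping plus
                  // whatever GC work that bookkeeping triggers.
                  func BenchmarkSetFinalizerOnly(b *testing.B) {
                      for i := 0; i < b.N; i++ {
                          p := new(int64)
                          runtime.SetFinalizer(p, func(*int64) {})
                      }
                  }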

                  1. 3

                    70% of the time is in runtime.cgocall, which a quick search suggests is the trampoline for calling C. I wouldn’t be surprised if the profiling stops at the language boundary, so it’s likely that 70% is spent in the C code. Which makes the 10x overhead somewhat dubious and possibly caused by some measurement error.

                    Finalisers are probably stored in a hash table with a flag in the object header indicating that one exists (a language like Java can just make them methods because you can’t dynamically change whether a type has a finaliser), but even the cost of inserting into a hash table shouldn’t be that high, unless it’s the cost of initialising the hash table because it’s the very first one. Either way, I strongly suspect that there’s some other factor at play for this overhead.

                  2. 3

                    Reasonably, Go programmers are likely to insist that the memory allocated from Go be released automatically. Thankfully, Go has a mechanism for this purpose. You put the C data in a Go structure which is subject to automatic garbage collection. Then you tie the instances of this structure to a “finalizer” function which will be called before the Go structures is garbage collected.

                    The author misunderstands finalizers: they’re not guaranteed to be called before a value is GC’d.

                    The finalizer is scheduled to run at some arbitrary time after the program can no longer reach the object to which obj points. There is no guarantee that finalizers will run before a program exits

                    https://pkg.go.dev/runtime#SetFinalizer

                    1. 7

                      A finalizer is guaranteed to be called before its object is GC’d, if the object is GC’d. What that statement means is that the process may exit before GC runs or before GC collects that object. (Otherwise the process would be required to run a full collection before calling exit, which is pointless for anything but finalizers.)

                      Quite often this doesn’t matter because the resource being cleaned up by the finalizer is already cleaned up by the process exit, such as a heap block or file handle.

                      1. 7

                        The key point, which is largely omitted from the docs and is why they are slightly confusing, is that object deallocation is not observable in the abstract machine. Once an object is unreachable you have no way of referring to it. If it does not have a finaliser then, at some point, as an implementation detail, the runtime may reuse its memory. Because this operation is invisible in the abstract machine (by design, it simplifies a lot of things), it doesn’t make sense to say that a finaliser runs between the object becoming unreachable and being deallocated: the second event is not observable and is not guaranteed to happen (it’s valid to just leak unreachable objects as long as doing so doesn’t crash the program, and that’s the best strategy for a lot of short-lived programs).

                        From an implementation perspective, the finaliser is called before the object is deallocated by the GC, but the docs try to avoid leaking the implementation into the language definition.

                        1. 3

                          Agreed. I think the only guarantee is that

                          • the finalizer is called at most once [?? might be untrue if the finalizer resurrects the object]
                          • when it is called the object’s contents are still valid
                          • but the object is unreachable by other objects
                        2. 3

                          As far as I understand, there is no guarantee that any finalizer(s) attached to a value will be executed before that value is GC’d by the runtime. I’m happy to be corrected, if you can point me to something that suggests otherwise!

                          edit: The ultimate tl;dr here is that finalizers != destructors

                          edit 2: I think I might be confusing the discussion. A finalizer will indeed be executed before its corresponding value is GC’d (in a certain sense) — but the point I’m trying to make (very ineffectively) is that there is no guarantee as to when a value is actually GC’d in that sense, and so applications can’t treat finalizers as anything other than best-effort optimizations.

                          1. 2

                            I mean, we can look up how this stuff works.

                            1. 2

                              Yes, and my summary is based on my parsing of that code. (Which may be mistaken!)

                              1. 1

                                Unless you suspect that the documentation is flat out wrong, I’m not sure why you’d read the code instead? :)

                                The author misunderstands finalizers: they’re not guaranteed to be called before a value is GC’d.

                                I’m finding the use of “guaranteed” to be difficult here. I think it is guaranteed, with a caveat. If the finalizer goroutine completes, then the finalizer did run before the memory was freed by the GC. (same source, first paragraph)

                        3. 3

                          There is no guarantee that finalizers will run before a program exits

                          That seems like a disclaimer for a different case: exit/abort/crash doesn’t run them, but that doesn’t mean they won’t (eventually) run in a continuously running program.

                          Even in languages with deterministic destruction (C++, Rust) destructors aren’t guaranteed to run before program exits, because anything can call exit or quit the main thread without giving other threads a chance to unwind.

                          1. 1

                            My understanding is that when a finalizer is executed is fundamentally non-deterministic. It is guaranteed to not execute before a value is GC’d, but it is not guaranteed to execute after a value is GC’d.

                            edit: I am not expressing my point effectively; see this comment.

                            1. 2

                              This is not how I understand the docs. The logic is that if a value is GCed, it will definitely run the finalizer associated with it. There are just many caveats/quirks/footguns that may indefinitely delay or prevent a value from being GCed.

                          2. 1

                            This sounds a bit like how “drop” isn’t guaranteed to run in Rust, but you can, except where safety invariants are at stake, pretty much assume that it does. GC languages tend to have even more issues with this because when a value is freed is not deterministic and, for any implementation, there may be a number of reasons why a value may never be collected.

                            1. 3

                              I’m surprised that drop isn’t guaranteed to run, I always assumed that types that implement Drop are treated as linear. Do you happen to know of any situations where it might not run?

                              1. 2

                                If there’s a cycle, or mem::forget.

                                https://cglab.ca/~abeinges/blah/everyone-peaches/

                                This whole “you can’t assume drop will run to uphold safety invariants” came to a head with the leakpocalypse and the “Pre Pooping Your Pants” post (now Pre Pitting Your Peaches).

                          3. 1

                            If there is not already, can there be a named law of discussion along the lines of

                            On any article about destructors (or rough equivalents going by other names), the majority of comments will be users fruitlessly arguing over exactly when and whether destructors run rather than engaging with the article.