I don’t disagree with the general point that GC can be good for performance. Interesting tweet I saw from Andreas Kling:
https://twitter.com/awesomekling/status/1767236181010350388
The older I get, the more attracted I feel towards garbage-collected languages. 👴💻
I suspect the underlying issue is that once you start working on ~1M lines of C++ – I believe SerenityOS + LadyBird is somewhere in that ballpark – you no longer believe in or desire manual memory management! (Linux kernel is somewhere in that ballpark too, with C code)
Personally my threshold would be more like 20K or 100K lines of C/C++, but it’s the same phenomenon. When working on large C++ programs I do get the sense of “writing the program twice” – the application, and the memory management.
They have to be maintained in sync to an extent. Memory management can “warp” algorithms and data structures too. (I guess the degree to which people fight the Rust borrow checker with their idioms could be an indication of this)
Now that I’ve worked on a garbage collector, I see a sweet spot in languages like Go and C# – they have both value types deallocated on the stack and GC. Both Java and Python lack this semantic, so the GCs have to do more work, and the programmer has less control. OCaml lacked this until very recently, and I think Dart might lack it too.
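To make that concrete, here’s the layout difference in C++ terms (a sketch only; Point is a made-up type) – a collection of values is one contiguous allocation, while the boxed version is a separate heap object per element, which is exactly the extra work a GC or allocator has to track:

```cpp
// Illustration in C++ rather than Go/C#, but the layout point is the same:
// an array of plain values is one contiguous block, while an array of
// heap-allocated objects is N+1 allocations plus a pointer chase per element.
#include <memory>
#include <vector>

struct Point { double x, y; };  // a "value type": no header, no identity

double SumValues(const std::vector<Point>& pts) {
  double s = 0;
  for (const Point& p : pts) s += p.x + p.y;   // sequential, cache-friendly
  return s;
}

double SumBoxed(const std::vector<std::unique_ptr<Point>>& pts) {
  double s = 0;
  for (const auto& p : pts) s += p->x + p->y;  // one pointer chase per element,
  return s;                                    // and each Point was its own allocation
}
```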
I also have more of an intuition for how slow malloc() and free() are, though neither idiomatic C nor C++ calls malloc() and free() a lot.
This was a great explanation of RCU too. Although I would quibble with a few statements:
At this point, some folks fire back with non-arguments about how this isn’t “real” garbage collection. Like, uh, because you manually mark the garbage! I’m not here to argue taxonomy—whatever you want to call it, RCU has the same shape as GC: memory is cleaned up eventually, based on whether it’s still in use
It’s still possible to argue that it’s not “real” or fully general GC based on the shape of the object graph.
Tracing / marking is often an expensive part of GC. And I don’t see that there is ANY tracing here. Example use case:
Say we have data that is read constantly but written rarely—something like the set of USB devices currently plugged in. In computer years this set changes once a millennium, but it can change.
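As a rough sketch of that read-mostly shape in C++ (an analogy, not real RCU – reference counting stands in for the grace period, and UsbDevice and these helper names are made up):

```cpp
// Readers grab a snapshot; the writer swaps in a new one. The old snapshot is
// reclaimed when the last reader drops it -- "cleaned up eventually, based on
// whether it's still in use", which is the GC-like shape being described.
#include <atomic>
#include <memory>
#include <string>
#include <vector>

struct UsbDevice { std::string name; };

using Snapshot = std::vector<UsbDevice>;
std::shared_ptr<const Snapshot> g_devices = std::make_shared<const Snapshot>();

std::shared_ptr<const Snapshot> ReadDevices() {
  // C++11 shared_ptr atomic free functions (C++20 has std::atomic<std::shared_ptr>);
  // some standard libraries implement these with an internal lock.
  return std::atomic_load(&g_devices);
}

void PublishDevices(Snapshot next) {            // rare write path
  auto fresh = std::make_shared<const Snapshot>(std::move(next));
  std::atomic_store(&g_devices, fresh);         // readers still holding the old
}                                               // snapshot keep it alive until done
```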
There are many considerations in the shape of the object graph (see the sketch after this list):
Do the objects contain ANY pointers at all? (Are they leaf objects?)
Are there pointers from the old generation to the new generation?
Are there pointers from the new generation to the old generation?
Are there pointers to manually managed objects?
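A toy mark function shows why the first question matters – leaf objects end the trace immediately, while pointer-rich objects keep it going (hypothetical layout, not any real collector):

```cpp
// Toy mark phase: the graph's shape decides the cost. Leaf objects -- no
// outgoing pointers -- are marked and the walk stops there.
#include <vector>

struct Obj {
  bool marked = false;
  std::vector<Obj*> fields;   // empty for "leaf" objects like strings or ints
};

void Mark(Obj* obj) {
  if (obj == nullptr || obj->marked) return;
  obj->marked = true;
  for (Obj* child : obj->fields) {   // leaf objects never reach this loop body
    Mark(child);
  }
}
```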
In a recent thread it was pointed out that Erlang has no pointers from old -> new because objects are immutable, so generational GC in Erlang is significantly easier/faster.
Modern garbage collection offers optimizations that alternatives can not. A moving, generational GC periodically recompacts the heap.
But there is a tradeoff here – moving GC requires precise rooting, and I found that rooting can be as expensive as marking AND sweeping (depending on GC and workload obviously)! It’s generally hidden because it doesn’t contribute to pauses, but it slows your mutator throughput.
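To make “precise rooting” concrete, here’s a hypothetical sketch (not the actual Oils API) of the bookkeeping a moving collector imposes on every function that holds heap pointers in locals:

```cpp
// Hypothetical sketch of precise rooting: with a moving collector, every local
// that can hold a heap pointer must be registered so the GC can find it and
// rewrite it if the object moves.
#include <vector>

struct Obj { int payload = 0; };

static std::vector<Obj**> g_roots;          // the root set (single-threaded sketch)

struct StackRoot {
  explicit StackRoot(Obj** slot) { g_roots.push_back(slot); }
  ~StackRoot() { g_roots.pop_back(); }      // LIFO matches the call stack
};

Obj* Alloc() { return new Obj(); }          // stand-in; a real Alloc() may collect and move objects

void Example() {
  Obj* a = Alloc();
  StackRoot ra(&a);      // without this, a collection triggered by the next
  Obj* b = Alloc();      // Alloc() could move the object and leave `a` dangling
  StackRoot rb(&b);
  a->payload += b->payload;
}                        // this push/pop bookkeeping runs on every call -- it never
                         // shows up as a pause, but it taxes mutator throughput
```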
free() is not free. A general-purpose memory allocator has to maintain lots of internal, global state. What pages have we gotten from the kernel?
I definitely saw this on the Oils garbage collector.
On workloads that allocate lots of tiny objects, glibc free() was the slowest thing. It was slower than marking, or sweeping. It does a lot of non-trivial and non-deterministic work. It feels like its own little language runtime.
The fix was to use a pool allocator with faster deallocation, which could speed up certain workloads by 40+% overall!
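The fix had roughly this shape – a minimal fixed-size pool sketch (not the actual Oils allocator), where freeing is a single pointer push instead of glibc’s bookkeeping:

```cpp
// Minimal fixed-size pool: Free() is O(1) and touches no global allocator state.
#include <cstddef>
#include <new>

template <size_t kObjSize, size_t kNumObjs>
class Pool {
  static_assert(kObjSize >= sizeof(void*), "objects must fit a free-list pointer");

 public:
  Pool() {
    for (size_t i = 0; i < kNumObjs; ++i) {   // thread all cells onto the free list
      Push(cells_ + i * kObjSize);
    }
  }
  void* Allocate() {
    if (free_list_ == nullptr) return ::operator new(kObjSize);  // overflow: fall back
    void* p = free_list_;
    free_list_ = free_list_->next;
    return p;
  }
  void Free(void* p) {
    char* c = static_cast<char*>(p);
    if (c < cells_ || c >= cells_ + sizeof(cells_)) {  // not ours: fall back
      ::operator delete(p);
      return;
    }
    Push(p);                                   // O(1), no size classes, no coalescing
  }

 private:
  struct Cell { Cell* next; };
  void Push(void* p) {
    Cell* c = static_cast<Cell*>(p);
    c->next = free_list_;
    free_list_ = c;
  }
  alignas(std::max_align_t) char cells_[kObjSize * kNumObjs];
  Cell* free_list_ = nullptr;
};

Pool<24, 4096> g_pool24;   // e.g. a pool for the common 24-byte objects
```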
So actually I think one of the main things I learned was it’s not just marking and sweeping that are slow. It’s also rooting and the underlying allocator. (Many/most GCs have their own allocators, but we started out with malloc() – which means ASAN works without any change, which turned out to be essential for correctness)
But yes I definitely find it fascinating to compare GC and manual memory management. To a certain extent, large programs with manual memory management are going to “invent” all sorts of strategies for GC-like things.
However I would not go as far as to say they’re equivalent to GC. They resemble GC, for sure.
I think there is probably some room for a little more taxonomy, to clear things up, similar to the well known equivalence between GC and ref counting.
As henry baker famously put it: stack allocation is an implementation issue, not a language issue. I would refine to: stack allocation is a scheduling problem, not a specification problem.
I don’t agree with that … Consider recent changes to OCaml – there are new allocation “modes”, which are present in the source language: https://blog.janestreet.com/oxidizing-ocaml-locality/
Similarly with Java - https://openjdk.org/projects/valhalla/
e.g. one issue is the Java language actually constrains value types / stack allocation, because object IDs are exposed. The language has to be changed.
If what you said were true, then neither OCaml nor Java would have to change the language in order to be more efficient. Certainly they have both done decades of big optimizations without changing the source language.
But they are changing the language.
I have been having this argument with @tekknolagi as well. In Oils we will probably add a StackArray[T] to our metalanguage. It will make some things much faster by getting rid of an utter boatload of garbage.
Besides the empirical evidence, philosophically I’d say that performance is a feature, and it should be present in the source language where it matters, so that programmers and maintainers can reason about it.
On the other hand, the language shouldn’t burden you with irrelevant performance details. So there’s a tricky balance that’s application dependent – you can’t make any absolute statements
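Purely as speculation (the real mycpp design may differ), a StackArray[T] could lower to something like a fixed-capacity array living in the C++ stack frame, so a hot loop’s scratch list never touches the GC heap:

```cpp
// Speculative sketch only -- not the actual mycpp/Oils design.
#include <cassert>

template <typename T, int kCapacity>
class StackArray {
 public:
  void append(const T& item) {
    assert(len_ < kCapacity);      // the metalanguage would have to enforce a bound
    items_[len_++] = item;
  }
  T& operator[](int i) { return items_[i]; }
  int size() const { return len_; }

 private:
  T items_[kCapacity];             // storage is part of the enclosing stack frame
  int len_ = 0;
};

int SumWordLengths(const int* lengths, int n) {
  StackArray<int, 64> parts;       // no allocation, no garbage for the collector
  for (int i = 0; i < n && i < 64; ++i) parts.append(lengths[i]);
  int total = 0;
  for (int i = 0; i < parts.size(); ++i) total += parts[i];
  return total;
}
```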
I’m not talking about how people are designing things; I’m talking about how they should design things, so those examples are neither here nor there.
Take halide and sql as examples of environments that effectively distinguish specification from schedule to at least some degree. Now imagine taking that several steps further, and solving the various problems that those environments have (e.g. query plans in sql can be brittle and opaque).
Again, it’s application dependent. I also wouldn’t blithely dismiss what OCaml, Java, C#, and Go do (the latter two also expose the difference in the language) without evidence, or even an argument.
Probably you haven’t really hit those use cases, which is fine. But “how they should design things” is honestly silly and naive.
If you think so, then you have misunderstood the nature of my complaint. It is not about workloads; it is entirely about separation of concern.
This sounds like one of Perlis’ “attention to the irrelevant” kinds of situations. Fundamentally, the stack and the heap are merely two different ways of interacting with a memory controller using relatively short opcodes; the spilled values are stored in roughly the same place with roughly the same memory model. There’s nothing wrong with giving that sort of “low level” control to the programmer, but it shouldn’t be confused for the inherent complexity of programming a Von Neumann machine.
@moonchild is probably referring to Henry Baker’s Cheney on the MTA paper where he describes a garbage collected runtime where there is no difference between allocating on the stack and allocating on the heap. Bumping the stack pointer is how heap allocation is performed.
Jane Street can’t create a radically different runtime for Ocaml because there’s a lot of code that depends on the existing runtime structures. There is C code that uses the <caml/mlvalues.h> API for implementing Ocaml primitives using C code. There are interfaces like Obj.magic that let you directly introspect the internal representation of heap memory. Since they can’t break that existing code, the only way forward is to layer more complexity on top of the existing language design in a way that preserves backward compatibility. So by default code uses the old runtime representation, but there is a new keyword local that explicitly opts in to the new more efficient runtime representation. Yuck.
Hm that seems like just the opposite – it’s turning the stack into garbage that needs to be tracked, which is bad:
In our Scheme, the entire C “stack” is effectively the youngest generation in a generational garbage collector!
The OCaml work is very clear about its motivation, e.g. avoiding the minor heap, among other things:
Every minor heap allocation brings us closer to the next minor collection cycle. A minor collection incurs some fixed overhead, but more importantly, frequent collection causes more values to be moved to the major heap. Promoted values become much costlier to collect later on.
The idea with OCaml, Go, C#, Java, etc. is to put more objects on the stack so they don’t touch the collector at all. They’re pushing to get into the C++/Rust performance realm.
I agree that such value types aren’t necessary for 99% of the code in a program, but applying it to the 1% of the code that’s a bottleneck can give you 50%-100% speed increase you can’t get any other way. You do kinda need it for many workloads
I also don’t think Lisp is very relevant here, because it’s “the wrong order of magnitude”. The OP is about writing kernels, and I mentioned writing interpreters.
I put Lisp/JS/Python together in performance (“tiny object soup” if you want to be pejorative). And then C++/Rust together (the “LLVM semantics” – not object soup).
There used to be a 10x performance gap separating them – I believe on modern hardware it’s more like 20x-50x, due to cache / memory layout / etc. (and no I don’t believe anyone’s fibonacci microbenchmarks – not at all realistic for kernels or interpreters)
e.g. kernels or interpreters in Lisp/JS/Python are “not really a thing”, because those languages can’t express fast enough code.
Our mycpp translator gets 50x+ speedups by using more C++ semantics than Python semantics, which is why it required significant code rewrites. Oils started out as a dynamic program, but it’s now a static program with compile-time metaprogramming.
Related: Lisp runtimes aren’t a good target of metaprogramming. I think something closer to Clasp (Common Lisp metaprogramming LLVM) or Carp is what you would need to get in the “right order of magnitude” for writing kernels or interpreters.
Oils is more similar to those Lispy projects – it’s Python metaprogramming a C++ IR/runtime. And it’s AOT and predictable, not JIT.
Also related – if Python/JS/Lisp were fast enough, people wouldn’t be rewriting lots of tools by hand in Go and Rust: https://news.ycombinator.com/item?id=35045520
You have to switch languages to get in the right order of magnitude. Tiny object soup and JIT isn’t good enough. It’s slow AND it uses a lot of memory.
We did it a very unusual way, but it’s essentially the same thing – we switched to C++ semantics.
Clasp is a bit hellish to use, so I’m told by one of its developers, and I believe it’s not particularly fast, compared with sbcl (which is not particularly fast either, but it does all right)—the point was interoperability. And llvm is a crap base to build on anyway.
I was gonna do a demo of gemm in sbcl matching what you can do with c+intrinsics (still losing to assembly, because no duh we are still working with crap compilers on both sides), just to dispel this silly myth, but got bored halfway through writing it; too bad. Still, in the domain of silly benchmarks, we have a web server, a concurrent hash table, a regex engine, and this slightly questionable paper.
(Ed: I should say that my claims about common lisp’s performance here are not connected to my comments elsewhere about separation of concern. Existing common lisp environments do not solve these problems satisfactorily, but that does not inhibit their ability to play in the same performance class as c++, given the same sort of tuning as is required by high-performance c++ code.)
JIT
‘Jit’ is not ‘the right thing’—hence the problems everybody talks about w.r.t. it—but it is much closer to the right thing than ‘aot’ (I don’t think these are very useful words, and they’re certainly not descriptive; I use ‘jit’ to mean a particular lineage of designs descended from self implementations and including v8 and hotspot, and ‘aot’ to mean a particular lineage of designs descended from old fortran implementations and including llvm and gcc, which I think matches what people actually mean when they use the terms, even if they don’t realise it). And c++ performance can be brittle too. This problem can be solved systematically. But what you need for that is proper support for refinement and scheduling, not metaprogramming.
50x … Python
I don’t think the performance characteristics of python implementations demonstrate anything in particular. I mean, it’s taken them how many decades to remove the gil??
My perspective is that I would rather write simple high level code, where the code only reflects my high level semantic concerns. I don’t want to spend most of my mental energy on performance related type annotations and the like. I don’t want to mentally track, for every single expression in the program, is this value allocated on the heap, or is this value stored on the stack. I would prefer to let an optimizing compiler make these decisions automatically. This “optimizing compiler” could be a tracing JIT compiler. Javascript implementations are able to generate optimized code that uses unboxed representations, at runtime. FWIW even my simple hobby language is able to do this, although I don’t want to make grand general claims about it.
For this reason, I’m not a fan of languages like Rust. The Ocaml article title was a bit triggering because it says they are “oxidizing” Ocaml, which from my perspective means burning it to the ground and rendering it unusable for the style of programming I prefer.
You said:
I agree that such value types aren’t necessary for 99% of the code in a program, but applying it to the 1% of the code that’s a bottleneck can give you 50%-100% speed increase you can’t get any other way.
If the Ocaml feature was something that you could just apply only to hot code, that would be one thing. But the scheme they are proposing requires changing the type signature of the transitive closure of every function called by the hot code, in order to get the performance benefits. So every Ocaml library will eventually need to be “oxidized”. It’s a global transformation, turning simple code into complex code.
The alternative is for an optimizing compiler to apply unboxing transformations automatically. This typically means function specialization, where you generate multiple versions of the code for a function, specialized and optimized to different use cases. This can be done using inlining (what my hobby language does when it generates GPU code). There’s another front page Lobste.rs article where someone has cited Spiral, which uses partial evaluation to eliminate boxing. Zig uses compile time function specialization to eliminate the need for async “function colouring”. Julia does run time specialization, but they don’t use a tracing JIT compiler, they use a simpler scheme. When you call a function that is generic across argument types, they generate machine code that is specialized to the particular argument types that you are passing. Javascript and RPython have a more complex tracing JIT compiler.
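C++ templates are the explicit, compile-time version of that specialization idea (an analogy, not a claim about how Julia, Zig, or JS engines are built internally):

```cpp
// Each instantiation of the generic code is compiled for a concrete, unboxed
// element type -- the same end state a JIT or partial evaluator reaches.
#include <cstdio>
#include <vector>

template <typename T>
T Sum(const std::vector<T>& xs) {   // generic source code...
  T total = T();
  for (const T& x : xs) total += x;
  return total;
}

int main() {
  std::vector<double> a = {1.5, 2.5};
  std::vector<int> b = {1, 2, 3};
  // ...but Sum<double> and Sum<int> are two separately generated, specialized
  // functions, each looping over a contiguous, unboxed array.
  std::printf("%f %d\n", Sum(a), Sum(b));
}
```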
kernels or interpreters in Lisp/JS/Python are “not really a thing”, because those languages can’t express fast enough code.
The Lisp Machine kernel was written in Lisp. I can write GPU kernels in my simple, dynamically typed hobby language. The “Two little interpreters” post on Lobste.rs says that yes, you can write interpreters in dynamic languages and get decent performance, if they are implemented using a tracing JIT: https://stefan-marr.de/downloads/oopsla23-larose-et-al-ast-vs-bytecode-interpreters-in-the-age-of-meta-compilation.pdf
I just cited a whole range of implementation techniques for optimization without oxidization, and I think there is still lots more that can be done.
My perspective is that I would rather write simple high level code, where the code only reflects my high level semantic concerns
I totally agree, which is why I’ve spent a number of years writing a shell in Python :)
Turns out that performance is deeper than I thought. I learned lots of things the hard way :)
Actually as recently as 2 years ago I thought Oils would be fast enough without value types. And I even had an e-mail conversation with Graydon Hoare to that effect.
I knew that the “Graydon Rust” wasn’t playing in the C++ ballpark of performance – it was more like Go probably.
I thought it didn’t have value types. But Graydon was basically like “no it had value types; you need value types for performance”.
Obviously this depends on the application again – I’m talking about writing interpreters and kernels.
If the Ocaml feature was something that you could just apply only to hot code, that would be one thing. But the scheme they are proposing requires changing the type signature of the transitive closure of every function called by the hot code, in order to get the performance benefits.
Hm I don’t really see the problem here
The hot functions tend to be the low level ones with few things in the transitive closure. This is more like a feature than a bug.
Again the feature is to push OCaml/Java/C#/Go into the C++/Rust level of performance, without actually writing C++/Rust.
The alternative is for an optimizing compiler to apply unboxing transformations automatically
It’s definitely something we considered. But compiler transformations can fall off cliffs. They also can fail to compose when you re-order them.
It is an option, but if you want the whole system to be simple, I think value types are actually simpler.
The problem is that you want predictable performance.
Irrelevant things should not be present in the source code – but sometimes performance is relevant. Not just sometimes, but often!
Also, we are trying not to compromise on memory safety – so we have GC. GC is expensive, and we want to recover that performance.
If someone wants to implement the AST-JIT optimization for us, I’d be all for it. I’d like to see it in a production system.
I think it’s just way more complex and fragile than some simple value types. (though I haven’t actually implemented them yet – I could be wrong)
In Java or C#, all objects are passed by reference, not by value. That also means that two objects have a different identity, even if they have the same value. That’s why these languages need the concept of value types. In Go and Rust, it works the other way around. Everything is a value. And if you need a reference, then you use a reference type that references a value type. Interestingly, this is orthogonal to the question of allocating to the stack or the heap (using escape analysis etc.).
I’m sharing that because I think languages like Java, C#, Python and many others have tried to avoid the complexity of references/pointers by using only objects passed by reference, “hiding” the reference concept, but they have to reintroduce the value types anyway because they are more fundamental and, as you wrote, also necessary for performance (like allocating data structures in contiguous memory).
Yup that’s a big language design distinction – what’s the default? Values or pointers?
C++/Rust/Go/Swift favor value semantics, but they also have pointers (with C++ being unsafe because of that)
Java/C# have the other default, but right now only C# has value types.
I would say the value types are more “fundamental” for performance, and using the machine
But reference types are more “fundamental” because they match computer science / recursion a bit better. i.e. references and GC came from Lisp, which influenced OCaml, Java, etc.
I think a much more generic/useful way of thinking of value types (in line with java’s valhalla design) is in a semantic way — value classes are objects that lack identity. Leaving out this property allows for more optimizations, e.g. freedom to allocate on the stack, copy, etc.
One more possible thing we might want to give up is “synchronization” over the encapsulated state. E.g. we might not care that some thread might see our 2DPoint in a state where the x value has been updated, but the y not yet. This is called ‘tearing’, and according to the current plans primitive classes will look like this.
Sometimes letting the developer over-specify code behavior is actually worse from a performance perspective. I think Java’s approach is pretty cool here, letting 20-year-old code bases benefit from updates year after year, since the complexity lives mostly in the runtime. Also, SQL can in general be faster than a naive native implementation, because it only specifies the what, not the how. Of course, depending on what niche we target it may not be the correct choice.
What garbage will you be getting rid of?
Mainly a bunch of local lists in the word evaluator, which is basically in the inner loop of the shell evaluator.
I agree; the author also cherry-picks extreme cases and argues in favor of GC with no benchmarks whatsoever, or any other data point. He is on some sort of crusade against non-GC languages, which is a strange rock to die on. It feels like, “Rust is not so much better than Go if you use it in this specific way.” For many real-world scenarios – constrained environments, or where performance is critical and there aren’t a lot of allocations to be made – GC languages can’t be used.
In case this seems one-sided, I said GC “can be” good for performance.
But it can also be atrociously bad for performance! It’s extremely multi-dimensional and workload dependent.
e.g. idiomatic Python is like idiomatic Lisp/OCaml – it allocates a ton of tiny objects, which are expensive to collect. It’s no surprise that functional languages had so much GC research associated with them!
GC is also problematic with large heaps, and with threads
But the larger point was that you’re also kinda “damned” if you have a million lines of C++ with manual memory management. It simply doesn’t compose, and you will have bugs, and it warps the application architecture IMO. It’s intertwined with everything.
I have a lot of newfound respect for the “memory management problem”. I think there is basically a “trilemma” of C++ (manual), Rust (static), and GC languages. The holy grail would be to combine and interface them somehow, but all the problems occur at the boundaries (similar to gradual typing).
Python’s GC does ref counting with an occasional tracing round for finding circular references. RC is significantly slower than a tracing GC, it’s no accident that almost every managed language uses tracing GCs. So it may not be the best example.
Java’s model is extremely good and competitive. (One good benchmark might be the binary-trees program from the benchmarks game, where Java – or it might be Haskell currently, though its model is different – is the first managed language by far; the task makes zero sense with arena allocators, so non-managed languages can’t reasonably be compared.) Allocation goes through a thread-local bump allocator, so it’s no more expensive than stack allocation, besides checking whether there is enough space for the allocation. When the region gets full, the surviving objects are evacuated/moved out of it, and the whole “arena” is cleared.
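The fast path is roughly this (a toy sketch in the spirit of a TLAB, not HotSpot code) – a bounds check plus a pointer bump; the real cost comes later, when the region fills and survivors are evacuated:

```cpp
// Toy bump allocator: allocation is a check plus a pointer bump. Evacuation of
// survivors when the region fills is the expensive part (not shown).
#include <cstddef>

class BumpRegion {
 public:
  BumpRegion(char* mem, size_t size) : cur_(mem), end_(mem + size) {}

  void* Allocate(size_t n) {
    n = (n + 7) & ~size_t{7};             // keep 8-byte alignment
    if (cur_ + n > end_) return nullptr;  // region full: trigger a minor collection
    void* p = cur_;
    cur_ += n;                            // the "bump"
    return p;
  }

 private:
  char* cur_;
  char* end_;
};

alignas(8) static char g_nursery[1 << 20];   // 1 MiB region standing in for a per-thread nursery
static BumpRegion g_region(g_nursery, sizeof(g_nursery));
```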
it is a thread-local bump allocator, so it’s no more expensive than stack allocation
That isn’t entirely true. This kind of generational GC typically chews through a lot of memory in its allocation nursery, which causes a lot of write traffic through the caches due to object initialization – evicting cache lines for older objects and allocating fresh lines for newer ones. Whereas stack allocation walks back and forth over a relatively small area of memory, and object initialization is more often overwriting memory in cache without causing evictions.
I should have written “not significantly more expensive”, true, though if we compare it to the stack and allocate objects of similar lifetimes I don’t think there is a significant difference [1]. Sure, the actually hot cache line moves slowly forward as we allocate new stuff, jumping back for older objects – but that jump back also happens with stack allocation (say, we pass a pointer to a local variable), even if it might be a shorter jump due to the stack’s tendency to stay roughly at the same depth. Predictable forward movement in memory is very well-optimized and since only a single thread uses this memory, writing to it is no big deal from a cache perspective.
Of course I’m no expert at this low level, so please correct me if you know better :) I’m all for learning new things.
[1] Larger object size can actually be a problem, though. Not sure if Java could/does use a different object header here? It might not require everything in this special area.
RC is significantly slower than a tracing GC, it’s no accident that almost every managed language uses tracing GCs.
It depends. A case in point is Objective-C, which has used ref-counting since the days of OpenStep. Apple implemented an optional tracing GC for it ~2005, but it never took off; it was nice not having to write retain/release all the time, but it made apps slower on average, with especially annoying long pauses. Then automatic ref-counting came out ~2012 and took over.
I think in languages that allocate zillions of tiny objects, like Lisp or Python, RC might be slower. But RC’s advantages are that it can be combined with other approaches (viz. C++’s shared_ptr / unique_ptr / stack allocation), and that it tends to be highly incremental, avoiding noticeable pauses.
Similar idea as moving Rust values into a background thread to drop them… Hmmm. Sounds like it’s probably worth generally considering another axis of memory management beyond just “manual/automatic”: the axis of immediate/deferred. Deferred memory management seems very useful for threaded programs because it can figure out when to free things at runtime, which threaded programs generally need to do anyway. The cost of that is then potentially higher (theoretically even unbounded) peak memory usage.
This technique applies more generally beyond memory, e.g. in some situations a dedicated thread can make closing files much faster on Windows.
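A sketch of that deferred style for memory (a hypothetical Reaper helper, not any library’s API) – the hot thread hands objects off and a background thread pays the destructor/free cost:

```cpp
// Deferred reclamation: trade peak memory for keeping destructor/free work off
// the critical path.
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class Reaper {
 public:
  Reaper() : worker_([this] { Run(); }) {}
  ~Reaper() {
    { std::lock_guard<std::mutex> lk(mu_); done_ = true; }
    cv_.notify_one();
    worker_.join();
  }
  // shared_ptr<void> keeps the object (and its deleter) alive until the reaper drops it.
  void Defer(std::shared_ptr<void> obj) {
    { std::lock_guard<std::mutex> lk(mu_); pending_.push_back(std::move(obj)); }
    cv_.notify_one();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!done_ || !pending_.empty()) {
      cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
      std::vector<std::shared_ptr<void>> batch;
      batch.swap(pending_);
      lk.unlock();
      batch.clear();   // if these are the last references, destruction happens here,
      lk.lock();       // off the hot thread
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::shared_ptr<void>> pending_;
  bool done_ = false;
  std::thread worker_;   // declared last so the other members exist before Run() starts
};
```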
It seems it’s generally limited to deferring operations that are also satisfied at program exit, though; e.g. deferring a file flush would lose data if a segfault came along.
Objective-C’s autorelease is/was a form of this. In a ref-counted runtime without magic smart pointers you have the problem of how a function should return a reference that it’s the caller’s responsibility to release, e.g. an object it just allocated. Sometimes it’s just a naming convention and the caller has to get it right or else it leaks or causes eventual use-after-free bugs.
autorelease is a deferred version of release. It does nothing except promise that at some future time release will be called. So if you create an object, autorelease it and return it, everything’s the same as if you just returned a ref to an object you own.
(The cleanup is usually done in code at the bottom of the thread’s stack, at the end of every event-loop iteration where it can guarantee that no stack frame is pointing to the autoreleased object.)
A moving, generational GC periodically recompacts the heap. This provides insane throughput, since allocation is little more than a pointer bump! It also gives sequential allocations great locality, helping cache performance.
In my old age I’ve become skeptical about “no really, GC is just as fast as manual MM or ref-counting” claims. The above argument in particular ignores the fact that compacting or copying the heap is the worst thing you can possibly do for cache performance — you’re blowing away all your caches and are at the mercy of the bandwidth to/from RAM. Even a non-world-stopping GC will kneecap other threads’ performance this way.
Just wanted to refine this point a bit: using nontemporal copies you’ll impact the cache far less. x64 has this (in fact it’s usually faster than regular copies), and I think A64 does too.
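Concretely, “nontemporal” on x86 means something like this (SSE2 intrinsics; a sketch that assumes an aligned destination and a size that’s a multiple of 16):

```cpp
// Streaming stores write the destination without pulling it into the cache, so
// a big copy/compaction evicts far less of the mutator's working set.
#include <emmintrin.h>  // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#include <cstddef>

void StreamCopy(void* dst, const void* src, size_t nbytes) {
  // Assumes dst is 16-byte aligned and nbytes is a multiple of 16.
  auto* d = static_cast<__m128i*>(dst);
  const auto* s = static_cast<const __m128i*>(src);
  for (size_t i = 0; i < nbytes / 16; ++i) {
    _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  // NT store bypasses the cache
  }
  _mm_sfence();  // order the streaming stores before subsequent normal stores
}
```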
Yes, yes. For all the stories over all these years about GC’s theoretical performance benefits, I would expect all the system software and performance-critical software to be GCed by now, even with all the system software engineers vehemently opposed, because they really love tracking their allocations by hand.
First, most if not all of these benefits come with big asterisks, like “yeah, we’re going to need to preallocate all your VM instance memory for the Java GC heap for it to work well”. Which is a hefty price to pay in system software (or any software really). Or “we’re going to need to do write barriers on pointer updates”. Or “in certain scenarios performance is going to grind to a halt, because we have expectations about your usage patterns”.
I don’t think systems programmers are against GCs per se. I would be very happy with some magical GC<T> in Rust that I could opt into when it suits me. But that’s never how it works, is it? The single biggest problem with GCs is that they always bring in some crappy runtime with them that I don’t want, as it will not compose with any other crappy GC runtime.
RCU and ARC are kind of neat because even though they can be considered a form of GC, they can actually be used locally, for a certain purpose, without affecting other parts of the system. Same with arenas and stuff like that. That’s why they get used in system software, and general “GC” does not. The performance is a secondary consideration, though it is kind of nice to get other camps annoyed that their GCed software is also slower – which is not even all that much caused by the GC fundamentals, but just encouraged by carelessness about memory management and similar considerations (“let’s use dynamic dispatch everywhere, scatter our data all over the heap pointing at each other, and sell it as the best thing since sliced bread”).
how is rcu_read_unlock() supposed to work, without passing Foo* as an argument? for rust, i’d guess you’d use something like Cell<Arc<Foo, QueueAllocator>> for some value of QueueAllocator.
the argument for GC, after describing a custom allocation strategy, seems like a non-sequitur.
Isn’t memory footprint a big issue with gc vs malloc? To get throughput you want to run gc rarely, so you need more memory for the accumulating garbage. It doesn’t slow down the gc because most objects will be unreachable from the root objects. But it does require more memory for the same functionality, or you run the gc more often and get more slowdown.
Is this story substantially wrong? Honest q, I know little about gc
Yes, it’s usually true. For example, Go has a GOGC tuning parameter that lets you adjust the trade-off between CPU overhead and memory overhead.
Yes definitely, Chris Lattner has said that Swift (and Objective C) use ref counting because GC would mean that every iPhone needs 20 GiB of RAM instead of 16 GiB, and that costs real money. Or something to that effect.
The overheads seem to be:
Per-object GC headers, to support tracing, etc. HotSpot JVM seems to use 16 bytes per object, which is a lot. My project uses 8 bytes per object, although we have many more limitations than the JVM too (not generational, etc.). (Rough numbers are in the sketch after this list.)
C++ programmers can cram a ridiculous amount of information in 8 or 16 bytes. Our GC uses 24 and 48 byte pool allocators, which gives a sense of how tiny most objects are, and thus how much the relative overhead is.
Collecting later than necessary, as you mention (although this is complex and depends on the OS – does sweeping in your process actually return memory to other applications?)
Most fast garbage collectors use a Cheney/copying generation for the small/young generation. It’s small, but there’s still 2x overhead there.
GC tends to be associated with “object soup” like heaps. Instead of tight packing like C/C++, you have lots of pointers. All those pointers take a lot of space, especially in the 64-bit world. Again, you could pack a lot of other information in that 8-byte pointer.
Once you have GC, libraries will tend to allocate lots of objects. Even if application programmers are careful, they will likely be building on big libraries that they don’t control.
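Putting rough numbers on the header and pointer items above (illustrative layouts only, not any particular runtime’s):

```cpp
// Rough arithmetic behind the header/pointer overhead on 64-bit.
#include <cstdint>
#include <cstdio>

struct PackedToken {       // "C++ style": everything crammed into 16 bytes
  uint32_t offset;
  uint16_t length;
  uint16_t kind;
  uint64_t flags;
};

struct GcHeader {          // e.g. mark bits, type id -- often 8-16 bytes in practice
  uint32_t type_id;
  uint32_t gc_bits;
};

struct HeapToken {         // GC'd object-soup style: header plus pointer fields
  GcHeader header;
  HeapToken* next;         // 8-byte pointer
  const char* text;        // another 8-byte pointer to a separate allocation
};

int main() {
  std::printf("packed: %zu bytes, heap object: %zu bytes (plus what it points to)\n",
              sizeof(PackedToken), sizeof(HeapToken));
}
```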
I suspect the latter issue is why the D language had a split in its stdlib – GC vs. no GC.
Despite all that, I still like GC, for memory safety, and for composable application architecture. I think it should be paired with multi-process programs (Unix style), so you can actually use an entire machine. It’s almost impossible to use a full modern machine with a single GC heap. You need a bunch of shared-nothing programs for full utilization.
Really? Why not multithreading? Interacts badly with gc?
I’m not an expert in this area, but I think if you want to use say a 32 GiB heap with a thread per core, say 16 or 64 cores, you might be limited to some special and very application-tuned JVMs
And probably .NET. There are not that many options for those kinds of workloads AFAIK
Not to say nobody does it, but I think shared nothing is a lot easier
Inherently, GC has a lot of contention. There are obviously ways around that, but they are complex
Erlang kind of shows the point – the “threads” don’t share memory, etc. They all have separate GCs by design, and you’re copying between “threads” / Erlang processes, not sharing.
What about Go, which promises millions of goroutines multiplexed on a thread pool utilizing all of your CPUs? I assume the Go runtime handles this well since it’s “the” use case for Go