I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.
The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.
ESBuild is a really impressive performance-oriented project:
The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch: https://github.com/rust-lang/rust/issues/34283.
Objective-C is refcounted. One way you can decrement a refcount is [obj release] and that does what you expect, destroying the object when the refcount hits 0. But if you’re writing Objective-C in the first place you’re probably using it via Cocoa or UIKit which is probably running you in an event loop. In this case you can instead [obj autorelease] which defers the refcount until this run of the event loop is over. The nice thing about that is that you can return very quickly from the current (probably interactive) action with your real work done and wait to do the bookkeeping work until you’d probably be waiting on I/O anyway.
It would be nice if there were a way to structure rust arenas in such a way that the “current request”, whatever that means for your program, could gather up all of this pending work like this until you know you have some downtime. Or if you’re a short command line program, to never do it at all and wait for the process termination to clean everything up. This would really have to be at the language level since most of those drop insertions are going to be in third party library code.
(I don’t write much Objective C or Swift and welcome corrections on the details)
Yes but it isn’t a global “do refcounting everywhere”, it’s a utility function that lets you refcount a single object. You can’t force libraries to do it.
Uhm… I understand it’s a contrived example, but why on earth any such function that only needs to read something from HeavyThing’s state wouldn’t take it by reference? This would avoid the whole problem.
This is one of the reasons that chromium uses a GC for much of its c++. If you can schedule cleanup work you can significantly improve interactive performance.
It’s not really about Rust. What Rust does is insert the deallocation code automatically when the owner goes out of scope, but C or C++ code would need to deallocate an equivalent structure at some point (or they could decide to never deallocate and just let it leak, which is also an option in Rust), and the cost of deallocation isn’t based on the language. Chasing 1,000,000 pointers to free what they’re pointing to is going to be expensive in any language.
This still seems weird…like, why is the caller freeing what’s passed in? Is there some additional context (say, this function is the last one being invoked in the lexical scope of the lifetime of the value being passed in) here we’re not being given?
Or is Rust really using copy-on-pass semantics that require it to free the copies made during function invocation?
Rust uses move semantics for owned objects, which in this case effectively means “the compiler knows how to figure out where to insert free()/drop() calls for you”. The single-threaded example turns into:
The caller isn’t freeing it. The caller is moving the value into the callee and gives up ownership.
The callee is now the sole owner of the value and drops it implicitly (if it isn’t passed around further or passed back as a return value).
icefox and freddyb gave good answers and my Rust knowledge is not the deepest, but I wanted to address this part:
why is the caller freeing what’s passed in?
The main alternative to the “move” semantics when passing an argument to a function is to pass a reference instead, in which case the called function would be “borrowing” the value instead of owning it outright. In that case Rust would not insert the deallocation code at the end of the function because the compiler knows that the caller is not done with the value.
Depending on the purpose of the function and how it fits in with the rest of the code, another option for controlling when deallocation happens could be to pass in a reference and have a caller somewhere up the stack decide exactly when to drop it.
The other answers all cover aspects, but I feel they may not be clear enough.
When something is passed directly (not by reference) to a function, it will either be moved or copied (depending on whether the type implements the Copy trait, which is only possible for types which can be freed without chasing pointers). In the case of a move, the data now belongs to the callee function, and when the owner (the formal parameter name) goes out of scope (unless ownership is passed back to the caller in the function return, this would be at the end of the function), the Rust compiler inserts a drop call to free the owned data (calling the type’s Drop trait implementation). In the case of a copy, the same still happens, but the callee is working with a distinct copy of the data, so the caller’s copy of it is still valid and will be dropped when its own owner goes out of scope.
Doesn’t everything have this issue? It’s just the old memory management trade-off of “would you rather your performance be bounded by the size of the set of things you’re deallocating, or the size of the live set?”, isn’t it?
You can deterministically deallocate things as their owner goes out of scope, which is sometimes going to involve chasing pointers to a large number of interior allocations at an inconvenient moment (like the vector-of-vectors here), so you might want to move ownership onto a less critical thread and deallocate in the background, as shown here or as in COM or some other C & C++ codebases, or stop allocating from general purpose heap, create a specific arena for your large set of allocations, and deallocate the arena in one go.
Or you can have a GC that does a mark-and-sweep and have the same “spend a lot of time chasing pointers to a large number of allocations“ problem, but at a less predictable time.
Or you can have a GC that does a copy-and-compact, which might let you avoid the interior pointer chasing to a large number of allocations (if there’s nothing like finalizers that need invoking), but now you’ve got the opposite problem in that your performance is bounded by the size of the live set instead of what’s being deallocated.
What would a language that doesn’t have this issue do instead?
Zig makes this pretty easy.
The stdlib data structures, and anything else that needs to allocate, take an allocator parameter, which can be an arena (and one is provided by the stdlib.)
I know the c++ stdlib allows you to specify allocators etc, but I was wondering about a “more natural” way of doing it.
In many languages we anchor the constructor to a type: f = foo.New() or similar. That ctor will need to grab memory, which is generally stack, heap or custom allocator.
I guess I was wondering what it would be like if we did something like alloc through an owning object:
func handleRequest(r Request)
m = r.New(Map) // allocate a map owned by the request. When that goes, so does the map
f = m.New(foo) // allocate a foo owned by the map. When 'm' goes, so does f
m["myfoo"] = f
...do stuff...
return // neither m nor f cleaned up, but when caller kills 'r', they are thrown away
}
If the natural way to alloc was with an owning object, we’d get arena-allocation by default.
You’d need other ways of allocating to handle exceptional cases, but they could be done via a global or similar. Which seems to make sense, since an allocation which isn’t tied to an owning object (and needs to live beyond a request or equivalent) is global.
I assume C++ and rust (and Zig!) all allow this, but I’m wondering how far we can take “first class ownership, allocation and freeing” to enable the default use of memory pools.
Awesome tip and well written. It does make me worry that everyone reading this will insert this threading/drop behavior every time which is certainly an over optimization in most cases. Maybe I’m worrying over nothing though.
I agree that this is the sort of thing you should do after profiling performance and identifying it as an issue, and to be aware of this as a pitfall of large data structures with interior pointers during architecture/design of a system.
I would say that this is a kind of premature optimization in 99.9% cases. Where are we going as software developers if we have to spawn a thread to just release memory?
Maybe I’m just old fashioned or maybe I’m just shocked.
Are there any tools for Rust to make checked lifetimes more apparent to the programmer? Like is there a way to visualize the extent of a lifetime in any IDE?
I’ve seen this pattern in various RPC systems I’ve worked on where there will either be a request scoped arena or ‘trash bin’ which gets released after the RPC returns. Nice to see it works well for CLIs as well.
Interestingly it looks like in the fast case Go is much faster while in the slower case it is only slightly faster.
$ cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `target/release/drop-thread`
drop in another thread 72.731µs
drop in this thread 277.387825ms
$ go run main.go
drop in another thread 13.101µs
drop in this thread 224ns
In Go the “drop in this thread” is ns, not µs. In Rust the “drop in this thread” is ms, not µs. So I’d guess that Go’s “drop in this thread” is probably not bothering to drop anything at all, since the program ends before the GC gets invoked. Depending on how you wrote it, of course.
So I’d guess that Go’s “drop in this thread” is probably not bothering to drop anything at all, since the program ends before the GC gets invoked.
Yes, it would be equivalent (in time) to using std::mem::forget in Rust (don’t do that!). It’s a source of bugs that pop up in some Go code that interacts with C libraries. People will use runtime.SetFinalizer to call a C function that finalizes some resource. However, there is no guarantee that the GC runs before the program ends and thus the finalizer may never run.
Comparing ‘drop in another thread’ between Go and Rust is also not so interesting. It’s measuring the time of starting a goroutine (cheap) vs. a native thread (more expensive).The time would probably go down for Rust when using a thread pool.
This is entertaining because it’s adding the laziness of GC back to Rust in a controllable way!
A great way of looking at it, yeah!
Looks like Evan Wallace ran into the same issue in practice in esbuild
https://news.ycombinator.com/item?id=22336284
ESBuild is a really impressive performance-oriented project:
https://github.com/evanw/esbuild
Objective-C is refcounted. One way you can decrement a refcount is
[obj release]
and that does what you expect, destroying the object when the refcount hits 0. But if you’re writing Objective-C in the first place you’re probably using it via Cocoa or UIKit which is probably running you in an event loop. In this case you can instead[obj autorelease]
which defers the refcount until this run of the event loop is over. The nice thing about that is that you can return very quickly from the current (probably interactive) action with your real work done and wait to do the bookkeeping work until you’d probably be waiting on I/O anyway.It would be nice if there were a way to structure rust arenas in such a way that the “current request”, whatever that means for your program, could gather up all of this pending work like this until you know you have some downtime. Or if you’re a short command line program, to never do it at all and wait for the process termination to clean everything up. This would really have to be at the language level since most of those
drop
insertions are going to be in third party library code.(I don’t write much Objective C or Swift and welcome corrections on the details)
Isn’t rust able to do refcount as well? https://doc.rust-lang.org/std/sync/struct.Arc.html
Yes but it isn’t a global “do refcounting everywhere”, it’s a utility function that lets you refcount a single object. You can’t force libraries to do it.
And https://doc.rust-lang.org/std/rc/struct.Rc.html
Uhm… I understand it’s a contrived example, but why on earth any such function that only needs to read something from HeavyThing’s state wouldn’t take it by reference? This would avoid the whole problem.
Then the caller would be the one that’s slow. The problem is still there, just in a different place.
But this wouldn’t happen at that critical time. Which is basically what a thread would do, too.
I mean, I believe there’s a problem, I’m just having a hard time imagining a practical example.
This is one of the reasons that chromium uses a GC for much of its c++. If you can schedule cleanup work you can significantly improve interactive performance.
What?
Surely this is wrong. Does Rust have this issue?
It’s not really about Rust. What Rust does is insert the deallocation code automatically when the owner goes out of scope, but C or C++ code would need to deallocate an equivalent structure at some point (or they could decide to never deallocate and just let it leak, which is also an option in Rust), and the cost of deallocation isn’t based on the language. Chasing 1,000,000 pointers to free what they’re pointing to is going to be expensive in any language.
This still seems weird…like, why is the caller freeing what’s passed in? Is there some additional context (say, this function is the last one being invoked in the lexical scope of the lifetime of the value being passed in) here we’re not being given?
Or is Rust really using copy-on-pass semantics that require it to free the copies made during function invocation?
Rust uses move semantics for owned objects, which in this case effectively means “the compiler knows how to figure out where to insert
free()
/drop()
calls for you”. The single-threaded example turns into:and the multi-threaded example is, as described:
This is Rust’s real secret sauce.
The caller isn’t freeing it. The caller is moving the value into the callee and gives up ownership. The callee is now the sole owner of the value and drops it implicitly (if it isn’t passed around further or passed back as a return value).
icefox and freddyb gave good answers and my Rust knowledge is not the deepest, but I wanted to address this part:
The main alternative to the “move” semantics when passing an argument to a function is to pass a reference instead, in which case the called function would be “borrowing” the value instead of owning it outright. In that case Rust would not insert the deallocation code at the end of the function because the compiler knows that the caller is not done with the value.
Depending on the purpose of the function and how it fits in with the rest of the code, another option for controlling when deallocation happens could be to pass in a reference and have a caller somewhere up the stack decide exactly when to drop it.
The other answers all cover aspects, but I feel they may not be clear enough.
When something is passed directly (not by reference) to a function, it will either be moved or copied (depending on whether the type implements the
Copy
trait, which is only possible for types which can be freed without chasing pointers). In the case of a move, the data now belongs to the callee function, and when the owner (the formal parameter name) goes out of scope (unless ownership is passed back to the caller in the function return, this would be at the end of the function), the Rust compiler inserts adrop
call to free the owned data (calling the type’sDrop
trait implementation). In the case of a copy, the same still happens, but the callee is working with a distinct copy of the data, so the caller’s copy of it is still valid and will be dropped when its own owner goes out of scope.Doesn’t everything have this issue? It’s just the old memory management trade-off of “would you rather your performance be bounded by the size of the set of things you’re deallocating, or the size of the live set?”, isn’t it?
You can deterministically deallocate things as their owner goes out of scope, which is sometimes going to involve chasing pointers to a large number of interior allocations at an inconvenient moment (like the vector-of-vectors here), so you might want to move ownership onto a less critical thread and deallocate in the background, as shown here or as in COM or some other C & C++ codebases, or stop allocating from general purpose heap, create a specific arena for your large set of allocations, and deallocate the arena in one go.
Or you can have a GC that does a mark-and-sweep and have the same “spend a lot of time chasing pointers to a large number of allocations“ problem, but at a less predictable time.
Or you can have a GC that does a copy-and-compact, which might let you avoid the interior pointer chasing to a large number of allocations (if there’s nothing like finalizers that need invoking), but now you’ve got the opposite problem in that your performance is bounded by the size of the live set instead of what’s being deallocated.
What would a language that doesn’t have this issue do instead?
Probably this:
I think it’s probably true for many applications that the common case is all pointers stored in a collection (transitively):
a) are allocated in the same place, specifically to be placed in that collection
b) are not stored anywhere else
c) have finalisers who are only chasing and freeing other pointers allocated in the same transitive tree rooted in the collection
and so releasing the collection would be quick+easy if this knowledge was given to the language/runtime.
A language which made this easy would do well here.
We’re paying for the cost of the fully general case all the time, when we probably don’t need to.
Zig makes this pretty easy. The stdlib data structures, and anything else that needs to allocate, take an allocator parameter, which can be an arena (and one is provided by the stdlib.)
Thanks, that’s interesting!
I know the c++ stdlib allows you to specify allocators etc, but I was wondering about a “more natural” way of doing it.
In many languages we anchor the constructor to a type:
f = foo.New()
or similar. That ctor will need to grab memory, which is generally stack, heap or custom allocator.I guess I was wondering what it would be like if we did something like alloc through an owning object:
If the natural way to alloc was with an owning object, we’d get arena-allocation by default.
You’d need other ways of allocating to handle exceptional cases, but they could be done via a global or similar. Which seems to make sense, since an allocation which isn’t tied to an owning object (and needs to live beyond a request or equivalent) is global.
I assume C++ and rust (and Zig!) all allow this, but I’m wondering how far we can take “first class ownership, allocation and freeing” to enable the default use of memory pools.
Awesome tip and well written. It does make me worry that everyone reading this will insert this threading/drop behavior every time which is certainly an over optimization in most cases. Maybe I’m worrying over nothing though.
I agree that this is the sort of thing you should do after profiling performance and identifying it as an issue, and to be aware of this as a pitfall of large data structures with interior pointers during architecture/design of a system.
Pro tip: You can do the exact same thing in C as well.
Proof is left to the reader.
I would say that this is a kind of premature optimization in 99.9% cases. Where are we going as software developers if we have to spawn a thread to just release memory? Maybe I’m just old fashioned or maybe I’m just shocked.
Are there any tools for Rust to make checked lifetimes more apparent to the programmer? Like is there a way to visualize the extent of a lifetime in any IDE?
This is horrible and I love it.
I’ve seen this pattern in various RPC systems I’ve worked on where there will either be a request scoped arena or ‘trash bin’ which gets released after the RPC returns. Nice to see it works well for CLIs as well.
Interestingly it looks like in the fast case Go is much faster while in the slower case it is only slightly faster.
In Go the “drop in this thread” is ns, not µs. In Rust the “drop in this thread” is ms, not µs. So I’d guess that Go’s “drop in this thread” is probably not bothering to drop anything at all, since the program ends before the GC gets invoked. Depending on how you wrote it, of course.
Yes, it would be equivalent (in time) to using
std::mem::forget
in Rust (don’t do that!). It’s a source of bugs that pop up in some Go code that interacts with C libraries. People will useruntime.SetFinalizer
to call a C function that finalizes some resource. However, there is no guarantee that the GC runs before the program ends and thus the finalizer may never run.Comparing ‘drop in another thread’ between Go and Rust is also not so interesting. It’s measuring the time of starting a goroutine (cheap) vs. a native thread (more expensive).The time would probably go down for Rust when using a thread pool.
Haha, good catch! :)