Here are two things I noticed about this benchmark:

1. A minor issue, but the Go program has a data race. The `i` loop variable is captured in the spawned goroutines while it is mutated by the outer loop. Fixing it doesn't change the results of the program. It's also immediately caught when built with the `-race` flag. Rust being data race free by construction is a super power.

2. Changing the number of spawned goroutines/threads to 50,000 makes a noticeable difference between the two on my machine. The Go program still completes in a timely fashion, with ~5x the RSS and no negative impact on the rest of my system:

   ```
   real 11.25s
   user 41.49s
   sys 0.70s
   rss 135136k
   ```

   but the Rust version immediately causes my system to come to a crawl. UIs stop responding, audio starts stuttering, and eventually it crashes:

   ```
   thread 'main' panicked at 'failed to spawn thread: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }', src/libcore/result.rs:1189:5
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
   Command exited with non-zero status 101
   real 16.39s
   user 23.76s
   sys 98.50s
   rss 288132k
   ```

So while goroutines may not be significantly lighter than threads as far as this one measure is concerned, there is definitely some resource that threads are taking up that goroutines are not. I checked some production monitoring systems to see how many active goroutines they had, and they were sitting comfortably in the 20-30k range. I wonder how they would fare if they were using threads.
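For context, here is a minimal sketch of the capture bug from point 1 (my illustration, not the article's code; Go 1.22 later made loop variables per-iteration, which removes this class of bug):

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Data race (before Go 1.22): this reads the shared loop
			// variable i while the outer loop keeps mutating it.
			fmt.Println(i)
		}()
	}
	wg.Wait()
}
```

Running with `go run -race` reports the race immediately; the usual fix is to pass the variable as an argument, `go func(i int) { … }(i)`, so each goroutine gets its own copy.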
> Rust being data race free by construction is a super power.

Well, it also necessarily prohibits entire classes of useful and productive patterns… it was a design decision, with pros and cons, but certainly not strictly better.
Strong +1. I wish, for application development, it were possible to remove lifetimes, the borrow checker, and manual memory management (which are not that useful for this domain) and to keep only fearless concurrency (which is useful even (more so?) in high-level languages). Alas, it seems that Rust's thread-safety banana is tightly attached to the rest of the jungle.
FWIW, Ponylang has exactly this:

- actors that `send` data
- a mark-and-no-sweep GC
- `Send` and `Sync` traits in its reference capability system, which provide the same static guarantees

As with everything, though, it has its own trade-offs, whether in its ref-cap system, the lack of explicit control in the system, etc.
To prevent that Rust error, I believe you need to change `vm.max_map_count`: https://github.com/rust-lang/rust/issues/78497#issuecomment-730055721

That seems to me like the kind of thing Rust should do to some extent on its own!
I'd be interested to see how many threads ended up scheduled with the Go example. My guess is not many at all: IIRC the runtime recognizes those sleeps as a safe place to yield control from the goroutines. So I suspect you end up packing them densely onto threads.
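One way to check that guess (a sketch of my own, not from the thread): spawn a pile of sleeping goroutines and compare the goroutine count against the number of OS threads the runtime has created, via the built-in `threadcreate` profile:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(2 * time.Second) // parks the goroutine; no OS thread is blocked
		}()
	}
	time.Sleep(500 * time.Millisecond) // give the scheduler time to park everything
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("OS threads:", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```

If the dense-packing guess is right, the thread count should stay near GOMAXPROCS plus a handful of runtime threads, while the goroutine count sits at ~10k.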
I think threads are better than a lot of people think, but in this case the measurement strikes me as a bit naive. Go very likely consumes more memory than strictly needed because of the GC overhead. The actual memory occupied by goroutines could be less than half the memory used in total, but here we only see goroutine overhead + GC overhead without a way of separating them.
What overhead specifically? Stack maps for GC?
The benchmark doesn't seem to have many heap objects, and I thought Go used a conservative/imprecise GC anyway, so objects don't have GC space overhead (mark bits, etc.). Although I think the GC has changed many times and I haven't followed all the changes.
The GC has been fully precise since Go 1.4.
OK that’s what I suspected, but as far as I can tell it’s a moot point because the benchmark doesn’t have enough GC’d objects for it to matter. Happy to be corrected.
I think when people speak of GC overhead they’re not talking about the per-object overhead of tracking data. Rather, the total amount of memory space one needs to set aside to cover both memory actually in use, and memory no longer in use but not yet collected: garbage. This can often be quite a lot larger than the working set, especially if there is a lot of churn.
Where does that exist in the benchmark?
I see that each goroutine calculates a hash and calls time.Sleep(). I don’t see any garbage (or at least not enough to know what the grandparent comment is talking about)
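For concreteness, the per-goroutine work being described is roughly this shape (my reconstruction from the comment, not the article's exact code), and it allocates almost nothing per goroutine:

```go
package main

import (
	"crypto/sha256"
	"strconv"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Sum256 returns a [32]byte by value, so the hash itself
			// produces no garbage; the small byte slice is about the
			// only heap allocation per goroutine.
			_ = sha256.Sum256([]byte(strconv.Itoa(i)))
			time.Sleep(time.Second)
		}(i)
	}
	wg.Wait()
}
```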
The space can be reserved for future collections anyway. It's well known that GCs require more space than is strictly needed for the live objects.
Alternative title: Threads incur 3x more memory overhead than coroutines, but it’s still not very much, so you probably won’t mind.
*on Linux. But I don’t feel particularly bamboozled by the title.
Sounds more like an alternative opening paragraph :)
True :)
I’ve always been told that each OS thread consumes kernel resources as well. RSS isn’t going to account for that. Anyone have an idea how much that is? (Of course it’ll vary by OS.)
And of course the article doesn’t claim to consider performance, but creating, destroying or switching OS threads requires a syscall, whereas “green” threads don’t.
That's true, and that indeed was missed in the article, mea culpa. In particular, the memory for the page tables themselves is allocated eagerly and not counted towards RSS. In this example, 10k threads require about 40 MB of page tables (I looked at `/proc/pid/status`).

I think the bigger issue is not context switches, but rather non-memory resources. As in, you can't spawn 100k threads without tweaking the system's config: https://github.com/jimblandy/context-switch#running-tests-with-large-numbers-of-threads
> And of course the article doesn't claim to consider performance, but creating, destroying or switching OS threads requires a syscall, whereas "green" threads don't.

For context switches, the story is interesting, and I haven't seen conclusive benchmarks there. A goroutine-to-goroutine switch is indeed massively faster than thread-to-thread, as there is no equivalent cooperative scheduling API for OS threads. But often the reason for the switch is IO, and there goroutines do more work (as both the kernel and the user-space scheduler are involved). How does this pan out in practice? I suspect strongly in favor of goroutines, but I haven't seen a conclusive benchmark.
The post should mention that you can choose the stack size of a Rust thread using the thread builder API.
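For reference, a minimal sketch of that API (`std::thread::Builder`; the 32 KiB stack is just an illustrative value):

```rust
use std::thread;

fn main() {
    let handle = thread::Builder::new()
        .stack_size(32 * 1024) // request a 32 KiB stack instead of the platform default
        .spawn(|| {
            // thread body goes here
        })
        .expect("failed to spawn thread");
    handle.join().unwrap();
}
```

Unlike plain `thread::spawn`, `Builder::spawn` returns an `io::Result`, so the "failed to spawn thread" condition quoted earlier in the thread can be handled instead of panicking.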
You need to size a thread's stack up front, but goroutines can grow their stacks as they go.
I think on modern systems that size only controls the maximum. Threads start with two pages of memory (data + guard page), and the actual RAM allocation grows dynamically when the stack hits the guard page.
Good to know! That makes a lot of sense.
There are two somewhat conflated things that get consumed:

- Memory
- Virtual address space
By default, on FreeBSD (probably elsewhere?) a new thread consumes 8 MiB of virtual address space for its stack. That’s 23 bits of address space. 2^13 threads (around 10K, rounded to a power of two to make the maths simpler) consumes 36 bits of address space. On a 64-bit system you typically have 47-48 bits of address space for a userspace process, so this is completely trivial. On a 32-bit system you typically have 31 bits of address space for userspace and so now it’s gone (on FreeBSD, stacks are only 4 MiB on 32-bit systems, so this would actually only consume 35 bits of address space - still more than you have).
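Spelling that arithmetic out (my addition, using the 64-bit numbers above):

$$
2^{13}\ \text{threads} \times 2^{23}\ \text{bytes per stack} = 2^{36}\ \text{bytes} = 64\ \text{GiB}, \qquad \frac{2^{36}}{2^{47}} = 2^{-11} \approx 0.05\%
$$

of the available userspace address space.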
Most operating systems, however, lazily allocate physical memory to back virtual address space. On FreeBSD, stacks are mapped with `MAP_STACK`, which tells the OS that the usage will grow downwards; this helps it allocate physical pages that are more amenable to transparent superpage promotion if the stack actually does grow, but it will still (typically) allocate only one page initially. On Windows, this is a lot more complex, and the userspace process actually communicates to the kernel how much stack it's using (on most *NIX systems, the kernel doesn't actually know that a memory mapping is a thread's stack; it's an entirely userspace construct, and userspace is entirely free to pivot the stack to a different allocation at any point).

Most of the time, I care about memory usage, not virtual address space usage.
This entire article seems to miss the point of goroutines, which is to make context switching between them cheap by avoiding system calls.
Making the workload a bunch of sleeps is not at all a showcase, because they'd just wind up putting threads to sleep.
The article is explicit about not discussing context switches. The point is to succinctly demonstrate that Linux threads do not use orders of magnitude more RAM. It isn’t a discussion of the tradeoffs between various concurrency models.
I’d love to read an article that quantifies the overall macro difference between threads, stackful coroutines and stackless coroutines myself.
> The article is explicit about not discussing context switches. The point is to succinctly demonstrate that Linux threads do not use orders of magnitude more RAM.

Well, OK, but then the title is more than a little disingenuous, isn't it?
I don’t think so. To me, the title seems to be a succinct summary of the content. I do wish I had s/light/small/ before publishing, but “small” didn’t occur to me back then. Nevertheless, light/heavy is used when talking about size in memory, and my usage seems to be OK.
It's also true that people often (more often, even) use "light" to describe the overall performance difference in this context, and the title is ambiguous. There's the first paragraph for disambiguation.
Aesthetically, I just love loud, ambiguous titles, à la "Linear Types Can Change the World" :-) I understand that this leads to failures in communication in some cases, but I don't see it as a big problem.
The problem is that goroutines are significantly lighter than threads, when using the more common definition of light. That’s not ambiguity.
I wonder if this would still be true if the coroutines involved were doing non-trivial work (channel operations, waiting on an input device, math, etc.). You know, subjecting the 'routines to the ravages of reality.
It would be more true, I would think: the benchmark measures pure overhead, so if you add more stuff that consumes memory, the relative difference will get smaller.
Although, given that the examples you give are related to processing rather than memory, I want to emphasize that the benchmark is valid only for memory usage. It would be wrong to use it to measure the speed difference.
> The most commonly cited drawback of OS-level threads is that they use a lot of RAM.

Er, when people say "goroutines are lighter than threads" they are not speaking exclusively about memory. "Lightness" is much more about the cost to create, switch between, and destroy. And those things are the main drawbacks of OS threads, in my experience.
Well I guess it’s settled then.
Could you elaborate on this a bit?
From what I’ve heard from the Go language authors before, the main point of concurrency in Go is not performance, but to provide a way to structure programs.
I’m guessing here, but I think that is why there are no counter arguments.
Can somebody compare with Java threads too :P
If you’re going to do that, make sure to do Loom fibres as well
Hmm. Are Java threads that far an abstraction above OS threads?
No, they aren't.