Makes me really happy to see node come out in front of the non-compiled on the higher end benchmarks. There’s so much trash talk out there about it as a runtime and a language, but in the async do-a-ton-of-things domain, it totally holds its own.
But native Linux threads used from Rust seem to be lightweight enough that at 10k threads the memory consumption is still lower than the idle memory consumption of many other runtimes
IIUC, Linux threads are copy-on-write, so launching a bunch of threads is essentially free. You don’t pay until the thread contents diverge.
Just to add: so, a thread that doesn’t do much will use 1 page (usually 4kB, or 16kB on ARMv8) and about 8MB of address space (glibc default, you can configure it when creating a thread, apparently musl defaults to 128kB). Plus about… I’m guessing roughly 128 bytes ish for the page table entries themselves? Someone please correct me on the size of page table entries, I would love to know.
It’s probably a bit more than that. A thread also has:
Some kernel data structures for accounting and so on. At the very least, this must be sufficient to hold the saved thread state (register file, signal mask, and so on). The register state for AVX (not AVX-512) is 1 KiB, I think some cores with AVX-512 can have a bit over 3 KiB in total.
A pthread_t structure owned by the threading library (typically libc) for all of the userspace state associated with a thread.
Some space allocated for the TLS region. If you have a lot of thread-local storage, this can be more than the stack for small threads.
I’m not sure what glibc does, some threading implementations allocate the thread stack, the initial-exec / local-exec TLS space, and the thread control structure in the same place, which may let you fit all of them into a page. Linux doesn’t care: the only thing that the clone system call knows about is the start of the stack and the initial code pointer, so you can put everything else on a page after the stack if you’re not worried about stack buffer overflows corrupting the state.
The size of the PTEs is variable and depends a bit on where the stack ends up in the address space. With a five-level page table, if it’s in a completely new place in the address space then it will use four new 4 KiB pages. More commonly, it will either go into an existing leaf or allocate just one more leaf. This means that the page table space used will be either zero or 4 KiB. For an 8 MiB stack, you need 2048 PTEs. Each PTE is (currently) 8 bytes, so that’s 16 KiB. That’s four fully populated page table pages, plus four 8-byte entries on another page (or two if they’re poorly aligned) to point to them.
This is where it gets a bit more interesting. As I recall, on Linux, the page table itself is the canonical data structure but I think it does some sharing. In particular, I believe that full page tables pointing to CoW zero pages are, themselves, a CoW page, so three of these page tables will all be pointing to a single statically-allocated page full of not-present entries. On FreeBSD (and other systems that use a Mach-derived VM subsystem, such as XNU), the page table is not the canonical data structure and is lazily generated from a range tree. The mapping will create a vm_object_t and, as pages are touched, this will be populated with vm_page_ts, which will be used to populate the real page tables.
The register state for AVX (not AVX-512) is 1 KiB, I think some cores with AVX-512 can have a bit over 3 KiB in total
It’s not quite that big. AVX is 16 registers * 32 bytes each = 0.5k. AVX512 is 32 registers * 64 bytes each = 2k; and a napkin-back estimate of the entire rest of the architectural state is just 384 bytes.
Furthermore—I haven’t looked into details here, but—the hardware tracks state dirtying with fairly reasonable granularity so that, on context switch, it doesn’t have to save/restore much more than it has to (see ‘xsaveopt’); it wouldn’t surprise me if this were exposed to the kernel, which could use it to only allocate space for that state when it’s used. Avx512 is definitely useful for scalar code, but you don’t necessarily need the full state gamut; can get away with just the low halves of a few registers.
pthread_t
I wouldn’t expect this to be more than a few words.
TLS region
I don’t know if it is, but I see no reason why this couldn’t be cow.
Furthermore—I haven’t looked into details here, but—the hardware tracks state dirtying with fairly reasonable granularity so that, on context switch, it doesn’t have to save/restore much more than it has to (see ‘xsaveopt’); it wouldn’t surprise me if this were exposed to the kernel, which could use it to only allocate space for that state when it’s used. Avx512 is definitely useful for scalar code, but you don’t necessarily need the full state gamut; can get away with just the low halves of a few registers.
I believe Linux now disables lazy context switching because it introduces some exploitable side channels, but even without that you really don’t want to be allocating memory in a context switch path so there’s alway space reserved for the entire register context.
I wouldn’t expect this to be more than a few words.
I guess that depends a bit on your definition of ‘a few’. I couldn’t find the glibc version, but the FreeBSD one looks to be a significant chunk of a page and a lot of that is space for things that POSIX mandates.
I don’t know if it is, but I see no reason why this couldn’t be cow.
If it is, it will fault immediately, since some of the fields (such as the stack canary) are initialised on thread creation. This is often the same allocation as the thread structure, but if you made it CoW then it would need to be a multiple of page size and that would cause even more overhead. I’m not aware of any libc that has even attempted that.
More commonly, it will either go into an existing leaf or allocate just one more leaf. This means that the page table space used will be either zero or 4 KiB.
In context we’re talking about large numbers of threads, so the outcome is going to be a fraction of the size of a leaf?
No, you still need to reserve the address space, so if you can’t pack them in. Even if it’s only one PTE, you will likely need one page table page for each stack. The higher levels of the table will be amortised though.
Has anyone got any insights into the C# outcome? Issues with the test aside, I was really surprised, is there a gotcha or is it genuinely this low overhead?
Another important point is that all solutions are low overhead. Like, 3GiB for a million concurrent connections is nothing, relatively speaking. Memory isn’t the limiting factor here.
Read the comments on the post. The post actually runs 2 million C# tasks instead of 1 million, so the memory usage is actually half what the blog claimed.
But that just raises even more questions haha! I was surprised to begin with that C# did so well, doubly so if they do so well with 2x the task than the others.
Matklads reply above seems to indicate that it’s just straight up a solid implementation by the .NET folks
I was also pleasantly surprised by those results. I wonder how much of that is a benefit of both owning the OS implementation and the language implementation? I’ll put it on my list of questions for the next time I talk with someone from the CLR group.
The individual cost of Tasks is very low in C#. The reason why memory usage starts high is that the current “standard” server configuration of the .NET GC is to reserve a chunk of memory for the GC for each core. Many .NET server apps will end up allocating a bit of memory, so it can be more efficient to just reserve things up front.
Seems reasonable in my eyes, though I only have 6 months of professional experience with Rust. GPT-4 even gets the dependency on tokio right. I can’t help but think there must be a way to join all the handles at once, but I haven’t looked into it.
on my machine. So it is a little bit less than 1 GiB, which tops the Go code by factor of 2 using the Go code from the example (2660564992, so almost 2.5 GiB) on my machine, and it seems that it would put in on par with Java Virtual Threads.
I was surprised as well, but all of that with almost negligible performance impact and on my machine (with bunch of other stuff going in the background, so take that with grain of salt) Erlang (OTP 25.3.2 without JIT enabled) code was consistently faster by about 2s than Go code (Go 1.20.4).
Presumably Go uses a runtime stack because it’s a simple way to properly order defers inside loops and conditional defers.
Also because defers act as panic (exception) handlers. That go’s defers are function scoped regularly takes people by surprise, especially because it’s easy to accumulate defers in a loop and get what amounts to a resource leak.
I quite intentionally written it in Erlang to omit starting up and launching all the additional applications that Elixir uses (for example logger) to reduce the memory footprint.
“With a little help of ChatGPT I could write such program in a few minutes, even in programming languages I don’t use every day.”
What interests you about that comment
Any reason node v12 was used instead of node v18? I can understand not using node v20, as it only recently came out, but node v12 is very old.
That’s the version in the Ubuntu 22.04 package archives.
I’m not sure that is a good enough reason
Makes me really happy to see node come out in front of the non-compiled on the higher end benchmarks. There’s so much trash talk out there about it as a runtime and a language, but in the async do-a-ton-of-things domain, it totally holds its own.
IIUC, Linux threads are copy-on-write, so launching a bunch of threads is essentially free. You don’t pay until the thread contents diverge.
ISTM you are confusing threads and processes? Processes are COW, threads just share all resources.
What matters for threads is that stack is eagerly mmaped, but lazily faulted into existence.
Ah, thanks, you’re right, I was confusing threads and processes.
That’s OK … Linux confused threads and processes for at least a decade.
Just to add: so, a thread that doesn’t do much will use 1 page (usually 4kB, or 16kB on ARMv8) and about 8MB of address space (glibc default, you can configure it when creating a thread, apparently musl defaults to 128kB). Plus about… I’m guessing roughly 128 bytes ish for the page table entries themselves? Someone please correct me on the size of page table entries, I would love to know.
It’s probably a bit more than that. A thread also has:
I’m not sure what glibc does, some threading implementations allocate the thread stack, the initial-exec / local-exec TLS space, and the thread control structure in the same place, which may let you fit all of them into a page. Linux doesn’t care: the only thing that the clone system call knows about is the start of the stack and the initial code pointer, so you can put everything else on a page after the stack if you’re not worried about stack buffer overflows corrupting the state.
The size of the PTEs is variable and depends a bit on where the stack ends up in the address space. With a five-level page table, if it’s in a completely new place in the address space then it will use four new 4 KiB pages. More commonly, it will either go into an existing leaf or allocate just one more leaf. This means that the page table space used will be either zero or 4 KiB. For an 8 MiB stack, you need 2048 PTEs. Each PTE is (currently) 8 bytes, so that’s 16 KiB. That’s four fully populated page table pages, plus four 8-byte entries on another page (or two if they’re poorly aligned) to point to them.
This is where it gets a bit more interesting. As I recall, on Linux, the page table itself is the canonical data structure but I think it does some sharing. In particular, I believe that full page tables pointing to CoW zero pages are, themselves, a CoW page, so three of these page tables will all be pointing to a single statically-allocated page full of not-present entries. On FreeBSD (and other systems that use a Mach-derived VM subsystem, such as XNU), the page table is not the canonical data structure and is lazily generated from a range tree. The mapping will create a
vm_object_t
and, as pages are touched, this will be populated withvm_page_t
s, which will be used to populate the real page tables.It’s not quite that big. AVX is 16 registers * 32 bytes each = 0.5k. AVX512 is 32 registers * 64 bytes each = 2k; and a napkin-back estimate of the entire rest of the architectural state is just 384 bytes.
Furthermore—I haven’t looked into details here, but—the hardware tracks state dirtying with fairly reasonable granularity so that, on context switch, it doesn’t have to save/restore much more than it has to (see ‘xsaveopt’); it wouldn’t surprise me if this were exposed to the kernel, which could use it to only allocate space for that state when it’s used. Avx512 is definitely useful for scalar code, but you don’t necessarily need the full state gamut; can get away with just the low halves of a few registers.
I wouldn’t expect this to be more than a few words.
I don’t know if it is, but I see no reason why this couldn’t be cow.
I believe Linux now disables lazy context switching because it introduces some exploitable side channels, but even without that you really don’t want to be allocating memory in a context switch path so there’s alway space reserved for the entire register context.
I guess that depends a bit on your definition of ‘a few’. I couldn’t find the glibc version, but the FreeBSD one looks to be a significant chunk of a page and a lot of that is space for things that POSIX mandates.
If it is, it will fault immediately, since some of the fields (such as the stack canary) are initialised on thread creation. This is often the same allocation as the thread structure, but if you made it CoW then it would need to be a multiple of page size and that would cause even more overhead. I’m not aware of any libc that has even attempted that.
Thanks!
In context we’re talking about large numbers of threads, so the outcome is going to be a fraction of the size of a leaf?
No, you still need to reserve the address space, so if you can’t pack them in. Even if it’s only one PTE, you will likely need one page table page for each stack. The higher levels of the table will be amortised though.
The Elixir numbers look unfamiliar to me. Plus linking processes takes some extra toil and perhaps the benchmark should consider this.
Does the Node.js example actually run 1 million tasks, or does it merely start 1 million timers?
This is a great experiment but the dysfunctional grad student in me cries at how useless the graphs are. Give us some deltas, jeez!
Has anyone got any insights into the C# outcome? Issues with the test aside, I was really surprised, is there a gotcha or is it genuinely this low overhead?
There are three classes of programs here:
These go from “more memory”, to “less memory”. So, yeah, C# is in the most memory efficient category.
The Stackless category is further subdivided into how frequently frames are allocated onto the heap. The choices here are:
This subdivision is about CPU overhead, not memory usage; it won’t be visible in the sleep benchmark anyway, as there’s no calls at all.
Another important point is that all solutions are low overhead. Like, 3GiB for a million concurrent connections is nothing, relatively speaking. Memory isn’t the limiting factor here.
Read the comments on the post. The post actually runs 2 million C# tasks instead of 1 million, so the memory usage is actually half what the blog claimed.
But that just raises even more questions haha! I was surprised to begin with that C# did so well, doubly so if they do so well with 2x the task than the others.
Matklads reply above seems to indicate that it’s just straight up a solid implementation by the .NET folks
I was also pleasantly surprised by those results. I wonder how much of that is a benefit of both owning the OS implementation and the language implementation? I’ll put it on my list of questions for the next time I talk with someone from the CLR group.
I think it’s safe to assume that the C# benchmark was run on Ubuntu like the others, since .NET has been portable for a few years.
The individual cost of Tasks is very low in C#. The reason why memory usage starts high is that the current “standard” server configuration of the .NET GC is to reserve a chunk of memory for the GC for each core. Many .NET server apps will end up allocating a bit of memory, so it can be more efficient to just reserve things up front.
Hm, am I missing something or is the Rust code missing in the linked benchmark sources?
With a little help from ChatGPT you, too, can hallucinate benchmark results!
Here’s a temporary pastebin with that program in Rust generated by both ChatGPT’s. Is this good or bad Rust? Or something else entirely?
Seems reasonable in my eyes, though I only have 6 months of professional experience with Rust. GPT-4 even gets the dependency on tokio right. I can’t help but think there must be a way to join all the handles at once, but I haven’t looked into it.
Pasting my comment that was update response to this comment
I was surprised as well, but all of that with almost negligible performance impact and on my machine (with bunch of other stuff going in the background, so take that with grain of salt) Erlang (OTP 25.3.2 without JIT enabled) code was consistently faster by about 2s than Go code (Go 1.20.4).
Changing Go from
to
Improves Go CPU time by about 2X. Go’s defer is a runtime thing, and would have some impact in a micro bench.
For anyone else wandering why this is:
How’s defer pushes the function call onto a stack at runtime and then calls all the functions in the stack at function exit.
Some other languages (like zig) have a defer statement that is executed at scope exit, and these are easily implemented without runtime support.
Presumably Go uses a runtime stack because it’s a simple way to properly order defers inside loops and conditional defers.
Also because defers act as panic (exception) handlers. That go’s defers are function scoped regularly takes people by surprise, especially because it’s easy to accumulate defers in a loop and get what amounts to a resource leak.
Nice, I knew that something must be off.
I translated your code to Elixir, my 3yo macbook reports:
using
I quite intentionally written it in Erlang to omit starting up and launching all the additional applications that Elixir uses (for example
logger
) to reduce the memory footprint.