The article buries the lede: they built a Rust async runtime that is basically a wrapper around Grand Central Dispatch on macOS! That's the operating system's task scheduler, which means they essentially get the dispatch behavior of a native Mac application. Very cool stuff!
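For anyone curious what that looks like mechanically, here's a rough sketch of the core trick (my own guess at the shape, not their actual code): a spawn function plus a Waker whose wake() pushes the task back onto a GCD global queue via dispatch_async_f, so libdispatch decides when and on which worker thread each poll happens. It assumes macOS (libdispatch lives in libSystem) and the futures crate's ArcWake helper.

    // Rough sketch only (assumed shape, not the article's code): a Waker whose
    // wake() re-enqueues the task on a GCD global queue via dispatch_async_f,
    // so libdispatch picks the thread and the moment each poll runs.
    // Assumes macOS and the `futures` crate for the ArcWake helper.
    use std::ffi::c_void;
    use std::future::Future;
    use std::pin::Pin;
    use std::sync::{Arc, Mutex};
    use std::task::Context;

    use futures::task::{waker_ref, ArcWake};

    #[link(name = "System", kind = "dylib")]
    extern "C" {
        fn dispatch_get_global_queue(identifier: isize, flags: usize) -> *mut c_void;
        fn dispatch_async_f(
            queue: *mut c_void,
            context: *mut c_void,
            work: extern "C" fn(*mut c_void),
        );
    }

    struct Task {
        future: Mutex<Pin<Box<dyn Future<Output = ()> + Send>>>,
    }

    impl ArcWake for Task {
        fn wake_by_ref(arc_self: &Arc<Task>) {
            // Waking just means "hand the task back to GCD".
            schedule(arc_self.clone());
        }
    }

    fn schedule(task: Arc<Task>) {
        unsafe {
            // 0, 0 = default-priority global queue, no flags.
            let queue = dispatch_get_global_queue(0, 0);
            dispatch_async_f(queue, Arc::into_raw(task) as *mut c_void, poll_task);
        }
    }

    extern "C" fn poll_task(context: *mut c_void) {
        // Reclaim the Arc we turned into a raw pointer in schedule().
        let task = unsafe { Arc::from_raw(context as *const Task) };
        let waker = waker_ref(&task);
        let mut cx = Context::from_waker(&waker);
        // Poll on whatever GCD worker thread we were handed; if it's Pending,
        // the future's stored waker will dispatch it again later.
        let _ = task.future.lock().unwrap().as_mut().poll(&mut cx);
    }

    pub fn spawn(fut: impl Future<Output = ()> + Send + 'static) {
        schedule(Arc::new(Task { future: Mutex::new(Box::pin(fut)) }));
    }

    fn main() {
        spawn(async { println!("polled on a GCD worker thread") });
        // Keep the process alive long enough for GCD to run the task.
        std::thread::sleep(std::time::Duration::from_millis(100));
    }

A real runtime needs much more than this (timers, I/O readiness sources, main-queue integration), but the scheduling handoff to the OS-managed thread pool is the interesting part.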
Ah this is quite interesting actually. The major dichotomy I see in languages tends to be 1:1 threading (or just “threads”) and M:N threading (often called “async”; M userspace threads multiplexed over N kernel threads, like green threads, goroutines, etc.). I’ve also seen async refer to M:1 threading (coroutines/state machines/event loops on a single thread like Nginx or Node.js).
I hadn’t had much opportunity to empirically compare the different approaches myself, but the main argument I heard in favor of 1:1 threading is that an M:N model creates a “left hand doesn’t know what the right hand is doing” situation, frustrating the kernel’s ability to take advantage of its knowledge when scheduling, and that avoiding context switches or general threading overhead isn’t the bugbear it used to be anyway. So seeing what looks like an additional option here—using Rust’s async to schedule threads M:N, but with the kernel still steering the scheduling—sounds appealing.
This does remind me of Google’s switchto proposal alongside futexes to facilitate user-space threading as well.
I’m not particularly well informed or up-to-date on these things so happy to hear if I accidentally misrepresented something.
M:N threading (often called “async” […] I’ve also seen async refer to M:1 threading
Asynchronicity is orthogonal to threading. For extra confusion, “async” these days tends to mean async/await (state machines built from “normal looking” code), as opposed to stackful coroutines.
with the kernel still steering the scheduling
Ehhh, AFAIK this isn’t any more kernel-based than any other common scheduler. libdispatch is just Apple’s userspace library that provides a scheduler and other concurrency stuff, and this article describes using it because it’s “platform native” and has better integration with Apple’s frameworks and the power management and whatnot.
This sounds roughly equivalent to running on glib’s event loop which would be “platform native” for a GNOME application.
Either way, it is indeed great to see the flexibility of Rust’s async/await in action!
One perhaps-not-so-well-named innovation arriving in the server space is a little “more kernel-based” in a way: the “thread-per-core” runtimes. Basically: lots of server applications are pretty much shared-nothing, so let’s just spin up completely independent event loops on $num_cpus threads, and (here’s the kernel part) let SO_REUSEPORT distribute requests between them.
(The linked runtimes also all switch to completion-based I/O—because io_uring go fast wooo—which results in ecosystem compatibility pain due to different read/write interfaces; but you can also replicate the architecture itself in a few lines with Tokio as the TechEmpower benchmark entries do)
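Concretely, the whole shape fits in a screenful, something like this (a sketch assuming Tokio's current-thread runtime and its TcpSocket for setting SO_REUSEPORT; the port and the echo handler are placeholders, not any of the linked runtimes' actual code):

    use std::net::SocketAddr;

    fn main() {
        let addr: SocketAddr = "0.0.0.0:8080".parse().unwrap();
        let cores = std::thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);

        let handles: Vec<_> = (0..cores)
            .map(|_| {
                std::thread::spawn(move || {
                    // One fully independent single-threaded event loop per core;
                    // no shared state and no work stealing between them.
                    tokio::runtime::Builder::new_current_thread()
                        .enable_all()
                        .build()
                        .unwrap()
                        .block_on(async move {
                            // Each thread binds its own listener with SO_REUSEPORT,
                            // so the kernel spreads incoming connections across them.
                            let socket = tokio::net::TcpSocket::new_v4().unwrap();
                            socket.set_reuseport(true).unwrap();
                            socket.bind(addr).unwrap();
                            let listener = socket.listen(1024).unwrap();

                            loop {
                                let (conn, _peer) = listener.accept().await.unwrap();
                                // The accepting thread keeps the connection; a trivial
                                // echo stands in for real request handling.
                                tokio::spawn(async move {
                                    let (mut rd, mut wr) = conn.into_split();
                                    let _ = tokio::io::copy(&mut rd, &mut wr).await;
                                });
                            }
                        });
                })
            })
            .collect();

        for handle in handles {
            handle.join().unwrap();
        }
    }

A work-stealing runtime would instead share its task queues across threads; here the only balancing is the kernel distributing new connections among the listeners.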
For extra confusion, “async” these days tends to mean async/await (state machines built from “normal looking” code), as opposed to stackful coroutines.
You’re right, but usually people contrast async/await with virtual threads, not stackful coroutines, which are different constructs. The grandparent is more confused though, in that there’s no connection between the interface exposed to users (virtual threads, stackful/stackless coroutines, actors, whatever) and the scheduling mechanism (1:1, N:1, M:N, etc).
Basically: lots of server applications are pretty much shared-nothing, so let’s just spin up completely independent event loops on $num_cpus threads, and (here’s the kernel part) let SO_REUSEPORT distribute requests between them.
But as I wrote in the article you linked: a lot of the time these cite benchmarks that do a well-balanced amount of work on each connection, so load balancing by distributing connections evenly with SO_REUSEPORT is an effective strategy. Unless your real workload also has that property, it's questionable to cite these benchmarks in contrast to a more responsive (but also higher constant overhead) strategy like work-stealing.
That is to say, it's not really about being shared-nothing (though that's a prerequisite for avoiding coordination between threads); it's about whether merely balancing connections is an effective strategy for balancing work in your application.
Right, it’s definitely not a property every workload has, but I feel like it’s not uncommon at all. For web app backends especially it’s typical for all request handling to be essentially “fire off queries to DB/cache/third-party-service/etc, await, template/serialize, return” which results in little variety of work per request (and so—with some handwaving about clients doing roughly similar numbers of requests—per connection too). So to me it makes sense that that approach is getting some traction.
…oh, and of course the thing is that it’s kind of a tried-and-true architecture for web apps: all those GIL-burdened scripting language interpreters were often being scaled in an OS process per core way (or “just some larger number of worker processes than cores” when each process was running fully synchronously instead of having an async event loop). Those got away with merely balancing connections!
I don't believe it's common for web clients to perform roughly similar numbers of requests per connection, and if the number of requests performed per connection is not the same, that is also an imbalance of work even for stateless services.
Your comment about “GIL-burdened scripting language interpreters … running fully synchronously instead of having an async event loop” seems confused. You’re conflating multiple categories of architecture: some spawn a process per core and use async IO, some spawn a thread per connection and use blocking IO. In the latter case they do use work-stealing because the OS will dynamically balance those threads across cores so that cores don’t sit idle. Some even spawn a process per connection, so they don’t share state but they use OS-level work stealing. Regardless, these are hardly relevant to a comparison of architectures that both outperform all of these.
So yes if you were to assume these things you might reach a certain conclusion, but they seem counterfactual to me so I don’t know why we’d engage in this thought exercise.
IIUC, they're saying that even if the requests per connection vary greatly, the requests themselves require little active CPU time and just orchestrate more IO ("fire off more queries"); in other words, they're so small, or so dependent on asynchronous waiting, that it's not worth introducing work-stealing to dynamically spread processing load.
Anecdotally, I've found this to be true as well; services which function more as high-perf "routing" + simple business logic scale just as well, and sometimes better, by sticking to a single thread + non-blocking IO. And this describes most backends which handle database and sub-service calls.
When heavy processing is needed they can offload that work to a thread pool before returning back to the IO thread. I believe the pool being mixed with the IO thread is an unfortunate programming & perf burden that exists to avoid straying from opaque OS threads as a concurrency model.
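A minimal sketch of what I mean, assuming Tokio (the Report type and compute_report are placeholders):

    use std::time::Duration;

    #[derive(Debug)]
    struct Report(u64);

    // Stand-in for CPU-heavy work that would otherwise stall the event loop.
    fn compute_report(input: Vec<u64>) -> Report {
        Report(input.iter().copied().sum())
    }

    async fn handle_request(input: Vec<u64>) -> Report {
        // Cheap async orchestration stays on the IO thread (simulated I/O here).
        tokio::time::sleep(Duration::from_millis(5)).await;

        // Ship the heavy processing to the blocking pool so this thread keeps
        // servicing other connections; when it finishes we resume back here.
        tokio::task::spawn_blocking(move || compute_report(input))
            .await
            .expect("blocking task panicked")
    }

    fn main() {
        // Single-threaded runtime to mirror the single IO thread described above.
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .unwrap();
        let report = rt.block_on(handle_request(vec![1, 2, 3]));
        println!("{report:?}");
    }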
That article was informative, thank you. I mainly wanted to contrast user-space vs kernel scheduling implementations for threads (virtual or otherwise, including stackless coroutines with explicit async/await points for linear-looking code), but conflated all of "async" and user-threading together, which wasn't very accurate. E.g. goroutines aren't usually referred to as async, just as a concurrency construct that largely presents itself as threads with transparent preemption, where M:N threading is an implementation detail.
The most outspoken opponent of M:N user-space scheduling I recall was Bryan Cantrill, though perhaps that was more specifically about work stealing. I don't recall the argument exactly, and don't necessarily agree; mostly it left me thinking that perhaps my assumptions about the overhead of kernel threads were outdated. Nonetheless I've been happily using async Rust and haven't been working in the kind of space where getting into the weeds of optimizing these things has mattered much.
I guess I'm honored to be the model's most outspoken opponent? I reflected on this a while back[0] – and now 16 years further down life's path (!), I stand by not just my thoughts on the M:N threading model, but also (obviously?) transactional memory, which is rightfully in history's dustbin.
[0] https://bcantrill.dtrace.org/2008/11/03/concurrencys-shysters/
The most outspoken opponent of M:N user-space scheduling I recall was Bryan Cantrill
I wonder if that was about the 1990s Solaris threads, or something more recent.
That old Solaris M:N threading implementation was crippled because they didn’t provide enough new kernel facilities for the userland part to be able to work well, and the userland part didn’t do enough to compensate. So for instance, all filesystem ops were blocking and were not offloaded to worker threads.
+1 for using the platform!
this also explains why I’m patiently awaiting the windows version… but I imagine it’ll be a similar approach, to use the system’s event scheduler!