GHC’s runtime is based around an M:N threading model which is designed to map a large number (M) of lightweight Haskell threads onto a small number (N) of heavyweight OS threads. […] To cut to the chase, we ended up increasing N to be the same as M (or close to it), and this bought us an extra 10-20% throughput per machine.
Ah, yes. As a Go runtime developer, I really wish Go moved to a 1:1 threading model.
Modern kernels deal with large numbers of threads just fine. People misjudge how cheap kernel threads really are because there was a time when they were expensive. That time has long passed. Today, threads are very lightweight (on operating systems that matter). For example, the Solaris kernel schedules interrupts as threads; they really are that lightweight.
There are good reasons POSIX threading is implemented in the kernel on every relevant operating system (finally we have it on OpenBSD too!), even if they all started as N:M or 1:M implementations.
Modern kernels deal with large numbers of threads just fine.
Do you have any performance numbers to back this up?
The last time I saw this claim made, someone ran some tests on HN, and the M:N model blew away what POSIX threads on Linux could do in terms of thread count. That was without even counting the cost of context switches. In the benchmarks I’ve seen, in order to get a large number of threads you also need to fiddle with stack sizes and make sure your code never blows the stack, compared to something like Go or Erlang, which resize stacks dynamically.
It could be that particular use cases suit 1:1 well enough, but as a general comparison, everything I’ve seen shows M:N beating 1:1 in the number of threads that can be run.
I don’t have any performance numbers, but if you want some anecdotes that agree with you:
Intel TBB uses N:M (and Intel are obviously also the ones building the CPUs), and practically every big game released today has some kind of N:M job system, e.g. Destiny, Handmade Hero, Naughty Dog’s engine.
I don’t have hard numbers on hand, but when “large numbers” exceeds the number of cores, performance declines seem to be small. (My understanding is that this wasn’t always true and it was considered bad practice to have more active threads than cores.) As in, you don’t get killed running 48 threads on a 4-core box and may not need to use thread pools.
When people treat OS threads as if they were free and spawn hundreds of thousands of them, the story is different. Sometimes that can happen unintentionally when higher-level abstractions (e.g. actors) are used beyond what they were intended for.
That matches my non-scientific experience. My general rule of thumb is OS threads for throughput, userland threading (or an event loop) for latency. I spend most of my time being I/O bound, so I use event loops or userland threading. I haven’t seen any indication that this is the wrong choice, or that OS threads are competitive in the I/O-bound world as the base unit of concurrency.
This fits my experience with GHC’s RTS as well.