1. 15
  1. 16

    The reason userspace threads are interesting, performance-wise, is not memory usage but context switching. I can switch userspace threads in tens of clock cycles; kernel threads have no chance of competing. I therefore find this benchmark fairly uninteresting.

    1. 3

      I would sort-of disagree here. I am pretty sure that the main reason people reach for green threads is that you can just spawn 10 million of them, while for kernel threads, in practice, you need to do a bunch of sysadmin work to get even to 100k. I’m not sure whether that’s “performance” or “robustness”, but to me it seems like the biggest reason. Why that is the case I still don’t know, beyond the negative fact that it’s not memory.

      1. 2

        Is there really no way the kernel could provide a thread-style abstraction that was just as good, though?

        1. 6

          Depends on what exactly a ‘kernel’ is. It is possible you could come up with a novel hardware design which made context switches cheap (e.g. I think Itanium could do it in ~27 clock cycles?). And if you do memory protection in software rather than hardware, then context switches can be done entirely in software, and the performance impact will be similar to that of userspace threads.

          Operating under the same constraints as contemporary kernels, though, there is no way.

          1. 1

            I’m not sure why you are saying software outperforms hardware here. Could you elaborate, please?

            1. 6

              Modern operating systems implement a capability-based security policy enforced using:

              1. A set of trusted interfaces (‘kernel’, ‘syscalls’)

              2. Hardware memory mappings, to prevent privileged operations from being performed except through the trusted interfaces

              The former mechanism (and the policy it implements) is the interesting part of this picture; the latter mechanism is purely instrumental. Implementation of the latter mechanism involves creation of a unique hardware security context for every logical security domain (process). And switching between hardware security contexts is quite slow. This means that switching between logical security domains will also be slow (and in this case, the implementation of the trusted interface is its own security domain, so any time you have to go through it, you also have to switch hardware security contexts).

              The reason that it’s necessary to create hardware security contexts is that userspace programs are written in machine code, and the implementation strategy for that machine code is to feed it directly to the CPU. But machine code is able to forge pointers and has no native notion of capability safety. It does, however, have a very basic notion of memory safety: access to unmapped memory causes an exception to be thrown; this coarse ability can be used to implement the desired security policy.
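
              To make ‘forging’ concrete, here is a minimal C sketch (the address is made up, of course): nothing in the language or the machine code stops you from conjuring a pointer out of an integer; only the hardware memory mapping catches the access.

              ```c
              #include <stdint.h>

              int main(void) {
                  /* Forge a pointer: acquire an address by no legitimate means. */
                  volatile int *p = (volatile int *)(uintptr_t)0xdeadbeef;
                  /* The CPU executes the load regardless; if the page is unmapped,
                   * the hardware faults and the kernel kills the process. */
                  return *p;
              }
              ```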

              But now suppose our userspace programs are written in a language which is unable to forge pointers (where by ‘forge’ I mean ‘acquire, by other means than through the trusted interface or through existing unforged pointers’). Say, Java. Then it would not be necessary to create distinct hardware security contexts for each logical security domain, since all pointer accesses occurring within a given security domain would, of necessity, be well-formed by construction wrt the security policy.

          2. 6

            There are two ways in which cooperative threading outperforms preemptive threading:

            The amount of state that needs saving is much smaller. A preemptive thread switch needs to save the full register contents. On a modern system with a load of vector registers, that’s KiBs of data. In contrast, a cooperative thread needs to save only callee-save registers. That’s typically a much smaller amount of state. The caller is responsible for saving everything else and, often, doesn’t need to because it’s not in use.

            Cooperative threading can yield directly to another thread, without invoking the scheduler (which, in turn, acquires locks and does a bunch of work that is considerably more expensive than a function call).

            The downside, of course, is that a cooperative-threading model is vulnerable to starvation if a thread doesn’t yield sufficiently often. A lot of modern systems aim for a hybrid: cooperative scheduling for the common case, preemptive scheduling as a fallback after detecting starvation.
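
            As a concrete (if unoptimized) illustration of yielding directly to another thread, here is a sketch using the portable POSIX ucontext API. Note that swapcontext saves more than a hand-rolled switch needs to (glibc’s version even makes a syscall to preserve the signal mask), which is one reason real green-thread runtimes roll their own.

            ```c
            #include <stdio.h>
            #include <ucontext.h>

            static ucontext_t main_ctx, co_ctx;
            static char co_stack[64 * 1024];

            static void coroutine(void) {
                puts("coroutine: first run");
                swapcontext(&co_ctx, &main_ctx);   /* yield directly back */
                puts("coroutine: resumed");
            }

            int main(void) {
                getcontext(&co_ctx);
                co_ctx.uc_stack.ss_sp = co_stack;
                co_ctx.uc_stack.ss_size = sizeof co_stack;
                co_ctx.uc_link = &main_ctx;        /* where to go when it returns */
                makecontext(&co_ctx, coroutine, 0);

                swapcontext(&main_ctx, &co_ctx);   /* switch to the coroutine */
                puts("main: coroutine yielded");
                swapcontext(&main_ctx, &co_ctx);   /* resume it */
                puts("main: coroutine finished");
                return 0;
            }
            ```

            No scheduler is involved at any point: each swapcontext is a direct transfer of control.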

            1. 2

              Yeah, cooperative scheduling of concurrent threads is not robust. Even with safepoints, I would be somewhat wary. However, I think it’s important to distinguish userspace threading as an implementation strategy from coroutines as a programming model. See Python’s generators as a prominent, representative example of the latter.

            2. 5

              Google did something sort of in this vein, using kernel threads but exposing an extra API so that the running thread could request that a specific thread run next, both avoiding the work the kernel scheduler does and potentially giving userspace a chance to do smart things (e.g. ‘message’ a thread and immediately activate it). If you’re doing a pure context-switch benchmark it can’t beat jmp, of course, but they apparently found it useful.

              A talk is here and some code they published is here.

              1. 3

                Try looking at it this way:

                “The kernel” on Linux, Windows, (and so on) isn’t special[1]: It’s just another thread.

                Can we transition from thread A to kernel-thread K to thread B as fast as we can transition from thread A directly to thread B? The answer is no, because A+B < A+B+K.

                [1]: It’s not special, but it is generic, so it has to handle A→B as easily as B→A and any other transition. Doing that is expensive too, so it’s sometimes useful to consider the fact that you’re doing this A→K→A transition every time you make a system call. If you’re curious what kind of kernel changes are needed to avoid this kind of thing, just consider this very simple A→K→A transition and then go look at the io_uring documentation.
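
                The idea, in a hedged sketch using liburing (link with -luring; io_uring_prep_nop stands in for real I/O): queue up many operations and pay for a single A→K→A transition when submitting the lot.

                ```c
                #include <liburing.h>
                #include <stdio.h>

                int main(void) {
                    struct io_uring ring;
                    io_uring_queue_init(8, &ring, 0);

                    /* Queue four operations without entering the kernel. */
                    for (int i = 0; i < 4; i++) {
                        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                        io_uring_prep_nop(sqe);
                        sqe->user_data = i;
                    }
                    io_uring_submit(&ring);        /* one A→K→A transition */

                    for (int i = 0; i < 4; i++) {
                        struct io_uring_cqe *cqe;
                        io_uring_wait_cqe(&ring, &cqe);
                        printf("completed op %llu\n",
                               (unsigned long long)cqe->user_data);
                        io_uring_cqe_seen(&ring, cqe);
                    }
                    io_uring_queue_exit(&ring);
                    return 0;
                }
                ```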

                1. 4

                  That’s true, but misses the point (‘not even wrong’).

                  If I do two userspace thread switches, that’s still going to be way cheaper than a single kernel thread switch. (In your nomenclature, A+B+C < A+B+K. A+B+C+D+E+F+G+H < A+B+K, even.) The issue is that switching hardware security contexts is very expensive, and such context switches are not necessary when doing userspace threading.

                  1. 2

                    If I do two userspace thread switches, that’s still going to be way cheaper than a single kernel thread switch. (In your nomenclature, A+B+C < A+B+K. A+B+C+D+E+F+G+H < A+B+K, even.) The issue is that switching hardware security contexts is very expensive, and such context switches are not necessary when doing userspace threading.

                    While there is overhead here, I remember seeing some work from Google which measured the cost of entering the kernel and found it negligible compared to the cost of selecting the next thread to run (~100 ns out of a ~3 µs switch). A lot of that time goes to sending interrupts to threads running on other CPUs.

                    1. 1

                      But of course, if you have two kernel threads, they can complete faster than one kernel thread running two userspace tasks, because we have lots of cores. So even though I agree with you that kernel threads can’t replace user threads, I can’t agree (if you are suggesting it) that user threads can always replace kernel threads either; especially when sum[max x+k] < sum raze x, which it usually is in compute-heavy operations.

                      1. 1

                        I completely agree. I also, as mentioned else-thread, find hardware preemption interesting for robustness reasons, preferring it to safepoints (completely ignoring explicit yields).

                        (That said, I also find the hardware memory protection mechanisms associated with these ‘kernel threads’ overly heavy-handed; a sentiment which, given your experience with kos, I expect you might be inclined to agree with.)

                  2. 2

                    Co-operative context switching is cheap if you take advantage of the system calling convention. On 32-bit Linux, all you need to save is four callee-saved registers, in a few instructions, while on 64-bit Linux it’s six registers and a couple of instructions each way. This is way less state than pre-emptive context switching requires. It also avoids a trip through the kernel, saving even more time.
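
                    Something like this hypothetical x86-64 sketch (System V ABI, GNU toolchain; not any particular runtime’s actual code): push the six callee-saved registers, swap stack pointers, and pop on the other side. Everything else is the caller’s problem, by convention.

                    ```c
                    void switch_stack(void **save_sp, void *load_sp);

                    __asm__(
                        ".globl switch_stack\n"
                        "switch_stack:\n"
                        "    push %rbp\n"
                        "    push %rbx\n"
                        "    push %r12\n"
                        "    push %r13\n"
                        "    push %r14\n"
                        "    push %r15\n"
                        "    mov  %rsp, (%rdi)\n"  /* *save_sp = old stack pointer */
                        "    mov  %rsi, %rsp\n"    /* adopt the other thread's stack */
                        "    pop  %r15\n"
                        "    pop  %r14\n"
                        "    pop  %r13\n"
                        "    pop  %r12\n"
                        "    pop  %rbx\n"
                        "    pop  %rbp\n"
                        "    ret\n"                /* 'returns' into the other thread */
                    );
                    /* A fresh thread's stack must be pre-seeded with six dummy
                     * register slots plus an entry address for ret to consume. */
                    ```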

                  3. 1

                    Yes, but doesn’t it also fight with the kernel scheduler, which doesn’t know about the userspace scheduling that goes on with goroutines and such?

                    Goroutines and other lightweight coroutines are great for implementing concurrency models, but for performance I think it is better to use thread-per-core with very little shared data, exchanging messages when needed (a minimal sketch follows below). As they say, code runs on the CPU and the kernel keeps interrupting it; so much so that the number of steps one typically has to take for low-latency systems is mind-boggling [0]. Having another layer is only going to make it worse, no?

                    [0] https://rigtorp.se/low-latency-guide/
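
                    To make the thread-per-core pattern concrete, a minimal sketch (Linux-specific, error handling elided) that pins one worker to each online core with pthread_setaffinity_np:

                    ```c
                    #define _GNU_SOURCE
                    #include <pthread.h>
                    #include <sched.h>
                    #include <stdio.h>
                    #include <unistd.h>

                    static void *worker(void *arg) {
                        long core = (long)arg;
                        cpu_set_t set;
                        CPU_ZERO(&set);
                        CPU_SET(core, &set);
                        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
                        printf("worker pinned to core %ld\n", core);
                        /* ... per-core event loop; exchange messages via queues ... */
                        return NULL;
                    }

                    int main(void) {
                        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
                        pthread_t tids[ncores];
                        for (long i = 0; i < ncores; i++)
                            pthread_create(&tids[i], NULL, worker, (void *)i);
                        for (long i = 0; i < ncores; i++)
                            pthread_join(tids[i], NULL);
                        return 0;
                    }
                    ```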

                    1. 3

                      fight with the scheduler

                      That’s a more subtle question than I can do justice to, but for user-mode scheduling, the pattern of system calls the kernel sees is mostly the same as you would see for, say, a callback-driven event-loop runtime.

                      I think it is better to use thread-per-core with very little shared data

                      That can be good, too. That said, the choice of whether to use user-mode cooperative threads (e.g. instead of callbacks/coroutines) can be orthogonal to the scheduling model. In theory you could have a thread per core, but guarantee that user-mode threads are (by default) scheduled on the local kernel thread (as opposed to multiplexing M user-mode threads on top of N kernel threads).

                      But to be fair, I’ve not heard of anyone doing that.

                      Having another layer is only going to make it worse

                      Depends on the trade-offs you want to make. As someone else said, user-mode context switches can be very cheap, maybe in the same ballpark as an indirect function call.

                  4. 10

                    This seems like a…very poor comparison IMO:

                    • Rust is going to use very different amounts of memory than Go, the latter having a GC, so it’s hard to tell how much memory usage actually comes from the threading system and how much from extra overhead incurred in Go land. I.e., this is more likely comparing two different languages’ memory models than anything else. A more interesting comparison might be system threads vs Tokio tasks, exclusively in Rust land.
                    • “A thread is only 3 times as large as a goroutine.” That’s…quite an amount. Goroutines are a very widely used construct in Go, and you’re encouraged to spawn new ones as needed instead of, say, trying to shuffle tasks between threads. As a result, it’s not particularly unlikely you’d end up with a large number of goroutines relative to the work each one is actually doing.
                    • This is, as the article implies, a very manufactured workload, and the performance characteristics will likely look different when the actual things being waited on are not just I/O.
                    • This update inside the article is pretty important:

                      As pointed out in comments, using solely RSS to compare memory usage of goroutines and threads is wrong. Thread bookkeeping is managed by the kernel, using kernel’s own data structures, so not all overhead is accounted for by RSS. In contrast, goroutines are managed by the userspace, and RSS does account for this. In particular, 10k threads with default stack sizes need about 40mb of page tables to map virtual memory.

                    1. 21

                      “A thread is only 3 times as large as a goroutine.” That’s…quite an amount.

                      Yes – but it’s a good reality check for people who program like it’s the 1990s and worry about the overhead of 10 threads. I think most people’s guess at the cost of a thread is much higher than it actually is.
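
                      A quick reality check along those lines (a rough sketch; results will be noisy, but the order of magnitude is the point): time how long spawning and joining those 10 threads actually takes.

                      ```c
                      #include <pthread.h>
                      #include <stdio.h>
                      #include <time.h>

                      static void *noop(void *arg) { return arg; }

                      int main(void) {
                          struct timespec t0, t1;
                          pthread_t tids[10];

                          clock_gettime(CLOCK_MONOTONIC, &t0);
                          for (int i = 0; i < 10; i++)
                              pthread_create(&tids[i], NULL, noop, NULL);
                          for (int i = 0; i < 10; i++)
                              pthread_join(tids[i], NULL);
                          clock_gettime(CLOCK_MONOTONIC, &t1);

                          long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                                       + (t1.tv_nsec - t0.tv_nsec);
                          printf("10 threads created+joined in %lld µs\n", ns / 1000);
                          return 0;
                      }
                      ```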

                    2. 7

                      FWIW, I still don’t know what the actual limiting factor is for having millions of Linux threads. I am still pretty sure it’s not memory, and I know that, in practice, you’ll run into various artificial limits which you can raise. But I still have no idea what the actual limiting factor is, the one you can’t just configure away.
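
                      One way to hunt for that factor empirically (a hedged sketch; the ceilings listed below are the usual Linux suspects, not a complete list): keep creating minimal-stack threads until the kernel says no, and see which error you get. Limits worth raising along the way include RLIMIT_NPROC (ulimit -u), /proc/sys/kernel/threads-max, kernel.pid_max and vm.max_map_count.

                      ```c
                      #include <limits.h>
                      #include <pthread.h>
                      #include <stdio.h>
                      #include <string.h>
                      #include <unistd.h>

                      static void *idle_thread(void *arg) {
                          (void)arg;
                          pause();                       /* block forever */
                          return NULL;
                      }

                      int main(void) {
                          pthread_attr_t attr;
                          pthread_attr_init(&attr);
                          pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN);
                          pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

                          for (long n = 0; ; n++) {
                              pthread_t tid;
                              int rc = pthread_create(&tid, &attr, idle_thread, NULL);
                              if (rc != 0) {
                                  printf("stopped at %ld threads: %s\n", n, strerror(rc));
                                  return 0;
                              }
                          }
                      }
                      ```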

                      1. 3

                        A million threads on a 10-core machine gives each thread about 10 µs of CPU time every second (10 cores × 1 s ÷ 1,000,000 threads), so at some point you’re simply spending more time switching tasks than you can spend doing any real work.

                        You simply cannot configure away the speed of light…

                        1. 8

                          In most of the designs you see with massive numbers of goroutines, those goroutines are not all running at the same time; many of them are blocked on IO or otherwise waiting for something to happen. The point of making concurrency so lightweight is not CPU parallelism; it’s to allow concurrent design patterns that wouldn’t be practical using OS threads due to e.g. context-switch performance (memory is secondary).

                          1. 1

                            I don’t disagree with any of that. I was only addressing the question of what the limiting factor is for having a million Linux threads (let alone several million) on a small number of cores, and that limiting factor is in fact the speed of light.

                            1. 2

                              Note that your argument works for both goroutines and threads; it seems that it proves too much :-)

                        2. 1

                          I am pretty sure you can have a million threads on Linux. A decade ago I remember managing to get a few hundred thousand. Of course this won’t be on your laptop with 4 GB of RAM, and you may need to set a pretty small stack size for each thread (which would make them not very useful as general-purpose threads).

                          1. 1

                            Hm, why would the stack size matter (assuming a 64-bit address space and enough physical memory to hold the few kilobytes per thread of the VM mapping itself)?

                            1. 1

                              Perhaps it no longer does. I distinctly remember I had to limit the stack size but I don’t remember why (and I am pretty sure I was running on 64-bit). It shouldn’t be difficult to recreate this test if you have access to beefy enough hardware.

                        3. 4

                          Isn’t the point of goroutines, i.e., thoroughly supported userspace threading, that userspace scheduling avoids the cache-purging syscalls involved in kernel thread scheduling?