The paper looks great. It’s worth noting that it isn’t a particularly new idea. QNX, for example, has had an asynchronous system call model for a very long time (decades) and 20 years ago they were making the same claims: you can do lightweight asynchronous message passing of a set of system call request/response pairs between a running userspace process and a running kernel thread on different cores. They showed a similar speedup versus *NIX systems.
Hypervisors learned from this and try very hard to reserve hypercalls for control-plane operations or things that absolutely have to be synchronous. Xen uses a lockless ring buffer for all device I/O: the unprivileged guest puts requests into the ring, and the privileged dom0 (or a separate driver domain) pulls them out and pushes the responses back after doing the I/O. You only ever make a hypercall when the ring transitions into or out of the full or empty state (if it becomes full, the producer sleeps; if it becomes empty, the consumer sleeps; if it becomes non-full, the consumer notifies the producer to wake; if it becomes non-empty, the producer notifies the consumer to wake). These are very simple hypercalls (just move a VCPU in or out of a scheduler queue) and so don’t disrupt execution much.
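The notification rule above is the whole trick, so here is a minimal single-threaded sketch of it in Python (the `NotifyRing` name and counter are mine, not Xen's; the real rings are shared-memory structures with memory barriers, and the "notification" is an event-channel hypercall):

```python
from collections import deque

class NotifyRing:
    """Bounded SPSC ring where notifications (stand-ins for hypercalls)
    fire only when the ring crosses into or out of the empty/full states,
    mirroring the Xen I/O ring convention described above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()
        self.notifications = 0  # counts simulated wake-up hypercalls

    def produce(self, req):
        assert len(self.buf) < self.capacity, "producer should have slept"
        was_empty = not self.buf
        self.buf.append(req)
        if was_empty:
            # empty -> non-empty: producer notifies the consumer to wake
            self.notifications += 1

    def consume(self):
        was_full = len(self.buf) == self.capacity
        resp = self.buf.popleft()
        if was_full:
            # full -> non-full: consumer notifies the producer to wake
            self.notifications += 1
        return resp
```

In the steady state, where the ring is neither full nor empty, requests and responses flow with no notifications at all; that is why batched I/O over such a ring amortizes the hypercall cost so well.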
I was confused because this article references relatively recent things like io_uring, and I didn’t notice at first that the original paper is from 2010. The ideas were a lot newer back then.
I’ve been wondering why we don’t do this for about 20 years now; I just assumed it was tried and firmly rejected and I was an idiot :)
The main reason is a combination of latency and sequential programming models.
A system call is lower latency than anything that involves the scheduler, so having a kernel thread pick up a queue of system calls and run them asynchronously with respect to the userspace thread may give better throughput, but it will also add latency. You can improve the throughput of the synchronous model by adding more threads that do synchronous calls; you can’t easily improve the tail latency of the asynchronous model.
The UNIX ecosystem is dominated by C and languages with a C-like abstract machine. It is very difficult to make efficient use of an asynchronous model in C because the entire language is structured around synchronous call-return. You end up in a callback mess or you end up with a lot of code explicitly waiting for an asynchronous operation to complete. If you start with a language that exposes asynchronous abstractions, this is a lot easier. QNX had an asynchronous model from the start and it was fantastic for code written for it and pretty bad for running code written for *NIX.
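The contrast is easy to see in a language that has both styles. A hedged sketch in Python (the `fake_read`/`fake_write` helpers are invented stand-ins for asynchronous system calls, not any real API): the callback version is roughly the shape a C program has to hand-build around a completion queue, while the `async`/`await` version expresses the same pipeline sequentially.

```python
import asyncio

# Stand-ins for asynchronous operations (hypothetical names, not a real API).
async def fake_read():
    await asyncio.sleep(0)          # pretend the completion arrives later
    return b"hello"

async def fake_write(data):
    await asyncio.sleep(0)
    return len(data)

# Callback style: each step continues in a separate function, the control
# flow a C program must build explicitly around completion notifications.
def read_then_write_cb(done):
    def on_read(t):
        w = asyncio.ensure_future(fake_write(t.result()))
        w.add_done_callback(lambda t2: done(t2.result()))
    asyncio.ensure_future(fake_read()).add_done_callback(on_read)

async def run_callback_version():
    fut = asyncio.get_running_loop().create_future()
    read_then_write_cb(fut.set_result)
    return await fut

# Language-level async: the same pipeline reads top to bottom.
async def read_then_write():
    data = await fake_read()
    return await fake_write(data)
```

Both versions compute the same result, but only because Python gives the second one syntax for suspending mid-function; C has no equivalent, which is why asynchronous kernel interfaces are awkward to consume from C without callbacks or explicit completion polling.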
That said, Linux now has io_uring, which has a lot of these advantages. It’s worth noting, though, that one of the big wins from io_uring has nothing to do with the asynchronous nature. POSIX requires that any newly allocated file descriptor use the lowest unallocated file descriptor number. In a multithreaded program, this imposes a significant synchronisation burden on the kernel. With io_uring, there’s a per-ring space for file descriptors, so these can be entirely local operations.
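The lowest-unallocated-descriptor rule is easy to observe from userspace. A small Python sketch (assuming a POSIX system where `/dev/null` exists) showing that the kernel must reuse the lowest free slot, which is exactly the global search that needs synchronising in a multithreaded process:

```python
import os

# POSIX requires open() to return the lowest-numbered unused descriptor,
# so every allocation is a search over the process-wide fd table.
fd1 = os.open("/dev/null", os.O_RDONLY)
fd2 = os.open("/dev/null", os.O_RDONLY)
os.close(fd1)                        # free the lower-numbered slot
fd3 = os.open("/dev/null", os.O_RDONLY)
assert fd3 == fd1                    # the freed slot is reused immediately
os.close(fd2)
os.close(fd3)
```

io_uring's registered-file tables sidestep this: indices into a per-ring table are handed out without consulting the shared descriptor table at all.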