The Windows approach to this problem deserves more description: instead of waiting for an fd to become IO-ready and then operating on it, Windows queues up an IO operation and delivers a notification when it completes. This notably doesn't suffer from select's spurious-wakeup problem, and it also solves the problem of having to break up writes: since you don't have to worry about a write blocking halfway through, you just get one notification when the whole buffer has been written.
POSIX specifies AIO for much the same purpose, but Linux AIO is implemented in user space with threads and locking, which rather defeats the purpose. There's work on this front, but each filesystem has to implement support for it, so it'll be a while, if ever, before it's generally usable. This also doesn't cover the case of an asynchronous accept().
Forcing read IO to queue up turns out to have its own interesting drawbacks, as I discovered through reading Goroutines, Nonblocking I/O, And Memory Usage. The short version is that it forces you to commit memory buffers up front for the read, instead of obtaining them only when you know you’re about to read some data. If you have a lot of generally inactive connections, this can add up.
Good point. A fix for this would be to add an aio_read_available() call that notifies (once!) when a file is read-ready - it would thus behave like the EPOLLEXCLUSIVE flag, except using the same queue as the rest of AIO. AIO would then offer a performance trade-off: if you want to keep memory usage low, you can allocate buffers on demand but suffer extra round-trips to userspace, or you can allocate the buffers up front and let a large read stay kernel-side. Right now you can do roughly the former with select(), but the latter requires blocking read/write with userspace threads, which is way more expensive than these operations should be.
The glibc implementation of POSIX AIO is thread-based, yes, but Linux also offers true kernel-level AIO with the io_submit family of system calls. (Though libaio, through which one typically uses them, is a bit under-documented and kind of a pain to use.)
That’s what I meant by “work on this front.” As I understand it the issue is that filesystem drivers are actually written in a blocking style kernel-side, so to support AIO each individual filesystem needs to add a new nonblocking call. ext4 and XFS have implemented this; most others haven’t. The libaio calls also silently degrade to synchronous IO on sockets, which is where select() is relevant.
My understanding is that making a filesystem do truly asynchronous read IO is quite hard and winds up requiring a lot of code restructuring. The big problem is that all of the internal IO the filesystem may have to do to retrieve metadata (such as indirect block pointers) now has to be changed from inline synchronous IO (where the code does ‘read this block; wait for the block to be ready; repeat’) into a whole stack of asynchronous callbacks or an explicit state machine for the various steps of the overall IO.
(Write IO is often easier. Data writes themselves normally get delayed anyway, and deferring block allocation until the time of the actual disk writes has benefits well beyond just enabling asynchronous IO. So all you may need to do at write() time is make sure there’s enough free space left and reserve some of it.)
This was a great article and finally explains to me why select is “slow”. Does anyone know a similarly good article on kqueue?