1. 17

  2. 4

    I’m not very experienced with I/O multiplexing, but lately I’ve been doing some work in the area. For my uses a simple poll is enough, but it gets ugly, because file I/O has its own special blocking behaviour (or perhaps everything else is special, and file I/O is not).

    I was thinking about what I would like the API to look like. Something like this:

    fd = pollpipe(timeout /* ? */);
    struct pollfd fds[] = { /* all fds, as in normal poll */ };
    write(fd, &fds, sizeof fds); /* the kernel updates the monitored events for each supplied fd */
    /* so 0 in events would remove an fd from monitoring */
    while (1) {
            struct pollfd buf[1024]; /* or whatever size is needed for read to be atomic */
            nbuf = read(fd, buf, sizeof buf); /* returns only the events that occurred */
            for (i = 0; i < nbuf / sizeof(buf[0]); i++) {
                    /* handle buf[i] */
            }
    }

    I think it would be quite a simple and clean API. On Linux you could then monitor timerfd, signalfd, and whatever else you like. It would also mean file fds are no longer treated as special snowflakes.

    The question is: would that make sense? Could problems described in this article occur?

    1. 5

      You just redeveloped “/dev/poll” from Solaris, from which Linux’s epoll() API was derived.

      The problem is when you have more than one thread waiting on the same file descriptors. All the scenarios in the article involve more than one thread.

      1. 2

        Thanks for the answer! Now I know that I would like Solaris, as I think /dev/poll would serve many cases (like mine). Too bad, since it is an improvement on poll that at least fits in among the basic Unix APIs. I would welcome its addition.

        But if it had pipe semantics, wouldn’t it be different? As I understand it, with pipes only one reader will take what’s in the pipe. Is that not possible with a device file? If every thread has its own poll fd, the kernel could still know to push each event through only a single connection. Am I wrong?

        As for the thundering herd problem: if it were a special file descriptor, the kernel could know that it should wake up only one thread.

    2. 1

      Discussion on the previous article in this series here for background.

      1. 1

        I know this was posted some time back, but I want to point out that the blog post has some inaccuracies / misunderstandings. For example:

        “Without EPOLLEXCLUSIVE, similar behavior it can be emulated with edge-triggered and EPOLLONESHOT, at a cost of one extra epoll_ctl() syscall after each event”

        This is not correct. EPOLLEXCLUSIVE causes an event to be delivered to only one of the multiple epoll instances with which the fd is registered, whereas EPOLLONESHOT is effective only within the individual epoll instance for which it is specified. So, if you actually have multiple epoll instances, each with the listening socket registered, EPOLLONESHOT cannot prevent the “thundering herd” problem.

        What I think is missing from the discussion is the nature of the two flags and their respective use cases. One is for separate epoll instances, which applies when you fork for parallel execution; the other is for a single epoll instance with multiple threads polling events from it. Once you understand this, the interface isn’t actually all that broken, even if it’s not entirely problem-free. For a single-process, multi-threaded program (with a single epoll instance), you just use EPOLLONESHOT, and you can even overcome the starvation problem by re-enabling events after successfully accepting a few connections (i.e. if it appears that there is high load, allow other threads to accept connections). For a multi-process program with one epoll instance per process, use EPOLLEXCLUSIVE for listening sockets that are shared between processes.