1. 19
    1. 3

      Don’t use select() anymore in 2021. Use poll(), epoll, iouring, …, but for heaven’s sake don’t use select().

      select() is significantly faster than poll() and if you need to accept() new connections, it’s faster than epoll() as well.

      For many applications you are almost certainly much better off using multiple processes and SO_REUSEPORT.

      iouring, on the other hand, is very exciting, and I welcome the 1990s Windows NT I/O model finally coming to Linux; I just wish the interface was nicer.

      1. 4

        select() is significantly faster than poll() and if you need to accept() new connections, it’s faster than epoll() as well.

        I’ve not heard that before, do you know why? I learned about these at about the time kqueue was introduced in FreeBSD, so I never used select or poll in anger, but even in my undergrad UNIX course I was told ‘don’t use select’. Both were in POSIX 2001, and so poll was the lowest-common-denominator for all of the UNIX systems I’ve used and the thing I’ve used for fall-back code when kqueue wasn’t available.

        1. 3

          Would it be a good guess that you took your undergrad UNIX course in the… very late 90s, or more likely very early 00s :-D?

          I suspect this is a curious case of system culture at work here. I was also told “don’t use select” but I didn’t learn network programming on Linux. On the other hand, I’ve seen at least a generation of fresh grads who only knew about select and poll, because Linux’s epoll has a pretty troubled history that lots of universities with Linux-only labs didn’t want to expose undergrads to. At best, they knew there’s also epoll, which is faster but has some problems, and they’d also maybe heard about kqueue and I/O completion ports, but they’d never used either.

          There’s a whole generation of programmers that grew up “knowing” select is the only reliable choice. I don’t want to dispute the performance claims in this thread (I don’t think I’ve ever written software that had to handle more than 100 connections, let alone 10,000 :-) so I don’t know enough about the performance implications of either) – what I can say is that, for every person who’s heard “don’t use select” in their undergrad course, there’s at least one person who’s heard “don’t use kqueue/epoll/whatever unless you know what you’re doing” in their undergrad course.

        2. 2

          I’ve not heard that before, do you know why?

          I presume you’re asking why is it faster? I know some of the reasons, but maybe not all of the reasons:

          1. You have a savings on syscall counts.

          2. Scanning a bit-array is faster than chasing a linked-list. A lot faster.

          Benchmarking this stuff is pretty tricky.

          But maybe you meant something else?

          1. 1

            I presume you’re asking why is it faster? I know some of the reasons, but maybe not all of the reasons:

            Yes, why select is faster.

            You have a savings on syscall counts.

            Not for select vs poll, these have the same number of calls. For select vs kqueue, you have fewer syscalls for kqueue but the select call has to recreate a load of state (which involves acquiring multiple kernel locks) on every call, whereas this state is persistent in the kernel with kqueue.

            Scanning a bit-array is faster than chasing a linked-list. A lot faster.

            None of these mechanisms involve a linked list. Select uses a bitmap, poll uses an array, kqueue involves an array for registering and then it’s up to the kernel to pick the optimal data structure for maintaining the state. With both poll and select, you need to look up the files in the file descriptor table (which involves lock acquisitions or RCU things, depending on the kernel), then lock the file structure, then query it. With kqueue, the pending events are registered in the kqueue object in the kernel when they appear and the kernel doesn’t need to acquire any locks to check for events on objects that don’t have pending events.

            Even between select and poll, it’s not clear that walking the data structure passed in from userspace is faster, unless the occupancy for select is high. Select will stress your branch predictor if you’re scanning each bit and branching on it, so you’re likely to see some mispredictions, whereas the bottleneck from parsing the poll structure is more likely to be the extra cache misses.

            1. 2

              For select vs kqueue, you have fewer syscalls for kqueue

              Incorrect. After accept() you need to add the new fd to the kqueue with EV_SET and a kevent() call to get notified. If you have a lot of new connections then this approaches double the number of syscalls, but if you don’t, you’re also likely to have few fds.

              None of these mechanisms involve a linked list.

              Incorrect. Look at what the kernel does around poll: https://elixir.bootlin.com/linux/v5.7/source/fs/select.c#L138

      2. 4

        This claim is missing indicators of scale.

        For a very few FDs, I can well believe that select() is faster than the others. But once you start dealing with hundreds or thousands of FDs, the repeated copying across the user/kernel boundary of the complete list of FDs on every select() call drowns out everything else and your performance suffers.

        So there is a fixed overhead to using an epoll()/kqueue() setup for the extra system calls for state management, but after that they scale significantly better.

        1. 2

          This claim is missing indicators of scale.

          At what “scale” do you think what I said doesn’t hold true? Do you have benchmarks explaining exactly what you mean?

          For a very few FDs, I can well believe that select() is faster than the others. But once you start dealing with hundreds or thousands of FDs, the repeated copying across the user/kernel boundary of the complete list of FDs on every select() call drowns out everything else and your performance suffers

          select() doesn’t copy “the complete list of FDs”: 1024 file descriptors is 128 bytes because each bit is given a position in the fd_set. poll() on the other hand, chases 1024 pointers through a linked list.

          So there is a fixed overhead to using an epoll()/kqueue() setup for the extra system calls for state management,

          If you are dealing with a large number of accept() calls, the “fixed” overhead isn’t fixed at all.

          but after that they scale significantly better.

          There’s that word again. What do you mean by “scale”?

          The only benchmark I’m aware of that has epoll() beating select() is when you have mostly-idle connections, and it’s from 2004, and I’ve never been able to reproduce it:

          https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.538.1749

          1. 4

            What linked list? poll(2) takes a flat array of struct pollfds. struct pollfd doesn’t have any pointers in it. There isn’t a linked list here on the userland side of this interface. (I have no idea about the kernel.)

            It’s a little bit of a shame that it isn’t SoA (structure-of-arrays) layout, so walking the array looking at revents fields isn’t as dense as a bitset. Arguably a waste of cache-line fill rate. But it’s not a “throughput dropped to 1/memory-latency” hell like a linked list puts you into.

            I suspect the limitation of fds having to be under 1024 (FD_SETSIZE) is a large reason why people who write or teach networking tutorials say you might not want to use it. That lesson then gets half remembered for years.

            An example I’ve heard of that causing crashes in production: https://rachelbythebay.com/w/2011/06/02/fdsetsize/ Process has about 1020 fds open for different files that it wants to access. It accept()s a connection, which gets fd 1025. It then tries to select(), which goes poorly. (And it ideally should have segfaulted when it tried to FD_SET() just before the select() call.)

            If it had been written with poll(2) instead, then…

            struct pollfd pfd;
            pfd.fd = 1025;
            pfd.events = POLLIN | POLLOUT;
            int result = poll(&pfd, 1, -1);
            

            … would probably have been fine.

            Aside, I’ve used kevent/kqueue once or twice and as far as I remember it’s pleasant enough. Kind of a shame Linux didn’t copy it, then we could have the same interface on every unix-like.

            1. 3
              1. 1

                I’m surprised that cargo has that many fds open at once. Thank you for the interesting tidbit

            2. 1

              What linked list?

              The one the kernel uses to implement poll() around https://elixir.bootlin.com/linux/v5.7/source/fs/select.c#L138

              If it had been written with poll(2) instead, then…

              Another option would simply be to allocate the fd_set* on the heap according to the number of file descriptors used instead of stack-allocating it.

              This is sadly a very common mistake.

              1. 1

                Ew. But eh it’s probably fine, the first sixteen or so entries are allocated in a single inline chunk. This interface isn’t great for large numbers of mostly idle sockets anyway.

                1. 1

                  No it is not, and to the best of my knowledge, if you have a large number of idle sockets, epoll/kqueue will beat select().

              2. 1

                Another option would simply be to allocate the fd_set* on the heap according to the number of file descriptors used instead of stack-allocating it.

                That’s not legal per the documentation: e.g., FD_ZERO doesn’t take a size parameter, just a pointer, and nobody’s actually supplying you with a guarantee that the number of bytes you need to allocate is (nfds + CHAR_BIT - 1) / CHAR_BIT.

                1. 1

                  That’s not legal per the documentation

                  It’s perfectly legal: Ask any lawyer if you’ll get arrested for doing it, and they’ll wonder why you’re even bothering to ask. :)

                  POSIX allows an implementation to define an upper limit, advertised via the constant FD_SETSIZE, on the range of file descriptors that can be specified in a file descriptor set, but it does not require an implementation to define an upper limit, and in fact, no unixish system you’re likely to run into does. On BSDs (including OSX), you can even #define FD_SETSIZE before including <sys/select.h> to get fd_set objects of the desired size.

                  I wouldn’t worry too much about hypothetically POSIX-conforming systems like Windows NT 4, because POSIX conformance doesn’t mean it will work, or work fast, especially for stuff like this: you’re not even guaranteed to be able to get 1024 file descriptors in POSIX.

                  FD_ZERO doesn’t take a size parameter,

                  Correct. You should use memset instead of FD_ZERO if you are doing this.

                  1. 1

                    You still don’t actually have a macro or function or anything that is documented to work for getting you the right size for the fd_set allocations. This seems like a lot of unportability to put up with for something that isn’t even going to be particularly fast.

                    1. 1

                      This seems like a lot of unportablility to put up with for something that isn’t even going to be particularly fast.

                      Yes. Un-portable to systems that don’t exist, and “particularly fast” is always relative. Benchmark, don’t speculate.

                      For my use case it adds about 10% on qps, which is basically a 10% cost savings.

                      You don’t have to like it, and heck, I don’t like it either, but some people have work to do, and this shit matters.

          2. 3

            I don’t have current numbers, only the classic C10K problem paper and any updates linked from there. (Sorry, I don’t mean to be dismissive, I’m just very busy today and dashed off a quick reply earlier without spending time on it.)

            The poll(2) system call doesn’t work well and it’s been decades since I last saw anyone recommend it (1990s). The epoll(2) and kqueue(2) calls, on the other hand, don’t have those problems. But managing all this is why we have libevent/libev2/whatever-new-hotness.

            I confess, I haven’t re-tested for myself in a very long time. I recall seeing select() scale poorly to even a couple of thousand connections and kqueue() fixing it.

            I think I was wrong to ascribe the cost of the complete list to the actual copying across the userland/kernel boundary. Now that I pause to try harder to remember, it’s the cost of setting up the kernel’s data structures that was dominant. The earlier designs tried to avoid keeping state inside the kernel across multiple select() calls, feeling that state “on the other side” was wrong. Later work admitted that no, it’s better to keep state and reference a handle which you can use in add/remove operations. Which feels much like some of the changes in NFS: people tried hard to avoid remote-side state before discovering that the lack of state causes more problems than it solves.

    2. 3

      Until reading this post, my shell startup raised the soft limit to the hard limit for file descriptors, and a few other things. This is a well-written, well-reasoned post which led me to change my configs.

    3. 1

      It took me a while to realize: this is a per-process issue. If your process doesn’t use a lot of FDs, then feel free to use select() if you like; there’s no way that external conditions can ramp up the FD numbers.