1. 32
  1. 42

    This article is kinda the classic example of calling an API you disagree with technical debt.

    1. 8

      Your remark gave me new insight into why I hate the term ‘technical debt’. I imagine accountants have fairly objective criteria to determine what debt is. ‘Technical debt’ is usually created by someone vehemently claiming it is debt. Like ‘software engineering’, it’s another example of the software industry stealing terminology from other fields to make itself look better.

    2. 23

      Full disclaimer: I’ve never written code with epoll directly, only kqueue. Even then, most of my interactions are abstracted by libraries like libevent and the like.

      This article doesn’t make an actual case for claiming kqueue is a “mountain of technical debt.”

      This means that any time you want kqueue to do something new, you have to add a new type of event filter.

      As opposed to adding a new syscall in Linux? If so, I’d rather hear about why one approach is better/worse.

      The conclusion is terse and I’m not sure even correct:

      Hopefully, as you can see, epoll can automatically monitor any kind of kernel resource without having to be modified, due to its composable design, which makes it superior to kqueue from the perspective of having less technical debt.

      epoll magically monitors anything without modification? Maybe…but if you need to add new syscalls, doesn’t that just mean the work is being done elsewhere?

      It feels like there’s a good, valid comparison to be made between some of the capabilities of epoll (e.g. inotify) lacking in kqueue, but these aren’t really dug into.

      1. 15

        I’m with you. And having used both epoll and kqueue, I have to say that this is the first time I’ve ever seen someone hold up epoll as the API without technical debt. This (admittedly slightly incendiary) article, for example, goes into why epoll had to add EPOLLONESHOT and EPOLLEXCLUSIVE flags, and how, even with those, it’s still a very hard API to use correctly. And even there, Linux finally adopting a more IOCP-like API in io_uring seems to validate that even Linux devs found some things lacking with epoll.

        I’m not saying kqueue doesn’t have flaws, but I don’t think this article makes its point well—and my own experience using the APIs in the real world goes rather the opposite direction.
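For readers who have not met EPOLLONESHOT, here is a minimal sketch of the re-arming behaviour it introduces. It assumes Linux and uses Python's `select.epoll` wrapper; the function name `oneshot_demo` is made up for illustration and does not come from the linked article.

```python
import os
import select


def oneshot_demo():
    """Return what epoll reports before and after re-arming a one-shot registration."""
    r, w = os.pipe()
    ep = select.epoll()
    try:
        ep.register(r, select.EPOLLIN | select.EPOLLONESHOT)
        os.write(w, b"x")
        first = ep.poll(0.1)   # the readiness event is delivered once
        second = ep.poll(0.1)  # registration is now disarmed, although data is still unread
        # Re-arm explicitly (EPOLL_CTL_MOD); forgetting this step silences the fd.
        ep.modify(r, select.EPOLLIN | select.EPOLLONESHOT)
        third = ep.poll(0.1)
        return first, second, third
    finally:
        ep.close()
        os.close(r)
        os.close(w)
```

The second poll comes back empty even though the pipe is still readable, which is the kind of bookkeeping the parent comment calls hard to get right.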

        1. 6

          The main difference is that kqueue event filters are coupled to kqueue, while “magic” file descriptor types are just like any other kernel handle – can be used with select and poll too, can be passed around between processes, and so on.

          Of course event filters can sort of be converted into file descriptors by creating a kqueue per filter :) but event filters were not designed for fd passing so that doesn’t work, etc. (The reason we added native eventfds to FreeBSD is that the epoll-shim userspace emulation of it was not fd-passable, for example)

          1. 4

            The main difference is that kqueue event filters are coupled to kqueue, while “magic” file descriptor types are just like any other kernel handle – can be used with select and poll too, can be passed around between processes, and so on.

            Although worth noting that while the design allows this, this is a good way to shoot your foot clean off because epoll’s semantics subtly differ from what its API suggests – while it appears to deal in file descriptors, what epoll actually registers and operates on are the underlying kernel objects the file descriptors descript. The negative consequences of this mean that epoll is not really designed for fd passing to work well in most cases any more than kqueue is.

            It’s incredibly easy to pass around an fd to something that (unknown to you) dup’s it, close said fd when you want to unregister your subscription to its events, and then forever after receive events about the object because while you no longer have a handle pointing at it, the underlying kernel object’s refcount is still > 0.

            It’s a big enough issue that Illumos’ epoll compatibility call intentionally breaks compatibility in this respect because it’s such a foot-gun https://illumos.org/man/5/epoll

            While a best effort has been made to mimic the Linux semantics, there are some semantics that are too peculiar or ill-conceived to merit accommodation. In particular, the Linux epoll facility will – by design – continue to generate events for closed file descriptors where/when the underlying file description remains open. For example, if one were to fork(2) and subsequently close an actively epoll’d file descriptor in the parent, any events generated in the child on the implicitly duplicated file descriptor will continue to be delivered to the parent – despite the fact that the parent itself no longer has any notion of the file description! This epoll facility refuses to honor these semantics; closing the EPOLL_CTL_ADD’d file descriptor will always result in no further events being generated for that event description.
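The behaviour described above can be reproduced in a few lines. This is a sketch assuming Linux and Python's `select.epoll`; `os.dup` stands in for whatever code duplicated the descriptor behind your back, and `close_without_del` is an illustrative name.

```python
import os
import select


def close_without_del():
    """Close a registered fd while a duplicate exists, then see what epoll reports."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    r_dup = os.dup(r)   # second descriptor for the same open file description
    os.close(r)         # intended as "stop watching"; the registration survives
    os.write(w, b"x")
    events = ep.poll(0.1)
    ep.close()
    os.close(r_dup)
    os.close(w)
    return r, events
```

The returned event still names the closed descriptor number, so the caller is told about an fd it no longer owns.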

            1. 2

              the Linux epoll facility will – by design – continue to generate events for closed file descriptors where/when the underlying file description remains open

              Yeah, that sounds bad.

              epoll is not really designed for fd passing to work well in most cases any more than kqueue is

              I’m not saying epoll is. The various something-fd’s are, no matter what you poll them with.

              1. 1

                close said fd when you want to unregister your subscription to its events

                That’s not correct usage of epoll, though. To unregister your subscription to some events, you need to call epoll_ctl(EPOLL_CTL_DEL) before closing the fd. If you do that, everything is fine. Trying to skip doing that and instead just close the file descriptor directly has always been wrong.
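A sketch of the ordering described here, under the same assumptions as before (Linux, Python's `select.epoll`, where `unregister` issues EPOLL_CTL_DEL). The name `del_then_close` is made up for the example.

```python
import os
import select


def del_then_close():
    """Unregister before closing; a lingering duplicate then causes no stray events."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    r_dup = os.dup(r)   # a duplicate exists, as in the problematic scenario
    ep.unregister(r)    # EPOLL_CTL_DEL while the descriptor is still valid
    os.close(r)
    os.write(w, b"x")
    events = ep.poll(0.1)
    ep.close()
    os.close(r_dup)
    os.close(w)
    return events
```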

                1. 5

                  That’s true but kind of tangential to the point, which is that epoll only gives the appearance of being an API designed around file descriptors. You have to use EPOLL_CTL_DEL precisely because it actually operates on file descriptions, and the subtleness of that distinction and the misleading nature of the API “being designed around descriptors” is why people don’t remember or necessarily understand that they have to use EPOLL_CTL_DEL – unless and until you dup the descriptor, everything appears to work just fine with just closing it.

                  It’s always been wrong to shoot your own feet with a footgun too, but the design’s still the fundamental problem.

                  1. 3

                    No, the “footgun” of epoll here is no different from other footguns related to duplicating file descriptors, which for better or worse is very powerful but also requires a certain level of careful programming.

                    For example, a novice might assume that they can pass a file descriptor pointing to some file to a child process, which will duplicate the file descriptor, and then in the parent read and write that file independently with their original file descriptor. Of course, this will fail badly, since both the parent and child will update the file pointer stored in the open file description independently - even though the child and parent are operating on different file descriptors.
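A small sketch of that shared file pointer. `os.dup` is used here in place of inheritance across fork, on the assumption that both produce descriptors for the same open file description; `shared_offset` is an illustrative name.

```python
import os
import tempfile


def shared_offset():
    """Read through two descriptors that share one open file description."""
    with tempfile.TemporaryFile() as f:
        f.write(b"abcdef")
        f.flush()
        fd = f.fileno()
        dup_fd = os.dup(fd)
        try:
            os.lseek(fd, 0, os.SEEK_SET)
            first = os.read(fd, 3)       # advances the shared offset to 3
            second = os.read(dup_fd, 3)  # continues from 3, not from 0
            return first, second
        finally:
            os.close(dup_fd)
```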

                    Or a novice might duplicate the write end of a pipe without realizing that that’s a problem, close the original write end, and then wait forever on the read end of the pipe for a HUP that will never come.
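The pipe case can be sketched the same way. A non-blocking read is used instead of waiting for a HUP so the example cannot hang; `eof_needs_all_writers_closed` is a hypothetical name.

```python
import os


def eof_needs_all_writers_closed():
    """EOF on a pipe arrives only after every duplicate of the write end is closed."""
    r, w = os.pipe()
    w_dup = os.dup(w)
    os.close(w)                 # the "original" write end is gone
    os.set_blocking(r, False)
    try:
        os.read(r, 1)
        before = "eof"
    except BlockingIOError:
        before = "would block"  # the duplicate keeps the pipe open for writing
    os.close(w_dup)
    after = os.read(r, 1)       # empty bytes now signal EOF
    os.close(r)
    return before, after
```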

                    Powerful features, in traditional C-based Unix-like systems without static analysis and type systems, unfortunately require careful programming. epoll is a powerful feature, and the fact that it (like the rest of Unix/Linux) operates on open file descriptions means that it requires careful programming. The alternative implemented in Illumos is strictly less powerful (and also slower) which means it requires less careful programming. That, alone, doesn’t mean it’s better or worse.

            2. 4

              The biggest practical issue is that kqueue requires a file descriptor per file if you’re watching files, which can be problematic for some use cases. This, however, seems like a fixable problem to me and hardly a “mountain of technical debt” (not sure why this hasn’t been done yet actually, so maybe it’s hard and there is a bit of technical debt here).

              1. 1

                kqueue requires a file descriptor per directory if you’re watching files

                Hm. I thought inotify did too.

                1. 2

                  kqueue is a fd for every file, not directory; I remembered that wrong 😅 You can run into problems with this using Dropbox, syncthing, and such.

                  1. 1

                    My vague very-likely-incorrect memory was that with both of them you need to watch the directory above the file you care about. Otherwise I thought you didn’t necessarily see events when some other process rename()s a new file over the one you’re watching, because you were watching the inode of the replaced file?

                    But there is a “deleted” event you can watch for with inotify so perhaps that covers it / tells you when the inode’s refcount changes.

                    1. 8

                      inotify is quite a problematic API. It is path based, but a *NIX filesystem is not a tree, it’s a DAG. If you have a file that has two hard links and you use inotify to watch one, then you won’t see modifications through the other. Using kqueue, you’ll see modifications to the file but at the cost of needing one file descriptor (which comes with a small chunk of kernel memory) for every single file in a tree. I think that macOS has a stronger reverse mapping in the VFS layer, which helps their equivalent API.
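To make the hard-link scenario concrete, here is a sketch that only sets up the situation: one inode reachable through two names, modified through the name a path-keyed watcher would not be looking at. It does not call inotify or kqueue, and `hard_link_demo` is an illustrative name.

```python
import os
import tempfile


def hard_link_demo():
    """Modify a file through one hard link and observe it through the other."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("v1")
        os.link(a, b)                # second name for the same inode
        with open(b, "w") as f:      # the change goes through the *other* path
            f.write("v2")
        same_inode = os.stat(a).st_ino == os.stat(b).st_ino
        with open(a) as f:
            content = f.read()
        return same_inode, content
```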

                      I think you have a choice in designing such an API between accuracy and overhead. XNU’s FSEvents aims to be efficient and give false positives. Kqueue is accurate but (very) high overhead. inotify is efficient but gives false negatives. As a userspace developer, FSEvents is probably closer to what I want: the overhead for scanning a file and determining it hasn’t changed is lower than the overhead of missing an update and needing to rescan everything. For watching config file changes, kqueue is fine because although it’s a high overhead per file, the number of files is small.

              2. 2

                As opposed to adding a new syscall in Linux? If so, I’d rather hear about why one approach is better/worse.

                Adding a new fd type can be self-contained in theory. It’s probably easier to add a new fd type in a kernel module than to add a new kqueue filter type (though it’s worth noting that Linux makes this hard by not exposing a load of the functions for fd-table manipulation to modules). The real problem with both is that they need to fit into a quite constraining existing shape. File descriptors require you to read or write them, when you may want an interface that isn’t stream-oriented. kqueue lets you define a richer vocabulary of verbs, but constrains your data to the shape of struct kevent, so if you want to communicate more data than a handful of words, it’s problematic.

                You can see these limitations in the integration of signal handling with kqueue and epoll. With kqueue, you can register for signals and you get a count of the number of times the signal has been raised. That’s it, because that’s all that fits into the kevent structure. It’s a nice API if you want a coalescing mechanism for signals, but it’s very limited. I believe it was originally added for AIO callbacks (I might be completely wrong, but I think EVFILT_AIO uses the same mechanism and EVFILT_SIGNAL is just there because most of the plumbing had been done for EVFILT_AIO already, so it was effectively free). In contrast, epoll requires you to use signalfd. This means that you get an epoll notification that there is something ready to read and then you must do multiple read system calls to get the signal info. You get more information from the signalfd approach, but you also need more system calls and you are using a stream-oriented interface to get fixed-sized messages from the kernel.
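A sketch of the signalfd side, showing the fixed-size record that comes out of a `read`. Python has no signalfd wrapper, so this goes through `ctypes` and assumes Linux with glibc (a 128-byte buffer is assumed to be large enough for `sigset_t`). `read_one_signal` is a hypothetical helper, and only the first field of the record is decoded.

```python
import ctypes
import os
import signal
import struct

libc = ctypes.CDLL(None, use_errno=True)
SIGINFO_SIZE = 128  # sizeof(struct signalfd_siginfo) on Linux


def read_one_signal():
    """Block SIGUSR1, raise it, and read it back as a signalfd record."""
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    mask = ctypes.create_string_buffer(128)  # assumed big enough for glibc's sigset_t
    libc.sigemptyset(mask)
    libc.sigaddset(mask, int(signal.SIGUSR1))
    fd = libc.signalfd(-1, mask, 0)
    if fd < 0:
        raise OSError(ctypes.get_errno(), "signalfd failed")
    try:
        signal.raise_signal(signal.SIGUSR1)    # stays pending because it is blocked
        record = os.read(fd, SIGINFO_SIZE)     # one read yields one fixed-size record
        (signo,) = struct.unpack_from("I", record)  # ssi_signo is the first uint32
        return len(record), signo
    finally:
        os.close(fd)
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
```

The descriptor is read like a byte stream, yet every message is a 128-byte struct, which is the mismatch the comment points at.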

                It’s interesting to consider as a thought experiment what would happen if you tried to add futex / _umtx_op support to both. On Windows, you can use WaitForMultipleObjects to wait for n mutexes and other I/O interfaces, but *NIX systems don’t tend to have that ability. Linux had futexfd in the 2.6 series and removed it because it was inherently racy. Just adding the file descriptor to epoll does not have enough information to register for the event and specify the futex operation and existing value, but if you separate the wait into an operation on the file descriptor and the notification mechanism then it’s hard to get the required atomicity.

                In contrast, I think it would be very easy to fit this kind of notification into the kqueue interface. The kevent structure has enough space to specify the address, the value, and the operation. When you do the kevent system call, the kernel has everything that it needs to do the equivalent of a _umtx_op system call to atomically check the value and register for the event. There’s no reason that you couldn’t have an EVFILT_UMTX that would let you wait for multiple userspace mutexes / semaphores / whatever. I think you’d probably need to require that it used EV_ONESHOT, but that’s a fairly minor restriction.

                The root cause of this difference is that not every event source has a persistent kernel object associated with it. In the case of a futex, the lock object only exists (from the kernel’s perspective) as long as one or more threads is waiting on it. This is intentional: userspace can have as many mutexes as it has memory for and it only needs the kernel to pay attention to them when they’re blocking, so the amount of kernel state is significantly lower than for pure-kernel mutexes.
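The atomicity requirement in this thought experiment (compare the value and start waiting as one step) can be modelled in userspace. This is a toy built on `threading.Condition`, not a kernel interface: `ToyFutex` and its return strings are invented for illustration, and nothing here corresponds to a real EVFILT_UMTX.

```python
import threading


class ToyFutex:
    """Userspace model of 'wait only if the value is still what I expect'."""

    def __init__(self, value=0):
        self.value = value
        self._cond = threading.Condition()

    def wait(self, expected, timeout=None):
        # The comparison and the decision to sleep happen under one lock,
        # so a wake() cannot slip in between them.
        with self._cond:
            if self.value != expected:
                return "changed"  # analogous to a futex wait failing immediately
            woken = self._cond.wait(timeout)
            return "woken" if woken else "timeout"

    def wake(self, new_value):
        with self._cond:
            self.value = new_value
            self._cond.notify_all()
```

Splitting `wait` into a separate "check" call and a later "register for notification" call would reopen exactly the race the comment describes.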

              3. 20

                I thought it was interesting that the author took FreeBSD adding eventfd support as a sign that they might be moving away from their current architecture. I don’t work on or with FreeBSD, but I can tell you when we add decidedly Linux-ish APIs like that to illumos, it’s almost invariably to cut down on the effort required to port some Linux-centric body of software and not because the API is, itself, some absolute good. We also have an implementation of epoll, for instance, even though we would recommend people use event ports (a bit like kqueue) wherever they are able.

                1. 6

                  Here is the revision where it was added, and it seems the reason is, as you say, to make porting easier. https://reviews.freebsd.org/D26668

                2. 2

                  I’m an active developer on libkqueue. libkqueue is a userland translation library for Linux, Windows and Solaris.

                  The author really misrepresents some of the capabilities of Linux. I’m going to use the proc filter here, as this is something I’ve been actively working on fixing in libkqueue for the past few weeks.

                  kqueue’s proc filter can monitor any process (that the calling process has visibility of) for exit, fork, signal reception, execve, and reaping. pidfd can only be used to monitor direct children of the current process, and only for exit. netlink is the only equivalent functionality to proc on Linux and it is by no means as usable. You end up drinking from the systemwide event hose (unless you feel like hand-crafting some BPF rules), and that’s only after you jump through the hoops of getting the right CAP permissions.

                  At the time of writing pidfd has only been in the kernel for about 2 years (first released in 5.3). You can pick it up just fine on ubuntu-20.04, but it’s absent on RHEL8. Before pidfd, if you had a program that wanted to be asynchronously notified of a process exiting, and you didn’t want to deal with netlink (or didn’t have the permissions to use it), you needed a horror like this. This code spawns a “waiter thread”, whose only purpose is to wait for SIGCHLD, then go scan all the PIDs other threads in the process want to be notified about, and notify them (via eventfd), that the child process has exited. As you can imagine, this does not scale well.

                  1. 2

                    UNIX uses the term file descriptor a lot, even when referring to things which are clearly not files, like network sockets

                    Nit: unix uses the term ‘file’ a lot, including to refer to things that are not persistent, hierarchically-accessed records, like network sockets.

                    1. 1

                      From my understanding, a file in Unix’s sense is essentially a stream of bytes.

                      1. 1

                        The term is not entirely well-defined (which is part of the problem). I think that, at the very least, a file encompasses a stream of bytes, but there are files which are not simply streams of bytes. Anything that can be ioctled, for instance.

                    2. 2

                      This was interesting and helpful for understanding for me, but I did not see anything about technical debt. Is there going to be another article in a series or something?

                      1. 2

                        The fact that there are multiple APIs for concurrent I/O is the technical debt.

                        1. 1

                          TFA seems to suggest that kevent/kqueue has more technical debt than epoll, so I don’t think that is the argument being made.

                          1. 5

                            The article author is biased. It’s easy enough to take a step back from their particular perspective and note that all of these multiple APIs are bad, and further that the forced choice of API is onerous for the programmer.

                            1. 1

                              What does a good API for concurrent IO look like, in your opinion?

                              1. 2

                                A few years ago, I would have said that a good concurrent I/O API should have two methods:

                                • Enqueue an action to occur in a fresh isolated concurrency domain, and return a handle for completing or cancelling the action
                                • Given some number of seconds, create a fresh handle which will complete after that time has elapsed

                                An example of this API in the real world is in Twisted Python, where the IReactorTime interface contains equivalent methods. This is the main low-level interface which is used to build concurrent applications with Twisted.
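As a rough illustration of the handle-returning shape described above, here is a sketch using the standard library's asyncio as a stand-in for Twisted (an assumption made to keep the example dependency-free). `loop.call_later` returns a handle that can be cancelled; it does not provide the isolated concurrency domains from the first bullet.

```python
import asyncio


async def main():
    loop = asyncio.get_running_loop()
    log = []
    loop.call_later(0.01, log.append, "fired")            # completes after the delay
    doomed = loop.call_later(0.01, log.append, "never")   # returns a handle...
    doomed.cancel()                                       # ...that can cancel the action
    await asyncio.sleep(0.05)
    return log


def run_demo():
    return asyncio.run(main())
```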

                                In the Spectre era, though, we need to privilege that second method; it shouldn’t be primitive. So, the core of the API is a single method which postpones actions until after the current computation has finished, optionally scheduling mutually-isolated actions to run simultaneously.

                                Now, let’s look at some technical debt. Twisted Python’s I/O core contains thirteen different implementations of an API which includes special methods for signals, sockets, timers, and file descriptors, since those are different on many different platforms. Integration with GUI libraries is ad-hoc and must be reinvented for each library. Windows support requires platform checks throughout the code; the select() wrapper used to be two modules (Windows and non-Windows) and I don’t know if it’s more or less readable when currently sewn together into a single module.

                                Given this history, it should now be understandable why the original author praises Linux for a unified file-descriptor approach. It’s less code for high-level networking libraries to maintain!

                                1. 1

                                  Thanks! Why do you think Spectre means we need to privilege that second method?

                                  1. 1

                                    Spectre is all about timer abuse. If we can reduce the ability of code to measure its own time taken to execute, then we can reduce the chances of introducing Spectre bugs.