1. 26

The received wisdom suggests that Unix’s unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.
As the designers and implementers of operating systems, we should acknowledge that fork’s continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.

    1. 12

      Not a fault of fork per se, but I am still amused by this API design bug:

      https://rachelbythebay.com/w/2014/08/19/fork/

      Guess what happens when you don’t test for failure? Yep, that’s right, you probably treat “-1” (fork’s error result) as a pid.

      Do you kill(pid, signal)? Maybe you do kill(pid, 9).

      Do you know what happens when pid is -1? You really should. It’s Important. Yes, with a capital I.

      If pid equals -1, then sig is sent to every process for which the calling process has permission to send signals, except for process 1 (init)
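
      A minimal sketch of the defensive pattern this implies, assuming a POSIX system. The helper names start_sleeper and kill_one are hypothetical, not from the linked post; the point is only that fork's -1 must never reach kill.

      ```c
      #define _POSIX_C_SOURCE 200809L
      #include <assert.h>
      #include <errno.h>
      #include <signal.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      /* Hypothetical helper: fork a child that just waits for a signal.
       * Returns the child's pid, or -1 (with errno set) if fork failed. */
      static pid_t start_sleeper(void)
      {
          pid_t pid = fork();
          if (pid == 0) {
              pause();   /* child: block until a signal arrives */
              _exit(0);
          }
          return pid;    /* parent: child's pid, or -1 on failure */
      }

      /* Hypothetical wrapper: only ever signal one specific process.
       * kill() gives 0, -1 and other negative pids special meanings
       * (process groups, or "everything I am allowed to signal"). */
      static int kill_one(pid_t pid, int sig)
      {
          assert(sig >= 0);
          if (pid <= 0) {
              errno = EINVAL;
              return -1;
          }
          return kill(pid, sig);
      }
      ```

      With this wrapper, a failed fork surfaces as an EINVAL from kill_one rather than as a signal sent to every process the caller may signal.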

      1. 4

        Fork+exec is indeed a liability these days.

        1. 6

          A critique of fork() from microsoft is disappointing since these systems abstractions aren’t universal truths and can only be judged meaningfully within the context of the cultures that created them. UNIX was designed by a culture of people who liked to create and compose lots of small programs and fork() works fine if that’s your preferred development methodology. Microsoft culture is to build big sophisticated monolithic programs like Word. Since Linux became so successful, all the people who come from those different cultures are now being asked to use it, and they’re not happy, because it was designed to accommodate a style that differs from their own.

          1. 25

            A critique of fork() from microsoft is disappointing since these systems abstractions aren’t universal truths and can only be judged meaningfully within the context of the cultures that created them.

            I don’t think an ad hominem is a very helpful start to a post, but I partially agree with the second part of this sentence. Systems abstractions are not universal truths but their value is universal within the context of the hardware on which they need to run. This changes over time and across market segments (abstractions for rack-scale computing don’t work well on embedded systems with significantly less than 1 MiB of RAM, for example).

            I work for Microsoft Research, but I held the opinion that fork is a terrible abstraction for at least 10, probably closer to 15, years before I joined Microsoft and it has absolutely nothing to do with Windows or the VMS / Mainframe culture. Prior to joining Microsoft I served two terms on the FreeBSD Core Team, so I hope that establishes my credentials as someone who understands the ‘UNIX culture’. I’ve also worked on a port of Linux to an environment with no MMU that aimed to run existing Linux software.

            These folks are not the only people to object to fork. Mothy has been complaining about it for years and there’s a huge section in the UNIX Haters’ Handbook (1994) about it.

            It’s easy to understand fork if you understand the hardware context in which it was created, which has nothing to do with the culture of the programmers. On PDP machines, you had one process executing in core memory and when you did a context switch you wrote it out to drum (or similar) memory and read another process in. On this hardware, fork is an obvious abstraction: you write out the process and then create a new copy of the kernel data structures. You now have a copy of the process that you’ve just written out sitting in memory for free; doing anything other than fork is more work.

            The fork abstraction started showing its age quite early. Once UNIX ran on systems with MMUs, you could keep two or more processes in core (a term that remains, even though core memory is a historical curiosity) at a time and context switching became just a matter of updating the segment descriptor table (or pointing to a different one). At this point, fork required copying the entire process and was expensive. With paged MMUs, it became a bit cheaper as you could ‘just’ mark every page as copy-on-write, but that complicated the VM subsystem. Two other advances made it even worse:

            • Threads were added to UNIX. Only the calling thread is copied by fork, so you may end up with locks held by other threads that you can’t unlock (especially if they’re error-checking pthread mutexes and abort if you try to unlock them from a thread that doesn’t own them). If you want to actually use the forked child rather than just call execve, you need every library that creates threads to be aware of this, release its locks, and then recreate its threads in the child, which is basically impossible to get right.
            • SMP came along and now updating page tables required cross-core synchronisation. This is particularly bad on x86 (except AMD Milan) or RISC-V, where you need an IPI to do TLB invalidates, so the cost of fork scales badly with both the size of the process and the number of cores. On a system with 128 cores and 1TiB of RAM, fork does a phenomenal amount of work to create a virtual memory environment that often lives for less than 1ms and which doesn’t touch more than a couple of pages.
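
            To make the threads point concrete, here is a rough sketch of what a single library has to do to keep one lock usable across fork, using pthread_atfork. The names lib_lock, lib_init, and child_can_lock are hypothetical, and this only covers one default (non-error-checking) mutex. It does nothing about recreating the library's threads in the child, which is the harder half of the problem.

            ```c
            #define _POSIX_C_SOURCE 200809L
            #include <assert.h>
            #include <pthread.h>
            #include <sys/types.h>
            #include <sys/wait.h>
            #include <unistd.h>

            /* Hypothetical library state guarded by a single lock. */
            static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

            /* atfork handlers cannot report errors, so failures are asserted. */
            static void lib_prepare(void)
            {
                /* Taken by the forking thread, so no other thread holds it
                 * at the moment the address space is copied. */
                int rc = pthread_mutex_lock(&lib_lock);
                assert(rc == 0);
                (void)rc;
            }

            static void lib_release(void)
            {
                /* Runs in the parent and in the child, on the thread that forked. */
                int rc = pthread_mutex_unlock(&lib_lock);
                assert(rc == 0);
                (void)rc;
            }

            /* Register the handlers once; returns 0 on success. */
            int lib_init(void)
            {
                return pthread_atfork(lib_prepare, lib_release, lib_release);
            }

            /* Fork and report whether the child could take the lock:
             * 1 = yes, 0 = no, -1 = fork or wait failed. */
            int child_can_lock(void)
            {
                pid_t pid = fork();
                if (pid < 0)
                    return -1;
                if (pid == 0)
                    _exit(pthread_mutex_trylock(&lib_lock) == 0 ? 0 : 1);

                int status;
                if (waitpid(pid, &status, 0) != pid)
                    return -1;
                return WIFEXITED(status) && WEXITSTATUS(status) == 0;
            }
            ```

            Every library in the process that owns a lock or a thread would need its own version of this, and they would all have to agree on ordering, which is why the approach does not scale.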

            Awareness of the pain of fork is not new. The vfork system call was introduced somewhere around 2.9BSD or 3BSD, putting it in or before 1980. The FreeBSD 1.0 (1992) man page says that vfork will be removed in a future release. In FreeBSD 10 (2014), we quietly removed that comment because vfork remains the best way of creating a child process on *NIX, in spite of being incredibly hard to use correctly (it can, at least, be used to create APIs that can be used easily).

            There are basically two ways of creating a process that don’t suffer from the problems of fork.

            The first (which is the direction VMS / NT went) is conceptually cleaner. You have a kernel API that creates a new empty process and returns a capability to it (which Windows calls a handle). You can use this capability to do things like map files or anonymous memory objects into the process, write initial state into that memory (e.g. stack contents), inject threads, and so on. The big disadvantage of this is that it requires remote versions of all of your APIs. On *NIX, mmap is how you modify your process’s address space but it doesn’t let you modify another process’s address space. Similarly, pthread_create lets you create a thread but it doesn’t let you create a thread in another process.

            In most *NIX systems, there’s a separation of concerns around threading where the kernel doesn’t know much about the thread other than that it is a schedulable entity and it has a register file that needs to be context switched when it is [de]scheduled. The threading library builds the pthread abstractions in userspace and the same kernel can run multiple different threading libraries. That kind of design makes it a bit more difficult to implement this model because even if there were a remote form of mmap and newthr / clone + CLONE_THREAD, the userspace threading library would have to be restructured to use these to inject remote threads and do things like copy TLS segments from the binary. This is simpler with Windows where there is precisely one threading library but that simplicity comes at the cost of flexibility.

            Designing these APIs to avoid confused deputy vulnerabilities is also nontrivial. If I create a new process owned by a different user, or with reduced privileges in some other way, then I need to ensure that any OS handles that I’ve created for the child were either with the reduced rights or were intentionally created with elevated privileges.

            The other approach is embodied by vfork. You let a process run with a different set of kernel state, apart from the memory mapping, which it still shares with its parent, and then launch the new process once it has modified the state that it wants to. This means that you don’t need remote versions of open, close, socket, and so on, because you’re setting up the kernel state using the local versions of the calls. The BSD incarnation of this, via vfork, is a bit limited in that the only way to end a vfork context is via execve, which creates a completely new memory map, so you can’t use this to set up shared memory regions between the parent and child. You also can’t create threads in the child. These things could be built with cooperation from the run-time linker, but they aren’t in any *NIX systems that I’m aware of.

            The advantage of vfork is that there’s a clear separation between things done with the parent’s rights and things done with the child’s rights. You can do things like call setuid or cap_enter in the vfork context and then anything done after this is done with the child’s permissions context. Copying the file descriptor table can still be quite costly, especially given that most child processes immediately call closefrom to close most of them after putting the small number that they want to explicitly inherit in the right places in the file descriptor table.
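
            A minimal sketch of the vfork pattern described above, assuming a BSD or Linux system where calling dup2 in the vfork context works in practice (the standards only promise execve and _exit there). The function name run_with_stdout is hypothetical. The child rewires its own stdout with the ordinary local dup2 and then ends the vfork context with execve.

            ```c
            #define _DEFAULT_SOURCE
            #include <assert.h>
            #include <sys/types.h>
            #include <sys/wait.h>
            #include <unistd.h>

            /* Run `path` with `out_fd` installed as the child's stdout.
             * Returns the child's exit status, or -1 on failure. */
            int run_with_stdout(const char *path, char *const argv[],
                                char *const envp[], int out_fd)
            {
                assert(path != NULL && argv != NULL);

                pid_t pid = vfork();
                if (pid < 0)
                    return -1;

                if (pid == 0) {
                    /* vfork context: the parent is suspended and we are running
                     * on its memory, so touch as little as possible. Only kernel
                     * state (the fd table) is changed here. */
                    if (out_fd != STDOUT_FILENO && dup2(out_fd, STDOUT_FILENO) < 0)
                        _exit(127);
                    execve(path, argv, envp);
                    _exit(127);   /* only reached if execve failed */
                }

                int status;
                if (waitpid(pid, &status, 0) != pid)
                    return -1;
                return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
            }
            ```

            A privilege drop such as setuid would go in the same place as the dup2 call, which is the separation between parent rights and child rights that the paragraph above describes.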

            The big downside of vfork is that it can affect the parent’s process state. For example, if you accidentally call malloc in the vfork context and then don’t free the memory, it’s leaked in the parent. This isn’t so bad with C++ and RAII, where you can allocate space on the stack for the arguments to execve and then do everything else in a nested block so that destructors all run before the execve call, but it’s painful to use in C, which is the language that UNIX was co-designed with. It’s probably fine in a garbage-collected language, as long as you don’t accidentally acquire any locks in the vfork context.

            Both of these approaches work well with any kind of process-isolation technology and even without an MMU. In the CHERI project, we’ve extended FreeBSD with a coexecve system call that creates a new OS process within the same address space as the parent, isolated using CHERI capabilities. This works well with vfork (we added a sysctl that turns any execve in a vfork context into a coexecve and most things seem to just work) but would be impossible to support with fork.

            TL;DR: The fact that fork is a bad design is fairly uncontroversial among systems researchers and has been well-known to UNIX kernel developers since before I was born.

            1. 1

              Two other ideas rattling around in my head, either of which could be implemented in user space:

              1. There’s a spawn() function which takes a pointer to some bytecode that will change kernel state. And while we’re at it, a pointer to a list of which file descriptors to share with the parent. Increment refcounts only for the listed descriptors.

              2. Have a spawn() syscall and also keep execve(). You spawn a wrapper process that will do all the kernel state manipulation and then execve() the final target.

              1. 3

                There’s a spawn() function which takes a pointer to some bytecode that will change kernel state

                This is more or less how posix_spawn is implemented on platforms that have a native implementation. The XNU implementation has some non-standard extensions that let you do things like change security context. It’s not really clear to me that this would be more ergonomic than allowing arbitrary code to run in a vfork context. FreeBSD doesn’t bother implementing posix_spawn in the kernel because it’s easy to implement in userspace: each of the spawn actions corresponds to a system call, you just do them all in a row.
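
                For comparison, a small sketch of the standard posix_spawn interface, where the "bytecode" is a list of file actions that the implementation replays in the child. The wrapper name spawn_with_stdout is hypothetical; the posix_spawn calls themselves are the POSIX ones.

                ```c
                #define _POSIX_C_SOURCE 200809L
                #include <assert.h>
                #include <spawn.h>
                #include <sys/types.h>
                #include <sys/wait.h>
                #include <unistd.h>

                /* Spawn `path` with `out_fd` installed as the child's stdout.
                 * Returns the child's exit status, or -1 on failure. */
                int spawn_with_stdout(const char *path, char *const argv[],
                                      char *const envp[], int out_fd)
                {
                    assert(path != NULL && argv != NULL);

                    posix_spawn_file_actions_t actions;
                    if (posix_spawn_file_actions_init(&actions) != 0)
                        return -1;

                    /* Each action is a recorded system call: here, one dup2. */
                    int rc = posix_spawn_file_actions_adddup2(&actions, out_fd,
                                                              STDOUT_FILENO);
                    pid_t pid = -1;
                    if (rc == 0)
                        rc = posix_spawn(&pid, path, &actions, NULL, argv, envp);
                    posix_spawn_file_actions_destroy(&actions);
                    if (rc != 0)
                        return -1;

                    int status;
                    if (waitpid(pid, &status, 0) != pid)
                        return -1;
                    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
                }
                ```

                Anything that has no corresponding action or attribute, such as entering a sandbox, cannot be expressed this way without an extension, which is the ergonomic trade-off against running arbitrary code in a vfork context.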

                And while we’re at it, a pointer to a list of which file descriptors to share with the parent. Increment refcounts only for the listed descriptors.

                Something like this is top of my list for unprivileged process creation: being able to provide an array of pairs of file descriptors and where they should go in the child process’s FD table. This would avoid the little sequence I have to do today of dup all of the file descriptors until none of them are in the range I want to set, dup2 them into the right place, and then closefrom all of the others.
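
                The sequence described above can be sketched roughly as follows. The struct fd_pair and the function remap_fds are hypothetical names for illustration. The final closefrom step is left as a comment, because closefrom is not available on every system.

                ```c
                #define _POSIX_C_SOURCE 200809L
                #include <assert.h>
                #include <fcntl.h>
                #include <stddef.h>
                #include <unistd.h>

                /* Hypothetical description of one inherited descriptor. */
                struct fd_pair {
                    int from;   /* descriptor in the current table */
                    int to;     /* slot it should occupy in the child */
                };

                /* Rearrange the calling process's fd table so that each pairs[i].from
                 * ends up at pairs[i].to. Meant to run in the child before exec.
                 * Rewrites pairs[i].from as a side effect. Returns 0 or -1. */
                int remap_fds(struct fd_pair *pairs, size_t n)
                {
                    assert(pairs != NULL || n == 0);

                    /* First slot above every target. */
                    int floor = 0;
                    for (size_t i = 0; i < n; i++)
                        if (pairs[i].to >= floor)
                            floor = pairs[i].to + 1;

                    /* Step 1: move any source that sits inside the target range,
                     * so that a later dup2 cannot overwrite it. */
                    for (size_t i = 0; i < n; i++) {
                        if (pairs[i].from < floor) {
                            int moved = fcntl(pairs[i].from, F_DUPFD, floor);
                            if (moved < 0)
                                return -1;
                            pairs[i].from = moved;
                        }
                    }

                    /* Step 2: put everything in its final place. */
                    for (size_t i = 0; i < n; i++)
                        if (dup2(pairs[i].from, pairs[i].to) < 0)
                            return -1;

                    /* Step 3: drop the sources and temporaries we know about.
                     * Where available, closefrom(floor) would close these and
                     * every other descriptor at or above floor in one call. */
                    for (size_t i = 0; i < n; i++)
                        close(pairs[i].from);

                    return 0;
                }
                ```

                Swapping two descriptors is the case that shows why step 1 is needed: without the temporary copies, the first dup2 would destroy the source of the second.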

                Have a spawn() syscall and also keep execve(). You spawn a wrapper process that will do all the kernel state manipulation and then execve() the final target.

                I think the thing missing from this is an madvise (or similar) flag to preserve some bits of the memory map on exec. I would love to be able to create a shared mapping by mapping anonymous memory in the parent, doing something like mshare(base, length) and having that range preserved across execve.

                That said, I’m not a huge fan of execve as a kernel API. Requiring the kernel to understand ELF and map segments of a binary feels like a poor separation of concerns. I’d love to move all of this into userspace. This would require extending procctl to set the other bits of the ABI (signal frame layout, system call table to use) and an equivalent of closefrom for virtual memory: something to unmap everything that wasn’t explicitly set as preserved in the child context. You might want to be able to set some mappings as preserved (or, ideally, created) only within the child context, so that you didn’t need to map things and then unmap them after vfork returned and so that you could map some things in addresses that are already in use. With this, you could then set up the newly-created child’s stack, load its binary (and the run-time linker if it’s dynamically linked), and then drop mappings that the parent had and run from there.

                1. 1

                  That said, I’m not a huge fan of execve as a kernel API

                  Agree with you on this, but I think you have to lose a bunch of weird Unix features like suid and sgid executables and setcap, in which the kernel taking part is effectively mandatory?

                  I’m open to arguments that suid/sgid and friends are not things we want to keep in future anyway.

                  Other than that it would be cute to do all the loading of the new program into memory in user space instead.

              2. 1

                The first (which is the direction VMS / NT went) is conceptually cleaner. You have a kernel API that creates a new empty process and returns a capability to it (which Windows calls a handle). You can use this capability to do things like map files or anonymous memory objects into the process, write initial state into that memory (e.g. stack contents), inject threads, and so on. The big disadvantage of this is that it requires remote versions of all of your APIs

                This is what I’ve built for Linux in https://github.com/catern/rsyscall

                1. 3

                  I don’t see the kernel module. Are you just executing the syscall via an RPC in the child? This doesn’t address the problem, because you want to be able to do this in a completely empty process (for bootstrapping the first bit of code and initial stack) and you need to be able to do it with a different set of privileges. The main context in which the NT-style interfaces are useful today is sandboxing, where a more-privileged process is acting on behalf of the process that wants to do the system call. You can kind-of do this with *NIX APIs (in fact, I have) by using the ability to send file descriptors over UNIX domain sockets, but it’s not portable. In post-5.13 kernels, Linux has a mechanism in seccomp-bpf that lets you intercept a syscall that returns a file descriptor from another process, handle it, and provide the returned file descriptor, but it’s incredibly difficult to do securely (avoiding time-of-check-to-time-of-use errors requires a copy, which requires multiple domain transitions and ends up being slower with kernel help than my pure-userspace code, it’s only useful if you want to run unmodified programs under seccomp-bpf).

                  Even with all of this hoop-jumping you can’t, for example, create a read-write mapping of a 4 KiB window into a file that the unprivileged process shouldn’t have access to. You have to pass it the file descriptor with read-write permissions and then do mmap in the unprivileged process. At this point it can map any location and size within the file. In contrast, Win32’s MapViewOfFile3 lets you do exactly this, as long as you have a handle to the unprivileged process that you can pass to the second argument. Similarly, VirtualAlloc2 allows you to map anonymous memory into a process’ address space so that you can create the initial mapping for the first thread’s stack.
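
                  For readers who have not seen the file-descriptor passing mentioned above, a minimal sketch using SCM_RIGHTS over a UNIX domain socket, as found on Linux and the BSDs. The function names send_fd and recv_fd are hypothetical, and error handling is reduced to a single -1.

                  ```c
                  #define _DEFAULT_SOURCE
                  #include <assert.h>
                  #include <string.h>
                  #include <sys/socket.h>
                  #include <sys/types.h>
                  #include <sys/uio.h>
                  #include <unistd.h>

                  /* Send `fd` over the UNIX domain socket `sock`. Returns 0 or -1. */
                  int send_fd(int sock, int fd)
                  {
                      assert(fd >= 0);

                      char byte = 0;   /* at least one byte of real data must be sent */
                      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
                      union {
                          struct cmsghdr align;   /* forces correct alignment */
                          char buf[CMSG_SPACE(sizeof(int))];
                      } u;
                      memset(&u, 0, sizeof u);

                      struct msghdr msg;
                      memset(&msg, 0, sizeof msg);
                      msg.msg_iov = &iov;
                      msg.msg_iovlen = 1;
                      msg.msg_control = u.buf;
                      msg.msg_controllen = sizeof u.buf;

                      struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
                      c->cmsg_level = SOL_SOCKET;
                      c->cmsg_type = SCM_RIGHTS;
                      c->cmsg_len = CMSG_LEN(sizeof(int));
                      memcpy(CMSG_DATA(c), &fd, sizeof(int));

                      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
                  }

                  /* Receive one descriptor from `sock`. Returns the new fd or -1. */
                  int recv_fd(int sock)
                  {
                      char byte;
                      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
                      union {
                          struct cmsghdr align;
                          char buf[CMSG_SPACE(sizeof(int))];
                      } u;

                      struct msghdr msg;
                      memset(&msg, 0, sizeof msg);
                      msg.msg_iov = &iov;
                      msg.msg_iovlen = 1;
                      msg.msg_control = u.buf;
                      msg.msg_controllen = sizeof u.buf;

                      if (recvmsg(sock, &msg, 0) != 1)
                          return -1;

                      struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
                      if (c == NULL || c->cmsg_level != SOL_SOCKET ||
                          c->cmsg_type != SCM_RIGHTS ||
                          c->cmsg_len != CMSG_LEN(sizeof(int)))
                          return -1;

                      int fd;
                      memcpy(&fd, CMSG_DATA(c), sizeof fd);
                      return fd;
                  }
                  ```

                  The receiver ends up holding the whole descriptor with whatever access mode it was opened with, so it can mmap any offset and length of the file. That is the limitation described above.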

                  1. 3

                    Are you just executing the syscall via an RPC in the child?

                    Yes. That’s what a kernel module would do too, effectively (send a syscall to a process and wait for that process to execute it, possibly yielding to it). That’s the only viable way to add this to a Unix kernel - the kernel just isn’t set up for the alternative (modifying a process structure without running kernel code in that process’s context).

                    This doesn’t address the problem, because you want to be able to do this in a completely empty process (for bootstrapping the first bit of code and initial stack)

                    Not necessarily. What you can do is create a process which starts out sharing everything with its parent, and then gradually unshare things as you set them up. That’s equivalent; note that even on NT the new process is not “completely empty”, it inherits many things from the parent process like security contexts.

                    For initial address space set up, there’s exec. I agree it would be neat if there was a Unix API that allowed you to switch between and create fresh address spaces other than with exec; then you wouldn’t need exec, you could do it in userspace. But exec is the API that Unix has for switching and creating fresh address spaces - something more flexible isn’t absolutely necessary.

                    Even with all of this hoop-jumping you can’t, for example, create a read-write mapping of a 4 KiB window into a file that the unprivileged process shouldn’t have access to. You have to pass it the file descriptor with read-write permissions and then do mmap in the unprivileged process.

                    Sure, of course. The Unix API is based around files. If you want to give a process access to a piece of memory, you have to give it access to that file. There’s no ability to give it access to only part of a file - that’s a different feature, and if we want that, it should be an orthogonal feature. Does the Win32 API let the process then send that mapping on to child processes or other processes? If yes, then it’s just a feature that Unix lacks, since Unix doesn’t have only-part-of-file capability file descriptors. If no, then it’s not a good feature - delegation is fundamental for allowing abstraction, and this is undelegatable.

              3. 14

                This isn’t a critique of fork() from Microsoft – it’s a critique of fork() from four operating systems researchers, one of whom is affiliated with Microsoft. Would your comment be different if the submitter used a different URL to the paper, like this one (ETH is a university in Switzerland where the fourth author works): https://people.inf.ethz.ch/troscoe/pubs/hotos_fork.pdf

                The paper discusses how fork came to be, providing insight into what the small community thought was good at the time.

                1. 7

                  It’s also a style that differs wildly from a lot of real world Linux/modern *nix usage (including most cases that are meaningfully sensitive to process spawning overhead). The authors do acknowledge (in section 3) that this was a reasonable design for its original use case in its original context. But that was a context with different hardware, different available libraries/other os abstractions, different demands from user level programs, etc.

                  But this also isn’t otherwise a lazy internet rant; it’s a pretty thorough exploration of the topic, including alternatives (they point out that rfork()/clone() solves some problems, but not others). I find it disappointing to see a pretty thoughtful paper dismissed with “bah, Microsoft.”

                  Which, fwiw, only even applies to one of the authors. Jonathan Appavoo was my advisor as a grad student and I later worked for Orran Krieger for a couple years; I can attest that they are not “Microsoft people.”

                  1. 0

                    You’re missing the point. fork() is a privilege. It can’t be used if you work in an environment where devs are drowning in technical debt that was shoveled downstream for decades, because your privilege was taken away. Don’t let them tell you that’s modern or that it’s good for us.

                    1. 9

                      fork() is a privilege.

                      If the paper is correct in arguing that fork() isn’t an inspired design, but merely a clever hack that was good for its time, then it follows that it’s not a privilege to be enshrined and defended against encroaching complexity, but merely a tool to be used when it made sense and dropped now that it doesn’t.

                      I think the last time I used fork() and the exec family was in a moderately complex C program in the early 2000s. Granted, this program was multithreaded, so it probably doesn’t live up to the platonic Unix ideal. But hey, we were trying to write a working program in a reasonable time (as a side project), not show true devotion to the Unix Way. And I’m pretty sure we had a few bugs related to process spawning. So I don’t miss fork() in the least; I’m happy to use a higher-level process spawning API that avoids all the footguns of fork() and exec and works smoothly on Windows.

                      See also: Free Your Technical Aesthetic from the 1970s

                  2. 5

                    I just want to note that this isn’t only a Microsoft “extinguish” thing. OS research in recent years is very Unix (even primarily Linux)-centric, as if we have forgotten that OSes other than Linux exist. Mothy addressed this at this year’s OSDI as well. I’m not saying that Microsoft didn’t have any sneaky intentions with this paper, but there is a valid critique here.
