It’s a shame that posix_spawn is such a horrible interface. On Windows, most of the APIs that manipulate a process have Ex variants that take a process HANDLE as an argument, so anything you can do to your own process, you can do to a newly created child. You can create a new process, set up whatever you need, and then start it running. On *NIX, you achieve something similar by using fork or vfork (or rfork / clone if you want more fine-grained control at the expense of portability). Code supplied by the parent then runs in the child to configure the environment (e.g. set resource limits, close or renumber file descriptors [all parent FDs are inherited by default], set sandboxing policies, map shared memory segments or files into the address space, and so on) before calling execve to start a new program, which is the first code to run that isn’t under the control of the process creating the task. In both models, the process creating the child has complete control over the process environment and can do anything that it wants to before any code outside its control runs.
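A minimal sketch of the fork-then-configure-then-execve pattern described above. The helper name run_configured and its parameters are made up for illustration; it assumes you want stdout redirected and a CPU-time limit applied, but any other setup call could go in the same place.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` with stdout redirected to `out_fd` and a
 * CPU-time limit of `cpu_seconds`. Returns the child's exit status, or -1. */
int run_configured(const char *path, char *const argv[], char *const envp[],
                   int out_fd, rlim_t cpu_seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: this is still code chosen by the parent, so any system
         * call can be used here, in any order. */
        struct rlimit lim = { cpu_seconds, cpu_seconds };
        if (setrlimit(RLIMIT_CPU, &lim) != 0)
            _exit(126);
        if (dup2(out_fd, STDOUT_FILENO) < 0)
            _exit(126);
        if (out_fd != STDOUT_FILENO)
            close(out_fd);

        execve(path, argv, envp);
        _exit(127); /* only reached if execve failed */
    }

    /* Parent: wait for the child and report how it exited. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Everything between fork and execve is ordinary C, which is the point: there is no separate vocabulary of "spawn attributes" to learn or to be limited by.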
In contrast, posix_spawn is a horrible compromise API. It was designed to be possible to implement entirely in userspace and so cannot, by definition, express anything that you can’t do with vfork and existing APIs. The article makes a dig at FreeBSD having a ‘slow’ implementation in userspace, but I wonder if that’s actually backed up by any data. The FreeBSD implementation of posix_spawn uses vfork. This is a very cheap system call, which creates the kernel state for a new process (e.g. a copy of the file descriptor table) but then carries on executing the parent’s code in the parent’s address space, with the parent itself suspended until the child calls execve or exits. System calls made at that point that update process state in the kernel do so on the new process. The attributes that you can pass to posix_spawn are fairly limited.
In a clean API for UNIX process creation, you’d pass it a vector of file descriptors and the file descriptor table would be initialised to contain only those. With posix_spawn, you have to implement this in terms of explicit close and duplicate file operations. Expressing this in the declarative style of posix_spawn is significantly more complex than doing so in code. If you have an in-kernel posix_spawn implementation then you can batch these manipulations up and avoid multiple system calls and acquiring the (uncontended) file descriptor lock multiple times, but that overhead is not likely to be very large. You are also quite limited in expressiveness. For example, I can’t find a single system that provides a closefrom file action for posix_spawn, yet that’s absolutely essential for securely and safely creating child processes (once you’ve initialised the file descriptors that the child should have, you need to close all of the others). Even on systems that don’t natively provide closefrom, you can close all file descriptors between the largest that you need and the max number (which may be dynamic - most systems at least give you a way to query the highest number that you have open: on Linux you can find this in /proc). You can’t do that with posix_spawn because it’s racy: another thread in the parent can open and close file descriptors while you’re setting up the file actions. With vfork, you start with a copy of the parent’s FD table and can then modify it without worrying about concurrency.
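For comparison, here is a sketch of a simple stdout redirection expressed as posix_spawn file actions. The helper name spawn_redirected is hypothetical; the posix_spawn_file_actions_* calls are the standard POSIX ones. Note what is missing: there is no step that closes every other descriptor.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: spawn `path` with stdout redirected to `out_fd`.
 * Returns the child's exit status, or -1 on failure. */
int spawn_redirected(const char *path, char *const argv[],
                     char *const envp[], int out_fd)
{
    posix_spawn_file_actions_t fa;
    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;

    /* Each descriptor change is queued as a separate declarative action. */
    int err = posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO);
    if (err == 0 && out_fd != STDOUT_FILENO)
        err = posix_spawn_file_actions_addclose(&fa, out_fd);
    /* Nothing here says "and close everything else": any descriptor that
     * another thread opens before the spawn happens is inherited by the
     * child unless it was opened with O_CLOEXEC. */

    pid_t pid;
    if (err == 0)
        err = posix_spawn(&pid, path, &fa, NULL, argv, envp);
    posix_spawn_file_actions_destroy(&fa);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Enumerating the open descriptors and queuing an addclose for each one does not help, because the list is built in the parent and can be stale by the time the actions run.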
This is made significantly worse by the fact that you often want to do things in a specific order. For example, if you’re doing any kind of privilege separation then you may want to open a file in read-only mode before changing the UID, or in Capsicum mode you may wish to create and bind a socket (oh, sorry, posix_spawn’s file_ops can open files but not sockets) before entering capability mode. With vfork, that’s trivial, but posix_spawn’s file operations are not ordered with respect to other things that modify process state.
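A sketch of the ordering case, using fork for simplicity. The helper name run_reading_as is invented for this example, and a real privilege drop would also need to change the group ID and supplementary groups before the setuid call; that is left out to keep the ordering visible.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `path` as user `uid` with stdin reading from
 * `input_path`, which is opened before the UID changes.
 * Returns the child's exit status, or -1 on failure. */
int run_reading_as(const char *input_path, uid_t uid, const char *path,
                   char *const argv[], char *const envp[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Step 1: open the file while the child still has the parent's
         * privileges. */
        int fd = open(input_path, O_RDONLY);
        if (fd < 0)
            _exit(126);
        if (dup2(fd, STDIN_FILENO) < 0)
            _exit(126);
        if (fd != STDIN_FILENO)
            close(fd);

        /* Step 2: only now drop to the target UID (group handling is
         * omitted in this sketch). */
        if (setuid(uid) != 0)
            _exit(126);

        execve(path, argv, envp);
        _exit(127); /* only reached if execve failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The order of the two steps is simply the order of the statements, so swapping them or adding a third step in between needs no support from the API.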