Many of these problems are at least partially solved on other UNIX systems like illumos. To run down the top-level list:
1. It’s easy for processes to leak
illumos has contracts (see contract(5), libcontract(3LIB), and 3CONTRACT), which were developed as part of the Service Management Facility (SMF). They are (yet another?) process grouping abstraction that allows SMF to track trees of processes, whether they daemonise or not. They allow for tracking or ignoring certain events (e.g., a fatal signal sent from outside the contract, a process within the contract that aborts and dumps core, or a particular process, or all processes, terminating in some way) and for doing certain kinds of automatic cleanup (e.g., terminate all processes in the contract when the process that owns the contract is terminated). These are managed by the kernel, so they are effectively inescapable even when not held correctly by a user process.
2. It’s impossible to prevent malicious process leaks
This is not really true for us either. Between contracts, and resource controls, and privileges, I expect one would be able to limit the malicious or accidental escape of processes from supervision or any run-away resource consumption caused by, say, a fork bomb.
3. Processes have global, reusable IDs
This is true on some level, but in practice I think that’s just part of UNIX and when you have solved 1-2 and 4 in other ways, it’s not actually that bad. If you want to kill everything in a contract you own, even without knowing the full list of pids, you can do that with ct_ctl_abandon(3CONTRACT) which takes a contract file descriptor. The termination action (which could be to tear down all of the processes) will take effect then.
4. Process exit is communicated through signals
This is not entirely true, in the sense that they are by default for classic UNIX applications – but they need not be. We have forkx(2), which has the FORK_NOSIGCHLD and FORK_WAITPID flags. These request that SIGCHLD is not posted for process termination, and that the classic UNIX wait(2) family of calls will not receive notification or reap children. You can use these from a library, and then manage your own waitid(2) or waitpid(3C) calls on the specific process IDs you are responsible for reaping. You can also use contracts to receive notifications of events about processes within the contract coming and going.
The mechanisms have different names but are quite similar on FreeBSD. The big problem on Linux is that it conflates threads and processes in the kernel: threads are processes that share an address space and a file descriptor table. On both Linux and FreeBSD, after vfork and before exec, you can mark a process to terminate when its parent exits, which avoids it leaking. On Linux, the conflation of processes and threads means that this causes the child process to exit as soon as the thread that created it in the parent exits, making the API completely useless.
From your description, the analogue of contracts on FreeBSD is reapers. Each process has a reaper: the process that’s responsible for killing it and which is notified when it exits. This is used for supervising process trees.
Every time I have to write Linux code, I am reminded of how much worse it is than pretty much every alternative.
It’s even weirder: Threads on Linux don’t even have to share a file descriptor table, and processes can share a file descriptor table and even their address space. Using clone3, you can make all kinds of fun Franken-tasks that live in the twilight zone between thread and process.
Sorry, when I say ‘threads’ and ‘processes’, I mean the things that are created with the POSIX APIs (fork, vfork, pthread_create). Plan 9 also has rfork, which allows sharing of various things between the parent and child. FreeBSD adopted this, and I think some other *NIXes did as well. Linux’s clone is somewhat more flexible.
Do any of those features break POSIX compatibility? Curious because I think POSIX is holding Unix-likes back.
I don’t believe they do. Process contracts in particular are designed with a core goal of not requiring anything but the OS-provided supervisor to know about them. Things like wait and process groups and sessions and all the classic UNIX stuff fit within a contract, which is then a more robust mechanism for managing a collection of processes that don’t need to know about that management.
Sounds great. Thanks for sharing, I want to check it out now.
The article should be titled LINUX Process API is Unreliable and Unsafe.
The article never mentions any BSD/Solaris/AIX or any other UNIX.
No, Linux also has better APIs for this than classic UNIX or POSIX - such as the aforementioned pidfd.
The problem really is that the lowest common denominator that is POSIX offers no safe, reliable API for processes, just like it doesn’t for many other things.
If you target only Linux, you can have a decent solution, but you’ve cut off some large part of your potential userbase. If you target only one of the BSDs or only Solaris, you’ve cut off the vast majority of people. If you target generic POSIX, you’re stuck with the unreliable and unsafe old APIs. If you target several different OSes individually, you can probably have decent semantics on each, but you’re multiplying implementation complexity.
To me, targeting Linux specifically seems like the least bad option out of those. But I’m biased because everything I run is Linux already.
Thank You for explanation. Regards.
To be fair, the problems it describes are true of classical POSIX systems; it’s just the solutions that are Linux-specific.
Ok, thanks.
If we just bumped pids to 64 bits, we could create a million processes a second without recycling ids for…
(2^63)/(1000000x60x60x24x365) ~= 292 thousand years. Obviously migration is a PITA, but this really seems like the ultimate solution.
I look forward to typing kill -9 9203847209374947.
Sortix does that on platforms that have 64-bit pointers. It requires some patches to software, as a lot of things assume pid_t can fit in an int, and as far as I understand there have not been large-scale stress tests with a high starting PID, so there might be further bad code that has not yet been caught because it uses explicit casts.
Collision probability of 2^64 is 50% at 2^32. You’d want 2^128 for 50% at 2^64
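The two estimates above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, a year is taken as 365 days as in the comment, and the birthday figure uses the usual approximation n ≈ sqrt(2 ln 2 · N), which only applies if IDs are drawn at random rather than handed out sequentially.

```python
import math

SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # same 365-day year as the estimate above


def years_until_exhausted(bits=63, pids_per_second=1_000_000):
    """Whole years before a sequential counter of `bits` bits runs out."""
    return 2**bits // (pids_per_second * SECONDS_PER_YEAR)


def draws_for_half_collision(bits):
    """Approximate number of random draws from a 2**bits space that gives
    a 50% chance of at least one collision (birthday approximation)."""
    return math.sqrt(2 * math.log(2) * 2**bits)


if __name__ == "__main__":
    print("years for 2^63 sequential ids:", years_until_exhausted())
    for bits in (64, 128):
        n = draws_for_half_collision(bits)
        print(f"random {bits}-bit ids: 50% collision near 2^{math.log2(n):.2f} draws")
```

So the sequential figure and the random-ID figure answer different questions: a counter never collides until it wraps, while randomly chosen IDs hit the birthday bound at roughly the square root of the space.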
DJB’s self-pipe trick solves the awkwardness of #4 – waiting for a process exit plus other events non-deterministically:
https://cr.yp.to/docs/selfpipe.html
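FWIW a shell is basically two alternating, non-overlapping event loops:
a select() loop on the input terminal FD for getting keystrokes (e.g. GNU readline)
the waitpid(-1) loop for running code, i.e. get the next process that exited
It never actually does both at the same time – it doesn’t wait for processes and stream input simultaneously, which is awkward without the self-pipe trick.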
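For anyone who hasn’t seen it, here is a rough sketch of the self-pipe trick in Python. The function name run_child_and_wait is made up for illustration, and it assumes a Unix system and that it is called from the main thread (Python only allows signal handlers to be installed there). A real shell would add the terminal FD to the same select() call; that is left as a comment so the sketch stays small.

```python
import os
import select
import signal


def run_child_and_wait(exit_code, timeout=5.0):
    """Fork a child that exits with exit_code and learn about its exit
    through a pipe that the SIGCHLD handler writes to."""
    r, w = os.pipe()
    os.set_blocking(r, False)
    os.set_blocking(w, False)  # the handler must never block

    def on_sigchld(signum, frame):
        try:
            os.write(w, b"\0")  # wake up select()
        except BlockingIOError:
            pass  # pipe already full: a wake-up is pending anyway

    old_handler = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit straight away

        # One loop for everything: add sys.stdin or sockets to this list
        # to wait for input and child exits at the same time.
        ready, _, _ = select.select([r], [], [], timeout)
        if not ready:
            raise TimeoutError("no SIGCHLD within timeout")
        os.read(r, 4096)  # drain the wake-up bytes

        _, status = os.waitpid(pid, 0)
        if os.WIFEXITED(status):
            return os.WEXITSTATUS(status)
        return -os.WTERMSIG(status)
    finally:
        signal.signal(signal.SIGCHLD, old_handler)
        os.close(r)
        os.close(w)


if __name__ == "__main__":
    print("child exited with", run_child_and_wait(7))
```

Because the handler only writes a byte, it doesn’t matter whether the child exits before or after select() starts: the byte stays in the pipe until it is read.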
Regarding adversarial processes, yes you need something like Linux cgroups to solve that problem. In traditional Unix, a process that can run arbitrary code can always escape your attempts to kill it.
IIRC you can start a Linux process in a freezer cgroup, and stop everything in the cgroup. I recall reading the docs for an HPC platform that does that, and I’m sure Docker does it in some way too.
I’d be interested in where supervise is used in production … it seems like there is a bigger story behind this article!
Cgroups might be able to do the trick somehow via the freezer controller, but I think the cleaner and more idiomatic way would be a PID namespace.
Hmm, when I run sleep 5 & in zsh, it prints out
[1] + done sleep 5
on its own 5 seconds later, with no input from me. Same happens if I say /bin/sleep, so it’s not just builtin magic.
Yes great point! I over-generalized about shells, I should have said bash/dash/mksh. In fact I just looked up the issue here:
https://github.com/oilshell/oil/issues/1093
I noted there that zsh gives you the notification immediately. I also just tested fish and it does the same thing.
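So now my question is if they use the self-pipe trick to make it less awkward – I would guess almost certainly not. Probably:
in the select() loop, you just handle signals like SIGCHLD too. We actually have some of that in Oil
in the waitpid(-1) loop, you still don’t need to handle file stream input. This would definitely be more awkward, and is a limitation of shell runtimes I’m thinking about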
Does MacOS do stuff to help with cleaning up orphan processes that, for example, Linux won’t do? Over the years I’ve used multiple tools where I would end up with process leaks but only on my Linux machines. Meanwhile Mac-using coworkers would not have issues (this would be of the “file change means we should kill process and restart it but for some reason on Linux I’d end up with multiple process restarts”).
I think so, but I’m not at all qualified to answer. I know that a lot of the low level process stuff is Mach based, and I’m pretty sure there are hierarchies where the death of a parent also kills all the children. (There is or used to be a ‘loginwindow’ process that handles GUI login, and if that process ever crashed, your whole GUI session went kablooie.) And I believe the messages relating to child processes are sent with Mach messages, not signals.
Other than procdesc (predating Linux pidfd by a long time, but still not implementing the synchronous wait pdwait4, but seems that’s not the most wanted functionality), which was subtly mentioned at the very end of the article, FreeBSD also has a very advanced reaper API under procctl(2).
I have just played around with it and made a demo. Basically every child process of a reaper is automatically a proper kernel-tracked process grouping out of which there’s no escape :) i.e. a reaper can supervise multiple groups independently, and e.g. reliably kill all descendants of one service when the service dies. Don’t worry about the PIDs being used as identifiers there; they’re only compared against the p_reapsubtree field in the kernel which is reaper-specific, so nothing would happen if some random process gets to reuse one of those child PIDs. (If somehow the next child we spawn does, there would be trouble?… but as we’re the reaper, we can avoid reaping until we’re done with the group, which will be 100% guaranteed to make trouble impossible.)
There’s no pdwait4, but kqueue can monitor process descriptors. This doesn’t give you the usage structure, but it lets you wait for any of a set of process descriptors to terminate.
Yeah, that’s what I meant by “seems that’s not the most wanted functionality”. (Also poll and select work too.) The sync wait is mostly useful for reapers, but procdesc is more for regular applications.
Also from procctl: PROC_PDEATHSIG_CTL lets you deliver a signal when a parent exits. If you set this between fork and exec then you control the signal that’s sent to your child when you exit. You can set this to a signal that causes termination and it’s then inherited by the rest of the process tree. A malicious child can reset it, but sometimes that’s desirable behaviour (a child has to explicitly opt into outliving its parent, if you want to disallow that then the reaper APIs are the right choice).
Huh, apparently the Linux version of it (mentioned in the article) does not get inherited (fork manpage says it’s reset) hence the complaints. Weird, why don’t both systems offer a flag to make it inherited or not…
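For the Linux side of this, a rough sketch of the parent-death signal via prctl(2) called through ctypes. This is Linux-only and is not the FreeBSD procctl API discussed above. The function name child_dies_with_parent is made up for illustration, PR_SET_PDEATHSIG = 1 is the value from <linux/prctl.h>, and the 0.5 second sleep is just a crude way of giving the grandchild time to arm the signal.

```python
import ctypes
import os
import select
import signal
import time

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def child_dies_with_parent(timeout=5.0):
    """Return True if a grandchild armed with PR_SET_PDEATHSIG is killed
    once its parent (our child) exits. Linux only."""
    libc = ctypes.CDLL(None, use_errno=True)
    r, w = os.pipe()
    mid = os.fork()
    if mid == 0:
        # Intermediate process: the "parent" whose death is being watched.
        os.close(r)
        mid_pid = os.getpid()
        if os.fork() == 0:
            # Grandchild: ask for SIGKILL when the parent exits.
            zero = ctypes.c_ulong(0)
            sig = ctypes.c_ulong(int(signal.SIGKILL))
            if libc.prctl(PR_SET_PDEATHSIG, sig, zero, zero, zero) != 0:
                os._exit(1)
            if os.getppid() != mid_pid:
                os._exit(1)  # parent already gone before the signal was armed
            os.write(w, b"1")  # report "armed"; keep w open while alive
            time.sleep(60)  # would outlive the parent without the death signal
            os._exit(0)
        os.close(w)
        time.sleep(0.5)  # crude: give the grandchild time to arm
        os._exit(0)

    # Top-level process: EOF on the pipe means the grandchild is gone.
    os.close(w)
    os.waitpid(mid, 0)
    data = b""
    try:
        while True:
            ready, _, _ = select.select([r], [], [], timeout)
            if not ready:
                return False  # grandchild still alive: it leaked
            chunk = os.read(r, 16)
            if not chunk:
                return data == b"1"  # armed, then died with its parent
            data += chunk
    finally:
        os.close(r)


if __name__ == "__main__":
    print("grandchild died with its parent:", child_dies_with_parent())
```

The getppid() check after prctl() covers the race where the parent exits before the signal is armed. The setting here is made in the grandchild itself, so nothing in this sketch depends on whether it is inherited across fork.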
As a heads up, the link goes to HTTP though the HTTPS version is available.