POSIX signals are a gift that keeps on giving.
There has got to be a decent alternative. I hate to ask it, but what does Windows do?
Windows uses structured exception handling (SEH) or vectored exception handling (VEH) for this. SEH was covered by a bunch of patents owned by Borland and so *NIX systems didn’t copy it (they expired about 10 years ago, so probably could). There are three bits to SEH:
First, there are some C extensions so that languages that don’t have native support for exceptions can handle them. You use __try / __except / __finally in C to describe SEH blocks.
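As a minimal sketch of how those keywords compose (MSVC-specific C; the filter expression after __except decides whether its handler runs, and the __finally funclet runs during unwinding either way):

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        int *volatile p = NULL;
        __try {
            __try {
                *p = 42;    /* access violation: the kernel reports this as an
                               SEH exception, where *NIX would raise SIGSEGV */
            } __finally {
                puts("cleanup funclet runs while the stack unwinds");
            }
        } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                        ? EXCEPTION_EXECUTE_HANDLER
                        : EXCEPTION_CONTINUE_SEARCH) {
            puts("caught the access violation");
        }
        return 0;
    }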
Next, there’s an ABI and unwind library. Unlike the Itanium unwinder, SEH was designed to avoid needing any heap allocations. When an SEH exception is raised, the exception object is allocated on the stack and then passed down to the unwinder. The unwinder then walks up the stack[1] and, for each frame with cleanup (or catch) behaviour, it calls a funclet: a function that runs with a pointer to the stack frame of the function it is cleaning up. If the funclet is a catch, it may move the exception object into the space reserved for it in the catching stack frame.
Finally, there’s support in the kernel to push a call into the library routine that starts this process on top of the stack. For anything that would be a signal on *NIX, Windows will throw an SEH exception (if you want to see some terrifying code, take a look at what the Windows Terminal does to handle ^C).
SEH is great for a lot of things, but some things are very hard. For example, libsigsegv on *NIX has some abstractions that let you catch segfaults and lazily provide a mapping, so that you can do things like userspace distributed shared memory and other fun hacks. This is not possible with SEH, because code gets to handle exceptions from the leaf stack frames up, whereas you want to handle these things with some global code. VEH adds a more signal-like model, where you have a stack of handlers that are registered globally. These run before SEH handlers and have the option of either resuming execution or continuing the search for another handler. As far as I am aware (I’ve not looked at this bit of the code[2]), these are a pure userspace abstraction: the kernel just invokes the userspace unwinder, which checks for VEH handlers and then SEH ones.
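The registration API is small. A sketch of the shape of such a handler, using the real Win32 entry points (the lazy-mapping step itself is only gestured at in a comment):

    #include <windows.h>

    static LONG CALLBACK on_fault(EXCEPTION_POINTERS *info) {
        /* Runs before any frame-based SEH handler is consulted. */
        if (info->ExceptionRecord->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
            /* A DSM-style system would map the missing page here, then... */
            return EXCEPTION_CONTINUE_EXECUTION;   /* retry the faulting insn */
        }
        return EXCEPTION_CONTINUE_SEARCH;          /* fall through to SEH */
    }

    int main(void) {
        /* Non-zero first argument: call this handler ahead of other VEHs. */
        PVOID cookie = AddVectoredExceptionHandler(1, on_fault);
        /* ... run the program ... */
        RemoveVectoredExceptionHandler(cookie);
        return 0;
    }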
The C++ exception mechanism is layered on top of SEH (the SEH handler that you register is provided by the C++ runtime and does exception type checks before deciding to continue unwinding or resume from a handler). For Objective-C, we delegated to this and made Objective-C exceptions look like C++ ones. This involved some fun on 64-bit Windows. The C++ exception is supposed to embed the set of things that it matches. In C++, that’s a static list, but in Objective-C, reflection APIs may change it dynamically. That’s fine, because we can construct it in the throw function. Unfortunately, the 64-bit ABI decided to save space and so expects all of the types to be 32-bit displacements from the library base. We ended up using alloca to allocate them all on the stack and then storing displacements from an arbitrary point on our stack frame.
The nice thing about SEH is that it gives a unified mechanism for handling kernel and userspace errors. It requires a bit of coupling between userspace and the kernel (the kernel expects userspace to have a valid stack pointer at all times and has to know the address of the SEH entry point - on *NIX the signal trampoline is injected by the kernel, via a VDSO or some more ad-hoc mechanism).
The SEH unwinder is generally nicer than the Itanium / DWARF model because it doesn’t need any heap allocations. The Itanium spec recommends that the C++ runtime has a small pool of buffers to use for allocating out-of-memory exceptions.
I don’t believe SEH can handle out-of-stack cases. I think the Windows ABI avoids this by doing a one-page-displacement stack probe on function entry and throwing an out-of-stack exception if it’s not possible to move the guard page down, so you always have at least one page for running exception handlers. This is probably cheaper than a signal stack, but means that you must be careful to limit stack usage in cleanups.
[1] On 32-bit x86, this used an on-stack linked list; on 64-bit x86 and Arm, it uses a table-driven unwinder similar to the DWARF model. The x86 model is faster at throwing exceptions; the table-driven model is faster if you do not throw. Since exceptions are supposed to happen only in exceptional situations, it makes more sense to optimise for the case where nothing is thrown.
[2] The bits of the code I did look at had some very interesting comments. There’s actually machinery hidden in the Windows userspace runtime bits for a full suite of Lisp-style exceptions (unwinding, resumable, and restartable). I’ve never seen anything use more than a tiny fraction of this.
There’s a very deep dive into the x86-64 EH support in this series of blog posts, in case you’re interested: http://www.nynaeve.net/?p=99
How does that work when the exception object is a derived class of the catch declaration?
I have no idea. That may involve promoting to the heap and is handled by the C++ part of the runtime. The generic SEH mechanism doesn’t know anything about subtyping.
It looks like this has been fixed and now Popen has a parameter restore_signals that defaults to true.

I’ve run into similar bugs in the past, and tracking down who’s responsible for fixing them can be maddening. At one point the default dockerd ulimit for open files was set to some crazy high number, probably for people running Cassandra on Docker or something. My team was testing some stuff with the salt configuration management tool, which used the Python subprocess module and IIRC specified that the child process shouldn’t inherit open file descriptors.

For whatever reason, at that time Python chose to implement this feature by doing something like:

    import os
    import resource

    # Walk every fd up to the soft RLIMIT_NOFILE limit, one at a time.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for i in range(soft):
        try:
            f = os.fdopen(i)   # builds a whole Python file object around the fd
            f.close()
        except OSError:
            pass
This meant Python was looping over something like a million file descriptors, and in this particular case the try/except handling and file-object construction added an order of magnitude more overhead than the simple close() syscall alone.

It wasn’t exactly anyone’s fault, but due to weird interactions between dockerd’s default upstart file, salt, and Python subprocesses, the result was that salt would spin for tens of seconds on startup just burning CPU.
Python ought to be using closefrom() to do this job. It’s reasonably portable - Linux, Solaris, most BSDs except Darwin - and implementing a looping portability wrapper in C would be a lot faster than the existing Python implementation.

It’s also in libbsd, for platforms that don’t have a native one.
And gnulib, ditto :-)
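For platforms with none of those, the looping wrapper the grandparent comment suggests might look like this (closefrom_fallback is a made-up name; a real wrapper would dispatch to the native closefrom() or Linux’s close_range() when available):

    #include <unistd.h>

    /* Fallback: close every descriptor >= lowfd by brute force. */
    static void closefrom_fallback(int lowfd) {
        long maxfd = sysconf(_SC_OPEN_MAX);
        if (maxfd < 0)
            maxfd = 1024;            /* limit unknown: pick a conservative guess */
        for (long fd = lowfd; fd < maxfd; fd++)
            (void)close((int)fd);    /* EBADF for never-opened fds is harmless */
    }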
[Original post author here] It’s worth noting that this post is very old, and the specific bug involved has since been fixed; as of Python 3.2, subprocess.Popen has a restore_signals flag (defaulting to True) that resets all the signals the Python interpreter has ignored back to SIG_DFL.

The general problem/gotcha of signal disposition being inherited and of the potential for mismatch in expectations, though, is very much still with us.
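The gotcha is easy to reproduce in C: an ignored disposition survives both fork() and exec(), so a parent that ignores SIGINT silently deafens every descendant unless something puts the default back, which is roughly what restore_signals arranges:

    #include <signal.h>
    #include <unistd.h>

    int main(void) {
        signal(SIGINT, SIG_IGN);     /* parent decides to ignore ^C */
        if (fork() == 0) {
            /* Without this reset, the ignored disposition would survive the
               exec below and the sleep(1) command would be immune to ^C. */
            signal(SIGINT, SIG_DFL);
            execlp("sleep", "sleep", "10", (char *)NULL);
            _exit(127);              /* exec failed */
        }
        return 0;
    }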
Strange signal dispositions are one of the classic blunders that setuid programs need to be aware of, much like strange file descriptors (especially those less than 3), strange rlimits, strange environment variables, etc.
I didn’t realize that signal dispositions were inherited. It reminds me a lot of the fd limit issues. The correct “protocol” is that every single program should:

1. Bump its FD limit to the hard limit (see the C sketch below).
2. Not use select().
3. Before execing another program, restore the limit to 1024.

Of course you can skip all of these steps if you know that your program will never use 1024 FDs. (But still please follow step 2 just in case someone execs you without lowering the soft limit.)
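A sketch of steps 1 and 3 (the helper names here are mine, not from any standard library):

    #include <sys/resource.h>

    /* Step 1: at startup, raise the soft fd limit to the hard limit. */
    static void bump_fd_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
            rl.rlim_cur = rl.rlim_max;
            (void)setrlimit(RLIMIT_NOFILE, &rl);
        }
    }

    /* Step 3: before exec, drop back to the conventional 1024 so a child
       that still uses select() sees the limit it expects. */
    static void restore_fd_limit(void) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
            if (rl.rlim_cur > 1024)
                rl.rlim_cur = 1024;
            (void)setrlimit(RLIMIT_NOFILE, &rl);
        }
    }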
So I guess restoring the default signal mask should be added to this dance. Sounds like we need to have a template POSIX program that does all of these dances to avoid legacy footguns and play well with other programs.
https://0pointer.net/blog/file-descriptor-limits.html
You can finish that sentence with almost any fact about signals and most readers will agree that they didn’t know it either. Of the rest, the majority will agree that it’s a terrible design decision and, sadly, that there’s real-world library code that depends on it and so it’s hard to change.
I ran into that same issue really messing with performance.
https://github.com/saltstack/salt/issues/18569
It doesn’t seem like it was ever really fixed because individually neither salt nor dockerd felt like they were doing anything wrong.
Yikes, that is subtle! I’m not sure whether Python’s behaviour is necessarily wrong, but it certainly is no fun having to debug something like this.
I say Python is probably wrong here; nobody would expect spawning a child process from Python to work differently from spawning a child process any other way.
But it’s not as obvious as I’d want it to be… you could definitely argue that Python should follow the UNIX convention of making child processes inherit all kinds of stuff from its parent, including ignored signals, and if Python changed Popen to make it not inherit ignored signals, that too could cause subtle bugs.
The link at the beginning is dead, so here’s an alternate link: https://web.archive.org/web/20111108133122/https://ebroder.net/2010/01/25/complex-systems-and-simple-failures/