1. 19
  2. 9

    Signals are the most exciting part of the POSIX world. This post missed out my favourite bit though: signal handlers also receive a ucontext_t, which contains a complete register dump of the thread that was interrupted by the signal. You can use this for all sorts of exciting things. I’ve written code that handles a signal on system calls where the kernel does not allow the syscall, extracts the arguments, and then does an RPC to a more privileged process to ask it to perform the system call based on a dynamic policy, then injects the success / failure return value back into the signal’s ucontext_t. Because this is C++ and not C, it’s all driven by template instantiation that figures out the registers to pull out based on the argument types of the function that does the RPC. I imagine you could probably do the same thing in Rust.
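
    A minimal sketch of how a handler gets hold of that ucontext_t, assuming a POSIX system with SA_SIGINFO. The names on_usr1 and demo_ucontext are made up for illustration. The sketch deliberately does not read uc_mcontext, because the register layout inside it differs per OS and architecture:

    ```c
    #define _DEFAULT_SOURCE 1 /* assumption: glibc-style feature macro */
    #include <assert.h>
    #include <signal.h>
    #include <string.h>
    #include <ucontext.h>

    static volatile sig_atomic_t got_context = 0;

    /* SA_SIGINFO selects this three-argument form; ctx points at a ucontext_t. */
    static void on_usr1(int signo, siginfo_t *info, void *ctx)
    {
        ucontext_t *uc = (ucontext_t *)ctx;
        (void)signo;
        (void)info;
        /* uc->uc_mcontext holds the interrupted thread's saved registers. */
        if (uc != NULL)
            got_context = 1;
    }

    /* Installs the handler, raises SIGUSR1, returns 1 if a context arrived. */
    int demo_ucontext(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_usr1;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        int rc = sigaction(SIGUSR1, &sa, NULL);
        assert(rc == 0);
        (void)rc;
        raise(SIGUSR1); /* the handler has run by the time raise() returns */
        return got_context;
    }
    ```

    Anything that edits the saved registers would go where the comment about uc_mcontext sits, and would need per-platform code.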

    1. 4

      Wow, I was unaware the ucontext_t was usefully mutable… that’s a little terrifying.

      1. 2

        For a joke I once wrote a signal handler that increments the instruction pointer on segfault and illegal instruction signals / SIGSEGV and SIGILL. https://github.com/RichardBarrell/snippets/blob/master/no_crash_kthxbai.c It sometimes even works.

        You can also implement less-silly things with SIGSEGV handlers, like I think at least one clustering system might have used SIGSEGV handlers to present the illusion of shared memory across a network in userspace.

        1. 2

          I think at least one clustering system might have used SIGSEGV handlers to present the illusion of shared memory across a network in userspace

          The TreadMarks system (pdf) from Rice University in the late ‘90s did that (see p. 15 for a description). At least one commercial product was based on that scheme, Intel’s Cluster OpenMP (I think now discontinued). Here’s an old Intel whitepaper (also pdf), with a description of the mechanism on pp. 5-6.

          1. 2

            One thing I don’t understand about these schemes: how did they handle memory pressure? If I have a page in the application that’s just a read-only mirror of a page from another machine in the cluster, it should be okay to drop it at any time, just like how it’s okay to drop any unmodified page of memory containing filesystem cache. You just need the page to generate another segv if it’s accessed again after being dropped. But from the kernel’s point of view, it’s application data that can’t be safely discarded at all.

            It doesn’t look like there’s any way to tell Linux that pages can be unmapped safely with madvise(). All the options I see in the man page (like MADV_DONTNEED) give you zero filled pages rather than segfaults if accessed again after being dropped.
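
            A quick way to see the zero-fill behaviour, assuming Linux semantics for MADV_DONTNEED on a private anonymous mapping (other systems may treat the call as a hint and keep the data). The function name is made up for illustration:

            ```c
            #define _DEFAULT_SOURCE 1 /* assumption: glibc; exposes MAP_ANONYMOUS and madvise */
            #include <assert.h>
            #include <stddef.h>
            #include <sys/mman.h>
            #include <unistd.h>

            /* Writes 42 into a fresh page, discards the page, and returns what a
             * later read sees. Returns -1 if the setup itself fails. */
            int byte_after_dontneed(void)
            {
                long page = sysconf(_SC_PAGESIZE);
                assert(page > 0);
                volatile unsigned char *p = mmap(NULL, (size_t)page,
                                                 PROT_READ | PROT_WRITE,
                                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                    return -1;
                p[0] = 42;
                if (madvise((void *)p, (size_t)page, MADV_DONTNEED) != 0) {
                    munmap((void *)p, (size_t)page);
                    return -1;
                }
                int value = p[0]; /* no SIGSEGV here: the kernel supplies a zero page */
                munmap((void *)p, (size_t)page);
                return value;
            }
            ```

            On Linux I would expect this to return 0 rather than 42, and never to fault, which is exactly why it is no use for this purpose.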

            1. 2

              You can mmap anonymous memory over a page to discard it forcibly. The portable idiom is to map anonymous memory with PROT_NONE over the region. You’ll then get a trap. For a distributed shared memory thing, you first map the page PROT_READ, then ensure any modifications are written to other machine(s), then map PROT_NONE to discard it. This gives you enough to implement MESI:

              • Initially pages are PROT_NONE (Invalid state).
              • When you take a SEGV, fetch the data and insert it with PROT_READ[1] (shared state).
              • When you take another SEGV, broadcast invalidate the page on all other machines and mark it PROT_READ | PROT_WRITE locally (briefly exclusive state, modified one instruction after the signal handler returns).
              • After a while (or after you receive messages from other nodes trying to put it into shared state), mark it PROT_READ and send the data to other requesters (back to shared state).
              • If you are in shared state and receive an invalidate, mark it PROT_NONE (ideally mapping a new anonymous page PROT_NONE to free up physical memory) and transition back to invalid state.
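
              The first transition in the list above (Invalid to Shared) can be sketched roughly as follows. This assumes Linux-like behaviour where mprotect can be called from a SIGSEGV handler (it is not on the POSIX async-signal-safe list) and fills the page with a constant instead of fetching it over a network. All names are made up for illustration:

              ```c
              #define _DEFAULT_SOURCE 1 /* assumption: glibc; exposes MAP_ANONYMOUS */
              #include <assert.h>
              #include <signal.h>
              #include <stdint.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              static unsigned char *region; /* one page standing in for a remote page */
              static size_t page_size;
              static volatile sig_atomic_t faults;

              static void on_segv(int signo, siginfo_t *info, void *ctx)
              {
                  uintptr_t addr = (uintptr_t)info->si_addr;
                  uintptr_t base = (uintptr_t)region;
                  (void)signo;
                  (void)ctx;
                  if (addr < base || addr >= base + page_size)
                      _exit(1); /* not our page: a genuine crash */
                  faults++;
                  /* Stand-in for "fetch the data": open the page, fill it, then
                   * downgrade to read-only (Shared). Not atomic; see footnote [1]. */
                  mprotect(region, page_size, PROT_READ | PROT_WRITE);
                  for (size_t i = 0; i < page_size; i++)
                      region[i] = 7;
                  mprotect(region, page_size, PROT_READ);
              }

              /* Returns the byte read through the fault (7), or -1 on failure. */
              int demo_fault_in(void)
              {
                  struct sigaction sa, old;
                  long ps = sysconf(_SC_PAGESIZE);
                  assert(ps > 0);
                  page_size = (size_t)ps;
                  region = mmap(NULL, page_size, PROT_NONE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (region == MAP_FAILED)
                      return -1;
                  faults = 0;
                  memset(&sa, 0, sizeof sa);
                  sa.sa_sigaction = on_segv;
                  sa.sa_flags = SA_SIGINFO;
                  sigemptyset(&sa.sa_mask);
                  if (sigaction(SIGSEGV, &sa, &old) != 0)
                      return -1;
                  /* Invalid -> Shared: this load traps once and is then retried. */
                  int value = *(volatile unsigned char *)region;
                  sigaction(SIGSEGV, &old, NULL);
                  munmap(region, page_size);
                  return faults == 1 ? value : -1;
              }
              ```

              The write-fault transition would need the handler to tell a read fault from a write fault, which this sketch does not attempt.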

              False sharing is always a problem with this kind of system. If two nodes are writing to objects that happen to be on the same page then they will keep flipping the entire page between modified and invalid state and become really slow. Most multithreaded software tries to avoid false sharing at the cache-line granularity but avoiding it at the page granularity is much harder.

              For general-purpose use, these things also suffer from reliability issues. If you’re running a 512-node cluster with a distributed shared memory system, you are 512 times more likely to encounter a catastrophic failure than if you have one node of that cluster. Most modern distributed systems try to explicitly reason about failure and anything that makes sharing transparent makes this impossible.

              [1] To do this atomically, you need to create an anonymous memory object (or a[n unlinked] file if your OS doesn’t support anonymous memory objects), mmap it read-write somewhere, populate it, and then map it read-only in its destination, at which point you can discard the read-write mapping if you want.
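
              A rough sketch of that footnote, assuming an unlinked temporary file under /tmp as the memory object and MAP_FIXED to swap the destination mapping in one step. The function names and the /tmp template are made up for illustration:

              ```c
              #define _DEFAULT_SOURCE 1 /* assumption: glibc; exposes mkstemp and MAP_ANONYMOUS */
              #include <assert.h>
              #include <stdlib.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              /* Replace the page at dest with a read-only page full of `fill`.
               * The data is written through a separate staging mapping, so dest
               * never exposes a half-populated page. Returns 0 on success. */
              int install_page_atomically(void *dest, size_t page, unsigned char fill)
              {
                  char path[] = "/tmp/dsm-page-XXXXXX";
                  int fd = mkstemp(path);
                  if (fd < 0)
                      return -1;
                  unlink(path); /* the object now lives only as long as its mappings */
                  if (ftruncate(fd, (off_t)page) != 0) {
                      close(fd);
                      return -1;
                  }
                  unsigned char *staging = mmap(NULL, page, PROT_READ | PROT_WRITE,
                                                MAP_SHARED, fd, 0);
                  if (staging == MAP_FAILED) {
                      close(fd);
                      return -1;
                  }
                  memset(staging, fill, page);
                  /* MAP_FIXED replaces whatever was mapped at dest in a single call. */
                  void *res = mmap(dest, page, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
                  munmap(staging, page);
                  close(fd);
                  return res == dest ? 0 : -1;
              }

              /* Starts with a PROT_NONE page, installs data, returns the first byte. */
              int demo_install(void)
              {
                  long ps = sysconf(_SC_PAGESIZE);
                  assert(ps > 0);
                  size_t page = (size_t)ps;
                  unsigned char *dest = mmap(NULL, page, PROT_NONE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (dest == MAP_FAILED)
                      return -1;
                  if (install_page_atomically(dest, page, 9) != 0)
                      return -1;
                  int value = dest[0];
                  munmap(dest, page);
                  return value;
              }
              ```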

              1. 2

                The reason I’m wondering about madvise is that the scheme you’ve described doesn’t give the kernel the ability to (unilaterally) discard pages for which the program has a read-only copy that it knows it isn’t the sole holder of, in response to memory pressure.

                The application has to voluntarily give them up, and it may not know that it needs to (and memory pressure may have put the system into a state where the kernel doesn’t want to or can’t schedule the application).

                I know you can implement mesi in userspace, what I’m wondering is can you implement swap

                1. 2

                  The reason I’m wondering about madvise is that the scheme you’ve described doesn’t give the kernel the ability to (unilaterally) discard pages for which the program has a read-only copy that it knows it isn’t the sole holder of, in response to memory pressure.

                  There are similar mechanisms. MADV_FREE lets the kernel discard the page and replace it with a page of zeroes. You want a scheme that lets the kernel discard the page and replace it with a no-access mapping that you could then replace. I think XNU has a mechanism like this that iOS uses for some cache infrastructure so that the OS can discard pages that the application can recreate.

                  I know you can implement mesi in userspace, what I’m wondering is can you implement swap

                  Yes - I did almost 20 years ago and I wasn’t the first one. There are three aspects to userspace swapping:

                  • Can you discard pages and get a trap on access that you can fix up?
                  • Can you remove and replace a page without doing anything that would cause a page fault?
                  • Can you get useful triggers that tell you when you need to swap?

                  Of the three, the first is fairly easy with POSIX, the second is difficult but possible, the third is impossible in portable code. Android and XNU have mechanisms for delivering various low-memory notifications to userspace, upstream Linux and other *BSDs don’t. Windows does, but it isn’t delivered at a useful time (it’s fairly easy to get into a situation where the kernel refuses to allocate more memory for userspace but the low-memory notification isn’t delivered). You can have some per-process threshold after which you start swapping, but memory pressure is a global property and so local optimisation is not a good solution.

                  1. 1

                    There are similar mechanisms. MADV_FREE lets the kernel discard the page and replace it with a page of zeroes

                    I was discussing that & why it’s unsuitable in the post you replied to above? This comes off as slightly rude. :(

                    I did almost 20 years ago and I wasn’t the first one

                    Nice! Which mechanism did you use to get pages discarded under low memory conditions?

                    Can you get useful triggers that tell you when you need to swap?

                    I assume in practice people must have tuned userspace swap systems to start discarding pages long before getting close to a hard OOM, since the kernel might need to allocate pages in order to be able to deliver a notification to & schedule the userspace process that is holding onto the memory.

                    1. 2

                      There are similar mechanisms. MADV_FREE lets the kernel discard the page and replace it with a page of zeroes

                      I was discussing that & why it’s unsuitable in the post you replied to above? This comes off as slightly rude. :(

                      Sorry, I didn’t mean to imply that you weren’t aware of this mechanism but to highlight that the change from MADV_FREE to what you want is fairly small. The implementation of MADV_FREE has to change the permission from read-write to read-only. Changing that to no-access instead and not marking the new page as CoW would be quite a small kernel change.

                      Unfortunately, signals are really bad for composition. You really want the OS to have a different signal handler for each memory object. Windows has a mechanism something like this. Otherwise, you need to rely on every signal handler invoking the one that was returned in the sigaction call if it can’t handle it.
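
                      The chaining convention in the last sentence looks roughly like this. It is a sketch only: both handlers and the choice of SIGUSR2 are made up for illustration, and a real version would also have to deal with the previous disposition being SIG_DFL (typically by restoring it and re-raising):

                      ```c
                      #define _DEFAULT_SOURCE 1 /* assumption: glibc-style feature macro */
                      #include <assert.h>
                      #include <signal.h>
                      #include <string.h>

                      static struct sigaction previous; /* whatever was installed before us */
                      static volatile sig_atomic_t first_ran, second_ran;

                      /* Stands in for a handler some other library installed earlier. */
                      static void first_handler(int signo)
                      {
                          (void)signo;
                          first_ran = 1;
                      }

                      static void second_handler(int signo, siginfo_t *info, void *ctx)
                      {
                          second_ran = 1;
                          /* Pretend this fault is not ours, so forward it down the chain. */
                          if (previous.sa_flags & SA_SIGINFO)
                              previous.sa_sigaction(signo, info, ctx);
                          else if (previous.sa_handler != SIG_DFL &&
                                   previous.sa_handler != SIG_IGN)
                              previous.sa_handler(signo);
                      }

                      /* Returns 1 if both handlers ran for a single raise(). */
                      int demo_chain(void)
                      {
                          struct sigaction sa;
                          int rc;

                          memset(&sa, 0, sizeof sa);
                          sigemptyset(&sa.sa_mask);
                          sa.sa_handler = first_handler;
                          rc = sigaction(SIGUSR2, &sa, NULL);
                          assert(rc == 0);

                          memset(&sa, 0, sizeof sa);
                          sigemptyset(&sa.sa_mask);
                          sa.sa_sigaction = second_handler;
                          sa.sa_flags = SA_SIGINFO;
                          /* The third argument hands back the handler we are replacing. */
                          rc = sigaction(SIGUSR2, &sa, &previous);
                          assert(rc == 0);
                          (void)rc;

                          raise(SIGUSR2);
                          return first_ran && second_ran;
                      }
                      ```

                      It only works if every handler in the chain plays along, which is the composition problem.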

                      I did almost 20 years ago and I wasn’t the first one

                      Nice! Which mechanism did you use to get pages discarded under low memory conditions?

                      I used a per-process in-core limit. I did hack in something to FreeBSD to expose to userspace one of the thresholds that’s used to poke the pager via a signal, but it didn’t work well. I also played a bit with L4 Hurd, which had some really nice mechanisms for doing this, but I was only able to run it in a VM.

                      Can you get useful triggers that tell you when you need to swap?

                      I assume in practice people must have tuned userspace swap systems to start discarding pages long before getting close to a hard OOM, since the kernel might need to allocate pages in order to be able to deliver a notification to & schedule the userspace process that is holding onto the memory.

                      On XNU, as I recall, there are three different levels of trigger. There’s the polite ‘please delete caches’ notification, there’s an aggressive ‘the OOM killer will kick in soon’ notification, and then there’s the OOM killer (which starts by killing processes that have opted in to sudden termination).

                      Part of the problem here is thundering herds: in the kernel, there’s typically a single pager (maybe a small number for different types of memory). In userspace, any number of processes can register for low-memory notifications. If they all need to allocate a small amount of memory to free a large amount and they all wake up at once, they end up making everything much worse.

                      1. 1

                        Unfortunately, signals are really bad for composition. You really want the OS to have a different signal handler for each memory object. Windows has a mechanism something like this. Otherwise, you need to rely on every signal handler invoking the one that was returned in the sigaction call if it can’t handle it.

                        Agreed. It’s the same problem as any kind of global state like cwd, I guess.

                        I’m of the opinion that catchable asynchronous signals are an unfortunate historical design full stop. ;)

                        You could make up a convention in user space where e.g. everyone uses a single library that maintains a hash table, but meh.

                        AIUI segfaults are so slow, it’s likely that anyone who cares enough to benchmark would rather use a btree/hash and do the lazy computation with explicit function calls.

                        I mean how often are we going to do “get everyone to agree on a convention for signal handlers” but not “get the like three programmers who access this array to call get_it(i) instead of array[i]”?

                        In userspace, any number of processes can register for low-memory notifications. If they all need to allocate a small amount of memory to free a large amount and they all wake up at once, they end up making everything much worse

                        Oof.

                        I wonder if the kernel could (or maybe does?) mitigate this by delivering the low memory notifications to each process in decreasing order of their memory usage, leaving say a dozen milliseconds between?

                        (And keep a flag for “has been sent a low memory notification” in the process info struct so this doesn’t need a dynamic memory allocation to avoid notifying any process repeatedly when the order changes while the list is being traversed. 😅)

                        I used a per-process in-core limit.

                        In all honesty this is probably the best / most predictable solution for a lot of workloads. e.g. with MySQL you tell it up front how much memory to use, so you pick a number based on how much you have and how much slack you need left for other stuff on the same box. I’ve never met anyone whose day was ruined by this scheme.

          2. 2

            You can also implement less-silly things with SIGSEGV handlers, like I think at least one clustering system might have used SIGSEGV handlers to present the illusion of shared memory across a network in userspace.

            Yeah that sort of thing seems “normal”, the fiddling with the ucontext_t just by… fiddling with it, with no need to do something like setcontext/swapcontext or longjmp is what caught me off guard.

            1. 2

              It is weird but kind of makes sense: in the signal handler context, the thread is stopped and its register state has been stashed somewhere in memory, so you can just fiddle that memory like any other data structure.

              Whereas in normal contexts the register state is in the registers that your C code is using, so you need something like setjmp that’s written in assembly and/or is a built-in.

              1. 1

                Yeah, D has an option to enable this too. It isn’t done by default though because it has its weird platform quirks and running it in a debugger is frequently more useful anyway.

                Another experiment the D people did was to mprotect things while garbage collecting, then do a kind of userspace page in on the segv signal. Turned out slower than just using the pause all threads global lock approach though, but it was still kinda cool being able to do a job typically done by the kernel.

                1. 1

                  IME most complex runtimes (i.e. ones that have a JIT) do. Mono for instance does. It turned out to be a problem when I was porting it to AIX, because AIX doesn’t have an unmapped null page: you can just… read zeroes back from null. AFAIK this was either a compat or a performance thing. Instead, I had to turn on the usually debug-only emitted-instructions null checks for the platform.

                  1. 1

                    There’s nothing in .NET that requires that null have a zero bit pattern. I wonder how much harder it would have been to keep this JIT mode, map a page somewhere in the address space as no-access and then use its address as the null pointer representation. I guess you get quite a bit of speedup from the fact that compare-against-0 is a heavily optimised instruction in most CPUs, so this may end up being slower.

                    1. 1

                      I think keeping null == 0 is probably easier for FFI/mapping to HW better, yeah.

                2. 1

                  I’ve actually done that, not as a joke. This code is in the CHERI test suite, which checks that you get traps for the operations that you expect. The individual tests can try to do a load or store that should fault and then check if a counter is incremented by the signal handler. The same tests can also run bare-metal with the equivalent logic in the interrupt service routine, but when running on an OS the kernel delivers a signal to userspace in response to the trap.

                  During my PhD, I wrote some code for out-of-core data management that caught SIGSEGV, fetched the data and populated the faulting page. That never went anywhere because it was noticeably slower than other approaches.

                  1. 1

                    context->mc_pc += 4;

                    Ah, bless that simple uncomplicated RISC instruction encoding scheme. ❤️ I was wondering where the other 99% of the decoding code must be until I looked up and saw that it is MIPS!

                    During my PhD, I wrote some code for out-of-core data management that caught SIGSEGV, fetched the data and populated the faulting page.

                    There’s a Linux thing I remember hearing of that’s supposed to use the hypervisor support to let an “ordinary” process use the hardware page fault mechanism directly, and I think it makes applications that use faults for implementing semantics several times faster. But I’ve forgotten the name of it and damned if I can find it again.

                    1. 2

                      context->mc_pc += 4;

                      Ah, bless that simple uncomplicated RISC instruction encoding scheme. ❤️ I was wondering where the other 99% of the decoding code must be until I looked up and saw that it is MIPS!

                      The decoder is simple. The logic for branch delay slots is fun. If you take the exception in a branch delay slot, you need to identify that you’re in the delay slot and then advance the PC to the instruction that’s the branch target. Fortunately, MIPS has only one delay slot and so if the instruction in the delay slot traps, it won’t have been able to modify any of the registers that the branch instruction used.

                      During my PhD, I wrote some code for out-of-core data management that caught SIGSEGV, fetched the data and populated the faulting page.

                      There’s a Linux thing I remember hearing of that’s supposed to use the hypervisor support to let an “ordinary” process use the hardware page fault mechanism directly, and I think it makes applications that use faults for implementing semantics several times faster. But I’ve forgotten the name of it and damned if I can find it again.

                      I saw some references to that in the code a couple of weeks ago, but then couldn’t think of sufficiently similar search terms to find it when I wrote this. If anyone knows what it’s called, please reply here!

                      1. 1

                        The logic for branch delay slots is fun.

                        Honestly that doesn’t take much away from the elegance of it, you only have a handful of lines for handling the BD case there. :)

            2. 1

              typo: Reading data out of signinfo_t

              1. 1

                In my opinion:

                The best way of handling POSIX signals, if you want to handle them at all, is to do so synchronously (even though the signals themselves are technically asynchronous). That means using an event loop that also supports detecting signal delivery.

                (I have written such an event loop - http://davmac.org/projects/dasynq/ - in C++)

                The act of catching signals, and retrieving the siginfo_t data, and not discarding any signals is quite difficult to manage properly. I probably should do a blog post some time, but the key points are:

                • most APIs which allow early termination (interruption) by signals actually execute the signal handler -
                • which means you can’t use sigwaitinfo to extract the siginfo_t data synchronously; it’s already gone once the signal handler returns
                • so, you have to store the data (from the signal handler) and retrieve it afterwards
                • but, as this post also notes, there’s a lot of things you can’t do from a signal handler, including allocation. So the storage has to be pre-allocated;
                • which means you must make sure only one signal is handled, otherwise you may lose the data as one call to a signal handler overwrites the data stored by the previous call

                This all turns out to be difficult to do in practice and really difficult to do portably. To the extent that I ended up actually jumping out of the signal handler (via siglongjmp, which restores the signal mask and thus avoids another signal handler being executed) to make sure I could get the signal data.
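
                To make the storage problem concrete, here is a minimal sketch of the "store it from the handler" step with a single preallocated slot. It is not the siglongjmp approach and not what dasynq does; the names are made up, and the slot simply keeps the first record, so a second signal arriving before the slot is drained would lose its data:

                ```c
                #define _DEFAULT_SOURCE 1 /* assumption: glibc-style feature macro */
                #include <assert.h>
                #include <signal.h>
                #include <string.h>

                static siginfo_t saved_info; /* preallocated: no allocation in a handler */
                static volatile sig_atomic_t info_ready;

                static void save_info(int signo, siginfo_t *info, void *ctx)
                {
                    (void)signo;
                    (void)ctx;
                    if (!info_ready) { /* keep the first record instead of overwriting it */
                        saved_info = *info;
                        info_ready = 1;
                    }
                }

                /* Returns the signal number recovered from the slot, or -1 if empty. */
                int demo_saved_info(void)
                {
                    struct sigaction sa;
                    memset(&sa, 0, sizeof sa);
                    sa.sa_sigaction = save_info;
                    sa.sa_flags = SA_SIGINFO;
                    sigfillset(&sa.sa_mask); /* block other signals while the handler runs */
                    int rc = sigaction(SIGUSR1, &sa, NULL);
                    assert(rc == 0);
                    (void)rc;
                    raise(SIGUSR1);
                    if (!info_ready)
                        return -1;
                    int signo = saved_info.si_signo; /* read the slot, then release it */
                    info_ready = 0;
                    return signo;
                }
                ```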

                1. 2

                  even though the signals themselves are technically asynchronous

                  This is half true. Some signals, such as SIGIO, SIGINT, SIGUSR1, and so on are asynchronous. They are unordered with respect to anything else in the system and it’s fine to mask them and check them in your run loop (kqueue has explicit events for this, I think Linux’s signalfd can be used in the same way).
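
                  A minimal Linux-only sketch of the signalfd variant, assuming a single-threaded caller. The function name is made up for illustration:

                  ```c
                  #define _DEFAULT_SOURCE 1 /* assumption: glibc-style feature macro */
                  #include <assert.h>
                  #include <signal.h>
                  #include <sys/signalfd.h>
                  #include <unistd.h>

                  /* Blocks SIGUSR1, raises it, and reads it back as data from a file
                   * descriptor. Returns the signal number read, or -1 on failure. */
                  int read_signal_via_fd(void)
                  {
                      sigset_t mask, old;
                      struct signalfd_siginfo si;

                      sigemptyset(&mask);
                      sigaddset(&mask, SIGUSR1);
                      /* Block normal delivery so the signal stays pending for the fd. */
                      int rc = sigprocmask(SIG_BLOCK, &mask, &old);
                      assert(rc == 0);
                      (void)rc;

                      int fd = signalfd(-1, &mask, SFD_CLOEXEC);
                      if (fd < 0) {
                          sigprocmask(SIG_SETMASK, &old, NULL);
                          return -1;
                      }
                      raise(SIGUSR1); /* now pending, no handler runs */
                      ssize_t n = read(fd, &si, sizeof si); /* dequeues the pending signal */
                      close(fd);
                      sigprocmask(SIG_SETMASK, &old, NULL);
                      return n == (ssize_t)sizeof si ? (int)si.ssi_signo : -1;
                  }
                  ```

                  In a run loop the descriptor would be registered with poll or epoll instead of being read directly.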

                  Others, such as SIGPIPE, SIGILL, SIGFPE, SIGSEGV, are synchronous. They are triggered by a specific instruction or system call. Of these, most need to be handled synchronously because the thread that triggered them cannot make progress without handling the condition. For example, SIGILL requires you to skip the offending instruction, emulate it, or kill the thread. SIGSEGV may require you to update page mappings to continue. You cannot defer these to your next run-loop iteration because you cannot get to the run loop without fixing the faulting instruction. SIGPIPE is something of an outlier here: you can make progress, but you probably want to handle the error at the point of the system call that triggered it (and not in the signal handler, which in most cases is just an annoying hold-over from ancient UNIX).

                  If you do defer handling of the signal, then you will not have access to the ucontext_t that is delivered with the signal (there is no even vaguely portable way of capturing this and on x86 the size of the signal frame can vary quite significantly between microarchitectures and even if you do capture it then it is full of values that refer to things that no longer exist). This means that you can’t do any of the interesting things that signals allow. Even capturing the siginfo_t can be dangerous because it can refer to file descriptors or POSIX IPC identifiers that are guaranteed to be live for the duration of the signal handler but may not exist by the start of the next run-loop iteration (or, worse, exist as identifiers but refer to different objects).

                  It’s quite unfortunate that signals are a single mechanism used for both synchronous and asynchronous events. Spilling the ucontext_t is quite expensive for the kernel and it’s almost never used for asynchronous events (it’s useful only when you want to do something like userspace threading). Similarly, the various mechanisms for polling for signals are completely useless for synchronous signals.

                  1. 1

                    SIGPIPE, SIGILL, SIGFPE, SIGSEGV, are synchronous. They are triggered by a specific instruction or system call. Of these, most need to be handled synchronously

                    Sure, but that’s not what the original article was discussing - capturing signal info and feeding it back to the main thread. These synchronous signals are the exception rather than the rule; most programs don’t handle them at all, though of course there are reasons for doing so.

                    1. 2

                      most programs don’t handle them at all, though of course there are reasons for doing so.

                      Most programs don’t handle any signals. I don’t have a representative sample set of the ones that do, but from the ones that I’ve seen (which tend to be language runtimes, emulators, or security features, so a fairly skewed sample set) it’s almost always the synchronous ones that people care about. The asynchronous ones were important 15 years ago, but there are now usually other mechanisms that are less painful to use for getting the same information. Any high-performance I/O-heavy system I’ve worked on has started by disabling signals.

                      1. 1

                        Most programs don’t handle any signals

                        But, again, there are some that do, and that’s what the article was discussing. I’m aware of at least a few that handle SIGINT to do a clean shutdown, and a number of daemons (smbd and sshd, for example) are receptive to SIGHUP (often as a “reload configuration” message). The dd utility responds to SIGINFO. Those are just examples I can think of off the top of my head. I’d be surprised if there weren’t a fair few programs that use SIGCHLD to detect child process termination as well, though it’s not always necessary.

                        I understand there are programs that care about the synchronous signals, but I think it’s mostly particular types of program that do (as you say: skewed sample set), and I don’t think the article was talking about these synchronous signals (it’s using SIGTERM as the example, and the technique it’s talking about isn’t useful for the synchronous signals, it loses the ucontext_t information).

                    2. 1

                      Oh and:

                      It’s quite unfortunate that signals are a single mechanism used for both synchronous and asynchronous events. Spilling the ucontext_t is quite expensive for the kernel

                      Linux’s signalfd does solve that; you don’t need to let a signal handler run to dequeue the signal. On BSDs with kqueue you can detect the signal and then use sigwaitinfo or sigtimedwait (the latter is actually necessary, with a zero timeout, to avoid waiting spuriously for a signal which was delivered but not queued, e.g. a signal is collapsed into an already pending signal). Unfortunately OpenBSD doesn’t have sigtimedwait, and the Mac OS implementation of kqueue has (or at least had) a bug where kqueue sometimes reported a signal some time before it was queued. It’d certainly be nice if there was a standardised mechanism.
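
                      The zero-timeout dequeue can be sketched like this, assuming a system that provides sigtimedwait. The function names are made up for illustration, and the kqueue or signalfd notification that would normally trigger the poll is left out:

                      ```c
                      #define _DEFAULT_SOURCE 1 /* assumption: glibc-style feature macro */
                      #include <assert.h>
                      #include <signal.h>
                      #include <string.h>
                      #include <time.h>

                      /* Dequeues signo if it is pending, without ever blocking. Returns the
                       * signal number, or -1 when nothing was queued. */
                      int poll_signal_once(int signo, siginfo_t *out)
                      {
                          sigset_t set;
                          struct timespec zero = {0, 0};
                          sigemptyset(&set);
                          sigaddset(&set, signo);
                          return sigtimedwait(&set, out, &zero);
                      }

                      /* Returns 1 if polling sees nothing before raise() and SIGUSR1 after. */
                      int demo_poll(void)
                      {
                          sigset_t set, old;
                          siginfo_t info;
                          memset(&info, 0, sizeof info);
                          sigemptyset(&set);
                          sigaddset(&set, SIGUSR1);
                          /* The signal must be blocked, or a handler would consume it. */
                          int rc = sigprocmask(SIG_BLOCK, &set, &old);
                          assert(rc == 0);
                          (void)rc;

                          int before = poll_signal_once(SIGUSR1, &info); /* nothing pending */
                          raise(SIGUSR1);
                          int after = poll_signal_once(SIGUSR1, &info); /* dequeues it */

                          sigprocmask(SIG_SETMASK, &old, NULL);
                          return before == -1 && after == SIGUSR1 && info.si_signo == SIGUSR1;
                      }
                      ```

                      With plain sigwaitinfo the first call would block until a signal showed up, which is the spurious wait described above.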