1. 40

    1. 9

      I am not totally convinced by the security argument that they make, but I’d really like to remove syscall from the FreeBSD kernel. I only recently learned that it is implemented in the kernel. This complicates the kernel implementation and it causes some pain for anything in userspace that wants to trap and handle system calls that are blocked by some policy (you have to handle both the direct and indirect case) and for various bits of auditing. Particularly on *BSD, where system call stubs are machine generated from syscalls.master, it would be trivial to do this in userspace. Given that most code doesn’t actually need to do system calls like this (again, because system calls all have auto-generated C ABI wrappers, you don’t have things like the GNU/Linux stupidity where the futex syscall doesn’t have a C wrapper), moving the thing that does indirect dispatch to a separate libsyscall that you link only if you need that would reduce the attack surface, without breaking things like Go.

      1. 2

        I don’t know much about FreeBSD at all, so please excuse if this is a silly question:

        What does it mean that it is implemented in the kernel? Does that mean there is a “syscall” syscall? And that then calls (within the kernel) the correct handler for the inner syscall?

        1. 2

          Yes, exactly. The first argument is the syscall number and the subsequent arguments are forwarded to the correct one.
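
          As a minimal sketch of that forwarding, assuming a libc that still ships the syscall() wrapper (glibc and FreeBSD’s libc do), the indirect and direct forms below should reach the same handler. indirect_getpid is a name made up for this example, not part of any libc:

          ```c
          #define _DEFAULT_SOURCE    /* glibc: expose syscall() under strict -std modes */
          #include <sys/syscall.h>   /* SYS_getpid */
          #include <sys/types.h>     /* pid_t */
          #include <unistd.h>        /* getpid(), syscall() */

          /* Hypothetical helper: reach getpid through the generic entry point.
           * The first argument selects the system call; any further arguments
           * would be forwarded to it unchanged. */
          static pid_t indirect_getpid(void)
          {
              return (pid_t)syscall(SYS_getpid);
          }
          ```

          Calling indirect_getpid() should return the same value as getpid(). On FreeBSD the forwarding happens inside the kernel as described above; glibc’s wrapper instead does it in userspace.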

          1. 4

            That is weird indeed! I can see why you’d want to get rid of it and do the indirection in a library instead.

            Also: can you do several layers of indirection like syscall(SYS_SYSCALL, SYS_SYSCALL, SYS_DUP, 5)?

            1. 2

              I only looked at the code briefly when fixing some of the things to emulate system calls in sandboxed processes (which requires grabbing the arguments out of the signal’s ucontext). I believe there’s a single layer of indirection at the boundary, so I’d expect an error from that (try it and see! You might find a bug…).

              On *BSD systems, the system call calling convention is the function call calling convention (except on x86-64, due to a register conflict) with the syscall number in an extra register[1] so that you can just call the trampoline and it doesn’t need to rearrange arguments. Only the architecture-specific code at the boundary knows how to read the arguments (some may be in registers, some on the stack). System calls are defined in syscalls.master (or one of the compat-layer equivalents) and dispatch goes via three steps (most of the first step is machine generated code from the type signature of the function):

              1. The arch code constructs a struct from the arguments.
              2. The sys_ function (e.g. sys_open) is called with these arguments.
              3. This then calls the kern_ function (e.g. kern_open) that actually implements the system call.

              Roughly, the first step handles anything related to a particular calling convention, the second anything related to a specific ABI, and the third is generic. If you are running 64- and 32-bit x86-64 FreeBSD processes, their syscalls will go via different architecture-specific paths to build the structures. Then the second step is code that is common to either all 64-bit or all 32-bit processes. The final step is shared across all ABIs and architectures. The same model applies to compatibility ABIs. The x86-64 and Arm architecture-specific code has paths for handling the Linux system calling convention. These then dispatch to the same paths at the second step, with the AArch64 and x86-64 code sharing the same implementations. These then either dispatch to the native implementations for system calls that exist in FreeBSD or to something in the Linux emulation layer for things like epoll that don’t have direct equivalents.
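
              The three steps above can be modeled with a toy example. Everything here is an assumption made for illustration: the toy_ names, the struct layout, and the fake file descriptor are not the real FreeBSD definitions, and the "registers" are just an array of machine words:

              ```c
              #include <assert.h>
              #include <stddef.h>
              #include <stdint.h>

              /* Toy stand-in for the per-syscall argument struct built in step 1. */
              struct toy_open_args {
                  const char *path;
                  int flags;
                  int mode;
              };

              /* Step 3: generic implementation, shared by every ABI. */
              static int toy_kern_open(const char *path, int flags, int mode)
              {
                  (void)flags;
                  (void)mode;
                  return path != NULL ? 3 : -1;   /* pretend fd 3 was allocated */
              }

              /* Step 2: per-ABI entry point that unpacks the argument struct. */
              static int toy_sys_open(const struct toy_open_args *uap)
              {
                  assert(uap != NULL);
                  return toy_kern_open(uap->path, uap->flags, uap->mode);
              }

              /* Step 1: arch-specific code that knows where the arguments live. */
              static int toy_trap_open(const uintptr_t regs[3])
              {
                  struct toy_open_args args = {
                      .path = (const char *)regs[0],
                      .flags = (int)regs[1],
                      .mode = (int)regs[2],
                  };
                  return toy_sys_open(&args);
              }
              ```

              Only toy_trap_open knows where the raw arguments came from; by the time toy_sys_open runs, they are already a typed struct.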

              All of this means that there can’t be a sys_syscall, because arguments have been copied into the kernel by that time. Looking in a couple of trap.c implementations, they aren’t doing the check in a loop. There is an entry for syscall in syscalls.master, but it’s marked as special.

              [1] It’s actually slightly more fun than that, in a way I didn’t find documented anywhere but that I want to copy elsewhere. On architectures with condition codes (anything modern except RISC-V), the carry flag is set on return from a syscall to indicate that the value in the return register is an errno value. This lets libc do a single branch on carry, set errno, and then set the return value to -1. Branch on carry is normally statically predicted as not taken, so doesn’t consume branch predictor state in the common case.
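
              A rough C model of that stub tail, assuming the return register and the carry flag are handed in as plain parameters (real stubs do this in a few instructions of assembly; finish_syscall is a hypothetical name):

              ```c
              #include <errno.h>
              #include <stdbool.h>

              /* `ret` stands in for the return register and `carry` for the
               * carry flag as the kernel left them on return. */
              static long finish_syscall(long ret, bool carry)
              {
                  if (carry) {            /* the single, rarely taken branch */
                      errno = (int)ret;   /* the register held an errno value */
                      return -1;
                  }
                  return ret;             /* common case: pass the result through */
              }
              ```

              With carry clear the value passes through untouched; with carry set the caller sees -1 and finds the error code in errno.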

              1. 2

                On architectures with condition codes (anything modern except RISC-V)

                I have to take issue with that!

                No clean sheet ISA designed for high performance since 1990 [1] has had traditional condition codes. This includes POWER, Alpha, Itanium, and RISC-V.

                Alpha and RISC-V (like MIPS) simply don’t have condition codes. If you need the result of a comparison to persist then you save it in a GPR.

                IBM POWER (1990) recognised that traditional condition codes are problematic. It tried to solve the problem by having eight sets of condition codes (except carry and overflow, which there is only one of, in xer). I think this has to be regarded as a failed experiment – at that point you might as well use SETcond on GPRs as CMP CRn.

                Itanium had 64 1-bit predicate registers. So a similar solution to POWER, though each of POWER’s 8 sets of condition codes was somewhat traditional (at least following S/360, rather than PDP-11) in having individual bits for LT, EQ, GT and Summary Overflow.

                Although AArch64 is newer than 1990, and intended for high performance, it’s not clean sheet as it carries many things over from 32-bit Arm, perhaps because for the first decade or so it’s had to co-exist in the same CPU cores as AArch32.

                [1] so leaving out microcontrollers such as MSP430 (reworked PDP-11)

    2. 8

      Does anyone have a strong feeling about this actually improving security? I’m seriously skeptical, but I haven’t given it a ton of thought. Any exploit devs care to comment?

      1. 8

        Well, the obvious answer is that it would lower the surface area a little bit, right? Instead of worrying about libc and syscall(2) you just worry about libc. Whether that improves security I don’t know but considering the resources OpenBSD has, it is one less thing to worry about or deal with.

      2. 7

        I feel the security benefits are theatre, but it enables better backwards compatibility in the long run - Windows and Solaris for example, have had good dynamic library based compatibility for years. Of course, OpenBSD doesn’t care about that part…

      3. 2

        If you can’t run software written in one of the most popular memory safe languages anymore, then maybe that would be bad for security.

        1. 7

          The Go port will be fixed as it was before. How do you draw the conclusion that Go isn’t going to be supported? You didn’t read the post.

        2. 1

          Right but I’m trying to understand if this mitigation would actually make exploitation of an existing vulnerability difficult. It feels like a mitigation without a threat model.

          1. 3

            Go read up on ROP and stack pivots, especially on amd64 and variable length instruction architectures that make it impossible to completely remove ROP gadgets. There are very clear threat models already defined based on arbitrary code execution, especially remotely. Reducing the syscall surface area as much as possible minimizes the success probability.

            1. 4

              I’m surprised no one has yet decided to use a separate stack for data, especially on x86-64 with more registers, a register parameter passing paradigm and a larger memory space: leave RSP for CALL/RET and use another segment of memory for the “stack frame”. That way, overwrites of the data stack won’t affect the return address stack at all. Given how fast 32-bit systems have largely disappeared on the Internet, I think such an approach would be easier (and faster) than all the address randomization, relinking, stack canaries, et al.

              Or (as painful as this is for me to say), just stop using C! Has anyone managed to successfully exploit a program in Rust? Go?

              1. 4

                A similar feature is called “shadow stacks” - return addresses get pushed both to the standard stack and a separate return stack, and the addresses are checked to match in the function epilogue. It’s supported in all the big C compilers. I can’t speak to how often it’s actually used.
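
                A toy model of the push-and-compare scheme described above. This is only a sketch: real shadow stacks are maintained by compiler-inserted code or by hardware, and the names and fixed depth here are invented for illustration:

                ```c
                #include <assert.h>
                #include <stdbool.h>
                #include <stddef.h>
                #include <stdint.h>

                #define SHADOW_DEPTH 64

                /* The protected copy of the return addresses. */
                static uintptr_t shadow[SHADOW_DEPTH];
                static size_t shadow_top;

                /* Prologue: remember the return address in the protected copy. */
                static void shadow_push(uintptr_t return_addr)
                {
                    assert(shadow_top < SHADOW_DEPTH);
                    shadow[shadow_top++] = return_addr;
                }

                /* Epilogue: the address found on the ordinary stack must match. */
                static bool shadow_check_pop(uintptr_t addr_on_stack)
                {
                    assert(shadow_top > 0);
                    return shadow[--shadow_top] == addr_on_stack;
                }
                ```

                If a buffer overflow rewrites the return address on the ordinary stack, shadow_check_pop() reports a mismatch and the program can abort instead of returning into attacker-chosen code.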

                Further afield, Forth also exposes fully separate data and return stacks. So it’s been done.

                As far as performance goes, you’re losing an extra register for the other stack, which can be significant in some cases, and also memory locality. Cost varies but has been measured around 10%.

                1. [Comment removed by author]

              2. 1

                In addition to the safe stack / shadow stack work, it’s worth pointing out SPARC. SPARC had a model of register windows arranged in a circle. You had 8 private registers, 8 shared with the caller and 8 shared with the callee (you could reuse any caller-shared one you weren’t using for return and all callee-shared ones between calls). The first S in SPARC stood for ‘scalable’ because the number of windows was not architecturally specified. When you ran out, you’d trap and spill the oldest one (you should do this asynchronously, but I don’t believe the implementations that did ever shipped). This meant that the register spill region had to be separate from the stack. This gave complete protection from stack buffer overflows turning into ROP gadgets.

                Pure software variants have been tricky to adopt because they’re incredibly ABI disruptive. Anything that creates stacks needs to allocate space. Anything that generates code needs to preserve an extra register (not just across calls but also when calling other functions). Anything that does stack unwinding needs to know about them.

                If you’re compiling everything together and static linking, it’s feasible.

                1. 1

                  I know about the register windows on the SPARC, but I never really dove into how it interacted with the operating system with regards to context switches (process or threads)—it seems like it could be expensive.

                  1. 1

                    Switching threads was very expensive. A large SPARC core could have 8+ register windows, so needed to save at least 64 registers. That’s fairly small in comparison with a modern vector extension, but still large.

                    On later superscalar designs, it actually wasn’t that bad. Modern processors allocate L1 lines on store, so spilling a full cache line is quite cheap. If you’re doing this often, you can even skip the store buffer and just write directly from registers to the cache line. I think switching from a thread required spilling all used register windows, but resuming a thread just required reading back the top and then the others could be faulted in later. SPARC had very lightweight traps for this kind of thing (and TLB fills - 32-bit SPARC had a software-managed TLB, though later SPARCs were spending 50% of total CPU time in that trap handler so they added some hardware assist with 64-bit versions).

                    I think the biggest mistake that SPARC made was making the register window spill synchronous. When you ran out of windows, you took a synchronous fault and spilled the oldest one(s). They should have made this fully asynchronous. Even on the microarchitectures of the early SPARCs, spilling could have reused unused cycles in the load-store pipeline. On newer ones with register renaming, you can shunt values directly from the rename unit to a spill FIFO and reduce rename register pressure. I think that’s what Rock did, but it was cancelled.

            2. 1

              ROP isn’t a statistical attack, so this talk of probability is confusing.

              1. 4

                Have a look at Blind ROP: https://en.wikipedia.org/wiki/Blind_return_oriented_programming

                When you don’t have complete information of the running program, these automated techniques will operate with a probability of success or failure.

                1. 1

                  But nothing about this mitigation is unknown or randomized further, as far as I can tell. I don’t see how brop is important here or how it would be impacted by this. Maybe the attacker needs to be a bit pickier with their gadgets?

                  1. 3

                    Any ROP technique needs to find and assemble gadgets. This would remove one possible type of gadget, making it harder to achieve arbitrary syscall execution especially in light of other mitigations like pledge(2) or pinsyscall(2).

      4. 1

        Assuming they do this to all interesting syscalls, it would make shellcode writing a bit more painful as now you actually have to deal with ASLR to find the libc versions. That said, ASLR isn’t a significant barrier in 99% of cases, so it’s not going to stop targeted or skilled attackers. However, it seems it would also disallow statically linking libc, which is a huge con for such a minor gain IMO.

        Disclaimer: It’s been almost a decade since I’ve attacked an OpenBSD machine on the job, so there may be additional protections I’m not aware of that make this change a more valuable protection.

        1. 2

          FWIW, OpenBSD relinks libc (and OpenSSH, and the kernel) on each boot. So defeating ASLR on OpenBSD may require more than finding one offset.

    3. 5

      Another frustrating move, resulting in almost all software having a C dependency directly or indirectly.

      And as far as I know, no modern compiler implements an open specification for C either.

      1. 6

        Yeah, being security-obsessed on one hand but doubling and tripling down on C on the other feels weird to me.

        1. 5

          There is no difference in safety between the system/c ABI and a raw syscall abi. If anything the C version is safer, because as anemic as the C type system is, it at least has one.

          1. 17

            And, in particular, POSIX specifies the C interfaces to system calls and carefully avoids specifying where the kernel interface is. You could, for example, use handles in the kernel and move all of the file descriptor table code into userspace.

            There’s no requirement that this be implemented in C (LLVM libc is C++, Redox OS’s libc is Rust), but it must expose things with the C type system.

    4. 3

      There’s going to be some fallout which takes time to fix, especially in the “go” ecosystem.

      Do you have any insight as to why there will be fallout for go, or how they’re likely to fix it?

      1. 9

        The go runtime seems to use syscall(2) extensively on OpenBSD: https://github.com/golang/go/issues/59667

      2. 8

        Go famously tries to perform raw system calls on every platform, even though that’s pretty much only supported on Linux.

        IIRC Go was supposed to start going through libc on OpenBSD a few versions back after OpenBSD deployed syscall origin verification, but I believe OpenBSD provided a temporary stay of execution for statically linked executables so maybe that didn’t happen? Or maybe they just moved everything over to syscall(2)? Or lots of packages in the ecosystem call syscall.Syscall?

        1. 4

          Didn’t they give up on raw syscalls on Mac? That just leaves Windows and other Unixes.

        2. 3

          I get the sense that syscall.Syscall() is regrettably common for ioctl() calls in particular, at least in part because the Go folks have historically refused to add an untyped ioctl wrapper.

          1. 1

            Yeah, memory safety is a thing they care about.

            1. 13

              Only enough to make it tedious: using the indirect syscall wrapper already allows you to make any totally unsafe system call. Providing a first-class ioctl() wrapper would not be any more or less unsafe, merely more readily portable.

    5. 2

      What if my application contains code that simulates syscall(2)? i.e. A big switch that dispatches arguments to various libc functions based on some input enum. Does that nullify the benefits of removing syscall() from libc?

      1. 4

        Of course not, you’re doing whatever you like in userspace, just like the rest of your program does. You’ll still enter the kernel using the approved single interface via libc, including its validity checks. Creating a my_syscall() function in your program doesn’t let you bypass those checks.
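
        A minimal sketch of such a dispatcher, assuming made-up operation codes rather than kernel syscall numbers (my_op and its values are hypothetical):

        ```c
        #include <errno.h>
        #include <stdint.h>
        #include <unistd.h>

        /* Hypothetical operation codes; they are not kernel syscall numbers. */
        enum my_op { MY_GETPID, MY_DUP, MY_CLOSE };

        /* Userspace-only dispatcher: every case ends in an ordinary libc call,
         * so the kernel still sees entries coming from libc's own stubs. */
        static long my_syscall(enum my_op op, intptr_t a0)
        {
            switch (op) {
            case MY_GETPID: return (long)getpid();
            case MY_DUP:    return dup((int)a0);
            case MY_CLOSE:  return close((int)a0);
            default:
                errno = ENOSYS;   /* unknown operation */
                return -1;
            }
        }
        ```

        The switch can only reach the calls that were compiled into it, which is the sense in which it differs from a wrapper that accepts any syscall number.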

        1. 2

          The original syscall() also had those same checks. The only difference was that it allowed you to dispatch a syscall via an argument, so it doesn’t seem like there is a meaningful difference. my_syscall() ends up providing the same potential ROP gadget as the removed syscall(). I’m guessing there is a more subtle detail being missed.

          1. 4

            I assume the argument is that if you really need it, you can provide the emulated user space implementation and aren’t any worse off than before. If you don’t need it, you don’t need to ship that potential attack surface.

    6. 1

      Does this mean you couldn’t use alternative C libraries like musl or cosmopolitan (statically linked!) anymore? Granted I don’t really use OpenBSD, but I am still interested. The idea could gather momentum…

    7. 1

      This is only removing the libc interface to do an arbitrary syscall, right? The syscalls themselves are still implemented in assembly with something like

      mov eax,1

      Or with an int 0x80. Right? So if an attacker can still do this, what’s the advantage?
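
      For reference, a hedged sketch of what a raw system call looks like from C. It assumes x86-64 Linux and the GCC/Clang inline assembly syntax, and uses the syscall instruction rather than int 0x80; raw_getpid is a hypothetical name, and other platforms fall back to libc so the example still builds:

      ```c
      #include <sys/types.h>
      #include <unistd.h>
      #if defined(__linux__) && defined(__x86_64__)
      #include <sys/syscall.h>   /* SYS_getpid */
      #endif

      static pid_t raw_getpid(void)
      {
      #if defined(__linux__) && defined(__x86_64__)
          long ret = SYS_getpid;   /* syscall number goes in rax */
          /* The syscall instruction clobbers rcx and r11; the result
           * comes back in rax. */
          __asm__ volatile ("syscall"
                            : "+a"(ret)
                            :
                            : "rcx", "r11", "memory");
          return (pid_t)ret;
      #else
          return getpid();         /* fallback so the sketch builds elsewhere */
      #endif
      }
      ```

      Linux accepts this from any executable page, which is exactly the kind of entry the replies below say OpenBSD refuses.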

      1. 13

        OpenBSD blocks syscalls not made through libc

        1. 4

          Not quite. OpenBSD blocks system calls made from pages that are not explicitly marked as valid sources of system calls. Most of libc is not allowed to make system calls (if you find a weird bit of x86 machine code that looks like a syscall instruction if you jump into the middle of it, it still can’t be used to make system calls). If you have your own system call dispatcher, you can mark the page that contains it in the same way.

          1. 2

            Right, assuming your program isn’t being runtime linked by ld.so(1), so in practice for programs in base this isn’t possible after calling main.

      2. 12

        From what I understand the kernel will check the origin of the syscall and if it’s outside of where it expects it to be it’ll just refuse to service it. Therefore the only way to actually perform the syscall is to go through libc.

        1. 2

          Combine this with a way to only load the parts of libc that your application actually uses, and it should be a pretty nice improvement.

          1. 3

            Effectively, that’s pledge(2): https://man.openbsd.org/pledge.2
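
            A minimal sketch, assuming OpenBSD’s pledge(2) with the "stdio" promise. Note that pledge limits which system calls the process may make rather than which parts of libc are mapped. restrict_to_stdio is a hypothetical helper, and the fallback branch exists only so the example builds on systems without pledge:

            ```c
            #include <stddef.h>
            #include <unistd.h>

            /* Returns 0 on success, -1 on failure (with errno set by pledge). */
            static int restrict_to_stdio(void)
            {
            #ifdef __OpenBSD__
                /* From here on, only the "stdio" group of system calls is allowed. */
                return pledge("stdio", NULL);
            #else
                return 0;   /* assumption: no pledge(2) here, so this is a no-op */
            #endif
            }
            ```

            A program would typically call this once, after it has opened whatever files and sockets it needs.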

          2. 1

            a way to only load the parts of libc that your application actually uses

            That’s static linking. You really can’t load portions of a shared library, because a) it’s shared, and b) other programs might use other parts of said shared library.

            1. 2

              Even if other programs use different parts of the library I could elect to only map specific parts to my memory space. It’d be trickier to handle libraries loaded after entering main().