1. 21
  1.  

  2. 5

    If I understand this correctly, the flow is:

    1. Call sfork, now you are in the parent process but with the child’s view of kernel state (file descriptor tables and so on).
    2. Open / close files, do whatever setup you need.
    3. Call sfork_execve, which doesn’t execve in the parent but instead starts the binary in the child and resets the parent’s view of kernel state to the parent’s old state.

    There are some really nice things here. In particular, it avoids the memory-management headaches of vfork. Anything done between sfork and sfork_execve is still visible in the parent after the sfork_execve call, which means it’s easy to free the state.

    The thing that I don’t like is that it still requires copying the parent FD table (which imposes some synchronisation and memory-management overhead in the kernel, potentially a lot in a large server process). I’d prefer that the child started with an empty FD table but added an sfork_dup2 where the source was in the parent’s FD namespace and the target was in the child’s, so you could do something like:

    int some_privileged_file = open(...);
    sfork();
    // Inherit stdin / stdout / stderr.
    // Arguments as written here: (fd number in the child, fd in the parent).
    sfork_dup2(STDIN_FILENO, STDIN_FILENO);
    sfork_dup2(STDOUT_FILENO, STDOUT_FILENO);
    sfork_dup2(STDERR_FILENO, STDERR_FILENO);
    sfork_dup2(PrivilegedFDNumber, some_privileged_file);
    // Enter jail
    jail_attach(...);
    // Drop root privilege
    setuid(...);
    // Open some more files with reduced privileges, inside the jail
    open(...);
    open(...);
    sfork_execve(...);
    // Back in the parent's view: free any memory you allocated here.
    free(...);
    close(some_privileged_file);
    

    The sfork_execveat is slightly weird. I don’t know why you’d want that instead of an analogue of fexecve. It means that you can’t execve a binary that the child doesn’t have access to by the time that you get to the sfork_execve. Passing the file descriptor to the binary is simpler and if you open it with the close-on-exec flag then it doesn’t end up in the FD table of the child.

    1. 1

      The thing that I don’t like is that it still requires copying the parent FD table (which imposes some synchronisation and memory-management overhead in the kernel, potentially a lot in a large server process). I’d prefer that the child started with an empty FD table but added an sfork_dup2 where the source was in the parent’s FD namespace and the target was in the child’s, so you could do something like:

      Linux’s close_range(2) removes the need for this to some degree - by closing most fds at the same time as creating the new FD table with CLOSE_RANGE_UNSHARE, you remove most of the synchronization needed. pidfd_getfd(2) also helps here.

      That is, instead of copying the fd table at sfork time, you’d run sfork(CLONE_FILES) so the new process shares the parent’s fd table, then you’d unshare/copy it explicitly by calling close_range(CLOSE_RANGE_UNSHARE).
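
      A minimal sketch of the close_range(CLOSE_RANGE_UNSHARE) step on its own, assuming Linux 5.9 or later. The helper name unshare_fds_and_close_from is made up for illustration, and the raw syscall is used only so that the sketch does not depend on a recent libc wrapper. It does not show sfork itself.

      ```c
      #include <assert.h>
      #include <errno.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      /* Flag value as defined in <linux/close_range.h>; repeated here in case
       * the installed headers predate it. */
      #ifndef CLOSE_RANGE_UNSHARE
      #define CLOSE_RANGE_UNSHARE (1U << 1)
      #endif

      /*
       * Hypothetical helper: give the caller a private copy of the fd table and
       * close every descriptor numbered `first` or higher in that copy, in one
       * system call. Returns 0 on success, -1 with errno set otherwise.
       */
      int unshare_fds_and_close_from(unsigned int first)
      {
          assert(first >= 3); /* this sketch always keeps stdin/stdout/stderr */
      #ifdef SYS_close_range
          return (int)syscall(SYS_close_range, first, ~0U, CLOSE_RANGE_UNSHARE);
      #else
          errno = ENOSYS; /* headers too old to know the syscall number */
          return -1;
      #endif
      }
      ```

      In a task that shares its fd table with the parent, the parent’s descriptors are left untouched; only the caller’s new private table loses the closed range.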

      The sfork_execveat is slightly weird. I don’t know why you’d want that instead of an analogue of fexecve. It means that you can’t execve a binary that the child doesn’t have access to by the time that you get to the sfork_execve.

      sfork_execveat is exactly Linux’s execveat(2), wrapped so that it can return. Like the unwrapped execveat, it supports fexecve-style use via AT_EMPTY_PATH.

      Tangentially, see “Bugs” in the Linux execveat manpage for an unfortunate interaction between fexecve, CLOEXEC on the executable fd, and shebang files, which means you can’t safely set CLOEXEC. I don’t know if this also applies on FreeBSD.

      1. 2

        You can safely set CLOEXEC if you know that the file is not interpreted (which is the case for a lot of my uses, though not in the general case). That said, it’s usually fine to just leave an execute-only file descriptor open in the child - the kernel will keep a reference to the file on disk open for the lifetime of the program anyway so you’re using a tiny bit of extra kernel state but not really impacting anything in the child.

        I’m quite tempted to do a FreeBSD implementation of a variant of sfork with the sfork_dup2 variant that I proposed above.

        1. 1

          I’m quite tempted to do a FreeBSD implementation of a variant of sfork with the sfork_dup2 variant that I proposed above.

          If you do I’d be very interested in hearing about it. As you might perceive, I spend a lot of time thinking about better process creation and management mechanisms for Unix.

          If you do make sfork_dup2, I’d recommend making it a bit more generic so it can be used outside of the sfork context. You could imagine that it works to copy file descriptors from your parent, iff you’re running in the same uid/security context as your parent. Or perhaps go all the way and make a pidfd_getfd equivalent. Either way it would then be useful for rsyscall-style approaches where you create an essentially-empty child process and then populate it from the parent by remote-calling sfork_dup2/pidfd_getfd/etc inside the child.

          Note that the nature of sfork prevents you from using SCM_RIGHTS to implement sfork_dup2, otherwise that would be a tempting option.

          For completeness, not that you mentioned this: The other class of alternative is a system call which takes a list of file descriptors and copies them all at once into a new fd table. In some sense this is what close_range(CLOSE_RANGE_UNSHARE) does. I’ve explored this option in rsyscall but found it doesn’t work as well, because it’s anti-modular and has many of the same downsides of posix_spawn.

          1. 3

            If you do I’d be very interested in hearing about it. As you might perceive, I spend a lot of time thinking about better process creation and management mechanisms for Unix.

            There are some folks thinking about this in FreeBSD because of the colocated process work for CHERI (which allows multiple OS processes in the same address space, with CHERI providing the isolation). This is currently done with vfork + coexecve and vfork works surprisingly well but isn’t quite ideal.

            Between CHERI, jails, and Capsicum, there are a lot of things that require flexibility from process creation. I think a FreeBSD-flavoured version of sfork would mirror pdfork for the initial call, would provide an empty FD table, add an sfork_dup2 to explicitly inherit some file descriptors, and would then use an sfork_fexecve mirroring fexecve to create the process.

            A userland wrapper could implement your sfork_execveat by doing an openat of the file, reading the first two bytes, and setting the close-on-exec flag if they are not ‘#!’, then doing sfork_fexecve, but the kernel primitive could also execute files in directories that the child doesn’t have access to.
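
            The open-and-check half of that wrapper might look roughly like the sketch below. The function name open_for_fexecve is hypothetical, and the sketch assumes the file can be opened for reading (an execute-only descriptor could not be inspected this way). The returned descriptor would then be handed to the proposed sfork_fexecve, which is not shown.

            ```c
            #include <assert.h>
            #include <fcntl.h>
            #include <unistd.h>

            /*
             * Hypothetical helper: open `path` relative to `dirfd` for execution by
             * descriptor. FD_CLOEXEC is set unless the file starts with "#!", because
             * an interpreter has to reopen a script through its descriptor, so that
             * descriptor must survive the exec. Returns the fd, or -1 on failure.
             */
            int open_for_fexecve(int dirfd, const char *path)
            {
                assert(path != NULL);

                int fd = openat(dirfd, path, O_RDONLY);
                if (fd < 0)
                    return -1;

                /* Read the first two bytes without moving the file offset. */
                char magic[2];
                ssize_t n = pread(fd, magic, sizeof(magic), 0);
                if (n < 0) {
                    close(fd);
                    return -1;
                }

                int interpreted = (n == 2 && magic[0] == '#' && magic[1] == '!');
                if (!interpreted && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
                    close(fd);
                    return -1;
                }
                return fd;
            }
            ```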

            Similarly, a userspace wrapper could implement your sfork behaviour with:

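            #include <sys/types.h>
            #include <sys/procdesc.h>
            #include <unistd.h>

            // pdsfork is the proposed syscall, assumed here to return a process descriptor.
            pid_t sfork(void)
            {
              int child_fd = pdsfork(PD_DAEMON);
              if (child_fd < 0)
              {
                return -1;
              }
              // pdgetpid reports the PID through its second argument.
              pid_t pid;
              int ret = pdgetpid(child_fd, &pid);
              // PD_DAEMON keeps the child alive after its descriptor is closed.
              close(child_fd);
              return (ret < 0) ? -1 : pid;
            }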
            

            So the only four syscalls that you’d need are:

            • pdsfork
            • sfork_exit
            • sfork_dup2
            • sfork_fexecve

            sfork and sfork_execveat could then be implemented in userspace if anyone wants them (anything involving PIDs is inherently racy, so I’d much prefer people used the process descriptor versions, which aren’t).

            If you do make sfork_dup2, I’d recommend making it a bit more generic so it can be used outside of the sfork context.

            This has a lot of subtlety to get right. You definitely wouldn’t want a capability-mode process to be able to do that (and, in my ideal world, all processes end up being capability-mode processes) and it also interacts poorly with any other security feature. You’d need to check for any change in any security context, which seems fragile. In contrast, in the sfork context the only code that can run is code that is explicitly authorised by the parent process.

            You could implement something similar to pidfd_getfd by adding a permission on process descriptors that allows it, but it’s an awful API because it’s inherently racy: the source process can modify its FD table concurrently with the target’s pidfd_getfd call, and because it doesn’t know that the target is making this call it has no inherent mechanism to synchronise with it.

            The Windows equivalent DuplicateHandle is more symmetrical (you need the equivalent of a process descriptor for both processes and you must have the rights to modify the equivalent of the FD table in both) but it is almost always used in the push model, because the sender can guarantee that the handle is still valid.

            Restricting it to the sfork context is also nice because it avoids all of these concurrency problems (or, at least, means that they are limited to the same set as in any UNIX program): unless another thread closes and opens file descriptors in the parent, they will not be changed and the parent process knows that it should avoid concurrent modification of any of the file descriptors that are referenced in the sfork context (you can always write buggy code, but at least in this specific use case there is a mechanism for writing correct code). This may also let you avoid some locking in the kernel because you can’t create new threads in the sfork context and so you can skip any locking while modifying the child’s FD table.

            For completeness, not that you mentioned this: The other class of alternative is a system call which takes a list of file descriptors and copies them all at once into a new fd table.

            This is what I was originally considering. Actually a list of index-fd pairs, so that you can specify gaps in the FD table. This has the phase ordering problem that you mention though - it’s fine if you want to populate those bits of the child’s FD table and then open things in the child context, it doesn’t work if you need to open some things in the child and then use some things from the parent. I’m not sure that this is actually a required pattern but the sfork_dup2 mechanism is strictly more expressive since you could ask another thread to open files for you after you’ve dropped privileges in the sfork thread (again, probably not something you’d actually want to do, but it is possible).

            In some sense this is what close_range(CLOSE_RANGE_UNSHARE) does

            I don’t know what that does. FreeBSD has a close_range system call, but not a CLOSE_RANGE_UNSHARE flag, if Linux has a close_range then it’s not documented.

            1. 1

              So the only four syscalls that you’d need are:

              pdsfork sfork_exit sfork_dup2 sfork_fexecve

              Yes, that makes sense to me. I only return pids because I wrote sfork before pidfds were merged, and I only use execveat because that’s the most powerful option on Linux, but I certainly prefer a capability approach instead.

              This has a lot of subtlety to get right.

              I agree with all this. I haven’t yet thought of a clean way to approach it that is also generic to many use cases. Piggy-backing on SCM_RIGHTS is a nice approach for rsyscall (where the two processes execute concurrently), maybe there’s an SCM_RIGHTS-based approach that works for sfork?

              I don’t know what that does. FreeBSD has a close_range system call, but not a CLOSE_RANGE_UNSHARE flag, if Linux has a close_range then it’s not documented.

              I linked the manpages in my original comment (you may not have seen them because I edited them in): https://lobste.rs/s/vzavsz/sfork_synchronous_single_threaded#c_vbldck

              But, here’s the Linux close_range(2) manpage, complete with CLOSE_RANGE_UNSHARE: https://man7.org/linux/man-pages/man2/close_range.2.html

          2. 1

            Oh, I should also say: Even if you use an sfork_dup2 variant you should still automatically copy all fds which are not marked CLOEXEC. That way you preserve the traditional Unix implicit-fd-inheritance feature, which is desirable for more than just stdout/in/err. You could imagine using it for a root filesystem or cwd fd, or to inherit the capability to perform some operation that is configured by an also-implicitly-inherited environment variable.

            In programming languages, this is like “implicit parameters”, which are a useful feature. In Unix, the same is true: It’s a useful feature, not a bug.

            The design bug is just that implicit-fd-inheritance is on by default, which is very annoying; it should be that CLOEXEC is the default and you opt in to implicit-fd-inheritance. But that’s no reason to remove the feature entirely.

            1. 1

              I strongly disagree on this: the first thing that I always end up doing in vfork is calling closefrom to ensure that I haven’t accidentally inherited any file descriptors. The UNIX behaviour was fine in single-threaded UNIX programs, but it’s inherently racy in multithreaded programs because any other thread may have called library functions that open files and then close them and don’t bother setting O_CLOEXEC because they are closing the file descriptor in a local scope.

              If you rely on implicitly inheriting file descriptors then you rely on every library routine that every other thread in your program calls correctly setting O_CLOEXEC on every file descriptor. You also rely on some kernel APIs that don’t exist. For example (as far as I can tell) there is no socket or socketpair analogue (even in Linux, which has done a pretty good job of adding these) that provides a CLOEXEC flag, so the other thread needs to call socket and then fcntl(, F_SETFD) and hope that you didn’t do the [s,pd,v]fork call in between the socket call and the fcntl call.

              It also violates the principle of intentionality. You should never exercise rights without explicitly choosing to. Inheriting a file descriptor without explicitly choosing to is passing a set of rights to the child process without meaning to, which is a non-intentional exercise of privilege. There have been a lot of security vulnerabilities caused by this mechanism, which is why closefrom exists and why calling it before execve is something that you’ll find in any secure programming guide for *NIX.

              1. 2

                but it’s inherently racy in multithreaded programs because any other thread may have called library functions that open files and then close them and don’t bother setting O_CLOEXEC because they are closing the file descriptor in a local scope.

                If you rely on implicitly inheriting file descriptors then you rely on every library routine that every other thread in your program calls correctly setting O_CLOEXEC on every file descriptor.

                Absolutely, so that’s why CLOEXEC should be the default.

                IMO, there should be a PR_DEFAULT_TO_CLOEXEC prctl that one can set, so that all new file descriptors are created with CLOEXEC by default. Then processes could opt-in to the sane behavior and not have to worry about libraries calling functions without passing CLOEXEC.

                For example (as far as I can tell) there is no socket or socketpair analogue (even in Linux, which has done a pretty good job of adding these) that provides a CLOEXEC flag,

                SOCK_CLOEXEC works for socket and socketpair, at least on Linux. I believe Linux has essentially complete coverage for CLOEXEC-everywhere.
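
                As a small illustration, assuming Linux or FreeBSD: passing SOCK_CLOEXEC in the type argument sets close-on-exec when the sockets are created, so there is no window between socket creation and a separate fcntl call. The helper names cloexec_socketpair and fd_is_cloexec are made up for this sketch.

                ```c
                #include <assert.h>
                #include <fcntl.h>
                #include <sys/socket.h>
                #include <unistd.h>

                /* Create a connected AF_UNIX stream pair with FD_CLOEXEC set atomically. */
                int cloexec_socketpair(int sv[2])
                {
                    assert(sv != NULL);
                    return socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, sv);
                }

                /* Returns 1 if fd has FD_CLOEXEC set, 0 if it does not, -1 on error. */
                int fd_is_cloexec(int fd)
                {
                    int flags = fcntl(fd, F_GETFD);
                    if (flags < 0)
                        return -1;
                    return (flags & FD_CLOEXEC) != 0;
                }
                ```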

                It also violates the principle of intentionality. You should never exercise rights without explicitly choosing to. Inheriting a file descriptor without explicitly choosing to is passing a set of rights to the child process without meaning to, which is a non-intentional exercise of privilege.

                Before I give my take, let me first establish that I’m hard-core in favor of explicit capability-security. I want to go as far as possible in removing any kind of implicit authority. I’ve worked hard on that in rsyscall, and I think rsyscall is a good demonstration of how explicit over implicit gives you really amazing powers and how it’s really worth it to be explicit.

                That being said, it’s precisely my hard-core belief in explicit-passing-of-capabilities that makes me interested in the places where implicit-passing is useful. Environment variables and exception handlers are two widespread examples of implicit-passing/dynamic-scope. Many people find those very useful.

                Note that implicit inheritance is already widespread in Unix: the root directory, CWD, lots of other things that are implicitly copied by a fork. In some sense, the kernel and host you’re running on is something you implicitly inherit, and it’s hard to get rid of that (rsyscall is my attempt). I emphatically agree that we should make these all explicit.

                But, it’s sometimes useful to have things be implicit, and why shouldn’t user programs be able to make their own implicitly-inherited capabilities, if they explicitly choose to make something implicit? They can implicitly inherit data with environment variables; why not capabilities? Why should only the OS designers be allowed to implicitly inherit capabilities?

                1. 1

                  IMO, there should be a PR_DEFAULT_TO_CLOEXEC prctl that one can set, so that all new file descriptors are created with CLOEXEC by default. Then processes could opt-in to the sane behavior and not have to worry about libraries calling functions without passing CLOEXEC.

                  I don’t disagree in principle, but I think it’s a very difficult thing to retrofit to the ecosystem, for a couple of reasons. First, you’d need to add a NOCLOEXEC flag to everything that currently has a CLOEXEC flag, and that flag space is quite constrained. Not insurmountable though, especially on Linux where they’re now favouring system calls that take a pointer to an extensible structure as the parameter and ignoring the fact that this completely breaks seccomp-bpf’s ability to do anything useful.

                  Second, it affects the behaviour of all libraries, so any library that actually wants the inherited behaviour would need to be modified. My guess is that this is approximately zero libraries, so maybe it’s not a problem.

                  SOCK_CLOEXEC works for socket and socketpair, at least on Linux. I believe Linux has essentially complete coverage for CLOEXEC-everywhere.

                  Thanks, I missed that in the man page. It’s also there for FreeBSD.

                  That being said, it’s precisely my hard-core belief in explicit-passing-of-capabilities that makes me interested in the places where implicit-passing is useful. Environment variables and exception handlers are two widespread examples of implicit-passing/dynamic-scope. Many people find those very useful.

                  Environment variables aren’t implicit. They are an explicit parameter to execve and there is no default of ‘pass all of my environment variables to the child’. This is easy to do (just pass environ to that parameter) but the important thing is that, at the point of use, there is an explicit policy decision on what to pass. Shells, in general, pass everything. Things like ssh and sudo provide an allow list and pass any of the listed variables that are set. Some things provide a block list and pass anything that isn’t on that list.
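
                  An allow-list policy of that kind can be sketched as below. The function name filter_env and its signature are invented for illustration; this is not how ssh or sudo implement it.

                  ```c
                  #include <assert.h>
                  #include <stddef.h>
                  #include <string.h>

                  /*
                   * Hypothetical helper: copy into `out` the entries of `envp` whose
                   * variable names appear in `allow`. `cap` is the capacity of `out`,
                   * including the terminating NULL. Returns the number of entries copied.
                   */
                  size_t filter_env(char *const envp[], const char *const allow[],
                                    char *out[], size_t cap)
                  {
                      assert(cap > 0);
                      size_t n = 0;
                      for (size_t i = 0; envp[i] != NULL && n + 1 < cap; i++) {
                          for (size_t j = 0; allow[j] != NULL; j++) {
                              size_t len = strlen(allow[j]);
                              /* Match the whole name: "PATH" must not match "PATHX=1". */
                              if (strncmp(envp[i], allow[j], len) == 0 && envp[i][len] == '=') {
                                  out[n++] = envp[i];
                                  break;
                              }
                          }
                      }
                      out[n] = NULL;
                      return n;
                  }
                  ```

                  The resulting array is what would be passed as the third argument to execve, in place of environ.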

                  Exceptions come in many forms but in general unchecked exceptions (i.e. ones where the type system doesn’t define the set of exceptions that can be thrown) are a massive source of bugs (and security vulnerabilities). C++ has only unchecked exceptions and every large-scale C++ project that I’ve worked on has had a policy of ‘don’t use exceptions’ (often compiling with -fno-exceptions to enforce this). Java has a mix: everything is a checked exception except subclasses of RuntimeException and Error, which are unchecked. Any object access can throw a NullPointerException, and there’s nothing in the type system that lets you prove to the compiler that your function can’t throw one. Similarly, any allocation can throw an OutOfMemoryError. The unchecked exceptions in Java have been a source of a large number of bugs and security vulnerabilities.

                  Note that implicit inheritance is already widespread in Unix: the root directory, CWD, lots of other things that are implicitly copied by a fork. In some sense, the kernel and host you’re running on is something you implicitly inherit, and it’s hard to get rid of that (rsyscall is my attempt). I emphatically agree that we should make these all explicit.

                  The root directory is implicit, but in a capability world you don’t have access to the namespace that contains it. CWD is one of my pet hates in *NIX. It should not be something that the kernel provides, there’s no need for it (especially with openat and friends), userspace APIs could implement it for programs that want it. Again, both of these have been sources of vulnerabilities. For example, chroot is restricted to root because if you could chroot and run a setuid binary then you could trick it into modifying the wrong bit of the filesystem. Similarly, there used to be vulnerabilities from running some setuid daemons that then wrote to their current directory, when they expected to be run in a specific place.

                  But, it’s sometimes useful to have things be implicit, and why shouldn’t user programs be able to make their own implicitly-inherited capabilities, if they explicitly choose to make something implicit? They can implicitly inherit data with environment variables; why not capabilities? Why should only the OS designers be allowed to implicitly inherit capabilities?

                  Because someone has to reason about the security properties of the system. If I run a child process with a restricted set of rights, I can audit the code around process creation to be able to tell exactly what the child process can do. If rights can be implicitly inherited then I have to audit millions of lines of code to see what can be inherited.

                  UNIX has a few bits of ambient authority, but I regard these as bugs not features.

                  1. 1

                    First, you’d need to add a NOCLOEXEC flag to everything that currently has a CLOEXEC flag, and that flag space is quite constrained

                    Just on this point, I don’t think adding such flags would be necessary; unsetting CLOEXEC after the fact using the existing fcntl F_SETFD is sufficient for all correct usage.
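
                    For reference, a minimal sketch of that opt-in step using the existing fcntl interface. The helper name make_inheritable is made up for illustration.

                    ```c
                    #include <assert.h>
                    #include <fcntl.h>
                    #include <unistd.h>

                    /* Clear FD_CLOEXEC on fd so that it survives a later exec.
                     * Returns 0 on success, -1 on failure. */
                    int make_inheritable(int fd)
                    {
                        assert(fd >= 0);
                        int flags = fcntl(fd, F_GETFD);
                        if (flags < 0)
                            return -1;
                        return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
                    }
                    ```

                    Clearing the flag late is harmless in a way that setting it late is not: until the call is made the descriptor is merely not inherited, so a concurrent fork and exec in another thread cannot leak it.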

                    1. 1

                      Yes, you’re right - you don’t have any atomicity requirements in this direction. Possibly the explicit F_SETFD would be sufficient - I doubt anyone is creating enough intentionally inherited file descriptors that an extra syscall would cost a measurable amount of performance.

        2. 1

          It isn’t really clear to me what the advantage is over posix_spawn(). As you note, you’re going to end up needing sfork-specific functions for arranging for things in the child – which is pretty much already how posix_spawn() works.

          1. 3

            You don’t need sfork-specific functions for doing anything in the child that you can do with normal system calls. If you want to make posix_spawn as expressive as sfork or vfork then you need spawn operations for opening files, creating / binding sockets, along with setuid, jail_attach, cap_enter, and any other things that change credentials. With sfork you don’t need any of these. The only thing that you would need (if it follows my suggestion of starting with an empty file-descriptor table) is a mechanism for copying a file descriptor from the parent to the child.