1. 6
  1.  

  2. 2

    Nice work, and good write-up about the process. Having recently used posix_spawn for the first time, it does give the impression that eventually almost every syscall should be available as a file_action or attr.

    It feels like it would be more composable (and maybe more unix-like?) to just spawn a little stub executable that contains all the post-fork initialization code before the desired executable is called with exec. Of course, that way, you miss out on posix_spawn’s ability to capture error codes and return them to the parent process.
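    For readers who haven't used it, here is a minimal sketch of the file_action style and the error-code return mentioned above. The helper name `spawn_sh_quiet` is made up for illustration, and it assumes `/bin/sh` and `/dev/null` exist. Note that POSIX also allows an implementation to report an exec failure as exit status 127 from the child rather than as a return value, so how much you get back in the parent depends on the libc.

    ```cpp
    #include <fcntl.h>
    #include <spawn.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    // Hypothetical helper (not from the article): run `sh -c script` with
    // stdout redirected to /dev/null. Returns the child's exit code, or -1
    // if spawning or waiting failed.
    int spawn_sh_quiet(const char *script) {
        posix_spawn_file_actions_t actions;
        if (posix_spawn_file_actions_init(&actions) != 0) return -1;

        // Queue an open() that the child performs on fd 1 before the exec.
        int err = posix_spawn_file_actions_addopen(
            &actions, STDOUT_FILENO, "/dev/null", O_WRONLY, 0);

        pid_t pid = -1;
        if (err == 0) {
            char *const argv[] = {const_cast<char *>("sh"),
                                  const_cast<char *>("-c"),
                                  const_cast<char *>(script), nullptr};
            char *const envp[] = {nullptr};  // empty environment, for brevity
            // posix_spawn returns an error number directly (it does not set
            // errno), which is the "error codes in the parent" property.
            err = posix_spawn(&pid, "/bin/sh", &actions, nullptr, argv, envp);
        }
        posix_spawn_file_actions_destroy(&actions);
        if (err != 0) return -1;

        int status = 0;
        if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
        return WEXITSTATUS(status);
    }
    ```

    Every additional bit of child setup (chdir, setsid, resource limits, ...) needs its own file_action or attr in this model, which is the impression described above.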

    1. 2

      Nice work, and good write-up about the process. Having recently used posix_spawn for the first time, it does give the impression that eventually almost every syscall should be available as a file_action or attr.

      This is precisely why:

      • posix_spawn is a terrible API and
      • There is no good reason to implement posix_spawn in the kernel.

      It is a horrible compromise design that was created with the express design constraint that it must be possible to implement entirely in userspace. This means that it can’t be any more expressive than vfork; it is just an API that is more restrictive and therefore harder to misuse.

      It feels like it would be more composable (and maybe more unix-like?) to just spawn a little stub executable that contains all the post-fork initialization code before the desired executable is called with exec. Of course, that way, you miss out on posix_spawn’s ability to capture error codes and return them to the parent process.

      This is almost what vfork does, except that it’s a function not a complete executable: it has the weird ‘returns-twice’ behaviour, so the typical way of using it is to call a setup function in the returns-as-child version. This can then do any system calls. It’s a bit clunky to use because you must not use malloc in that context[1], but if you do all of your allocation in the parent context, call vfork, do the system calls to set up child state, and then execve, when vfork returns the second time you can clean everything up and you get exactly the behaviour that you want.

      I use a wrapper around vfork that takes a lambda to run in the child context, which gives a much more UNIX-like behaviour than posix_spawn. Adding posix_spawn in the kernel adds a load of extra code that runs in the kernel, for no real benefit (unless you’re doing a truly huge amount of work in the spawn call, such that the extra system calls in the vfork context would not be completely dwarfed by the process-creation overhead).

      [1] Well, you can if you’re careful. After execve, any allocated objects will remain allocated in the parent, so you must make sure that you capture them in the parent for cleanup. This is much easier in C++, where you can use RAII to capture all of the syscall arguments by having a std::vector or similar in the parent context that is passed by reference into the child vfork context. Or you can just use a block inside the child context that ends its scope immediately before the execve call, so that everything is deallocated in the child just before the new binary is loaded.
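      A rough sketch of what such a wrapper could look like, not the actual code described above. The names `vfork_then_exec` and `run_sh_via_vfork` are made up for illustration. It assumes a vfork that really shares the parent's address space and suspends the parent until exec or exit (as on Linux and the BSDs), which is what lets the child hand its errno back through a plain variable. POSIX formally leaves calling arbitrary functions in the vfork child undefined, so treat this as relying on implementation behaviour.

      ```cpp
      #include <cerrno>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      // Hypothetical wrapper: run `child_setup` in the vfork child, then exec.
      // Returns the child's pid, or -1 with errno set if vfork or execve failed.
      template <typename Setup>
      pid_t vfork_then_exec(const char *path, char *const argv[],
                            char *const envp[], Setup &&child_setup) {
          // Written by the child, read by the parent. This only works because
          // the vfork child is assumed to share the parent's memory.
          volatile int child_errno = 0;

          pid_t pid = vfork();
          if (pid == 0) {
              // Child context: system calls only, no allocation (see [1]).
              child_setup();
              execve(path, argv, envp);
              child_errno = errno;  // only reached if execve failed
              _exit(127);
          }
          if (pid > 0 && child_errno != 0) {
              // The exec failed: reap the child and report its errno.
              int status = 0;
              waitpid(pid, &status, 0);
              errno = child_errno;
              return -1;
          }
          return pid;  // -1 here means vfork itself failed, errno already set
      }

      // Usage sketch: run `sh -c script` from "/" and return its exit code,
      // or -1 on failure. Assumes /bin/sh exists.
      int run_sh_via_vfork(const char *script) {
          // All arguments are prepared in the parent, before vfork.
          char *const argv[] = {const_cast<char *>("sh"),
                                const_cast<char *>("-c"),
                                const_cast<char *>(script), nullptr};
          char *const envp[] = {nullptr};

          pid_t pid = vfork_then_exec("/bin/sh", argv, envp, [] {
              // Any post-fork setup goes here as ordinary system calls.
              if (chdir("/") != 0) _exit(126);
          });
          if (pid < 0) return -1;

          int status = 0;
          if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) return -1;
          return WEXITSTATUS(status);
      }
      ```

      The lambda plays the role of a file_action list, except that it can contain any system call, and the exec error still comes back to the parent.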

      1. 1

        That’s actually quite clever. The project I was working on is actually written in C++, but writing this kind of syscall-heavy code tends to put my brain in ‘C’-mode.

        1. 1

          This is almost what vfork does, except that it’s a function not a complete executable: it has the weird ‘returns-twice’ behaviour, so the typical way of using it is to call a setup function in the returns-as-child version.

          If you like vfork except for this, you might like my sfork variation, which is like vfork, except it doesn’t return twice: https://github.com/catern/sfork

          Edit: for anyone coming across this later, I submitted this as a top-level submission and discussed it there: https://lobste.rs/s/vzavsz/sfork_synchronous_single_threaded

        2. 1

          It feels like it would be more composable (and maybe more unix-like?) to just spawn a little stub executable that contains all the post-fork initialization code before the desired executable is called with exec. Of course, that way, you miss out on posix_spawn’s ability to capture error codes and return them to the parent process.

          More linuxy than unixy, but how about a system call like ebpf_spawn(...) that would run a short bytecode program before exec?

          1. 4

            Aside from the fact that eBPF is a security disaster, why run a bytecode program in the kernel to do something that you can already do with a native program in userspace?