1. 44
  1.  

  2. 10

    That’s pretty impressive.

    On FreeBSD, memfd_create is spelled shm_open. There’s a libc shim that provides a Linux-compatible interface. futex is spelled _umtx_op, so it could also be supported to some degree (except for the complex bitmap versions). Since FreeBSD 12, getrandom has been a system call (though the sysctl mechanism still works), to play better with sandboxing mechanisms that block sysctl access. The equivalents of sched_{set,get}affinity are cpuset_{set,get}affinity, though the mapping isn’t quite 1:1. kqueue can implement epoll (there’s a shim compat library), but it’s a much nicer API to work with, so I’d prefer that a portable libc exposed kqueue and used libkqueue on Linux.
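
    As a rough illustration of the getrandom point: a minimal sketch, assuming a libc that declares getrandom in <sys/random.h> (glibc and FreeBSD 12+ both do, as far as I know). The helper name fill_random is made up for this example.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <sys/random.h>
    #include <sys/types.h>

    /* Hypothetical helper: fill buf with len random bytes via getrandom(2).
     * Returns 0 on success, -1 on failure (errno is left as set by getrandom). */
    int fill_random(void *buf, size_t len) {
        unsigned char *p = buf;
        while (len > 0) {
            ssize_t n = getrandom(p, len, 0);
            if (n < 0) {
                if (errno == EINTR) continue; /* interrupted by a signal: retry */
                return -1;
            }
            p += n;            /* getrandom may return fewer bytes than asked */
            len -= (size_t)n;
        }
        return 0;
    }
    ```

    The same source should build unchanged on both systems, which is the attraction of having it as a real system call rather than a sysctl.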

    The sad thing about Linux being the base platform is that you miss out on things like cap_enter and cap_rights_limit. I’ve written compartmentalised software with the APIs that Linux gives you and it is incredibly painful in comparison.

    1. 6

      This is very useful information. Thank you! Looks like shm_open is supported on XNU too. Tried to use _umtx_op once before, but it didn’t work out for some reason. I’ll have to give it another try.

      1. 3

        Tried to use _umtx_op once before, but it didn’t work out for some reason

        By default, it will operate on 64-bit words (in the futex-like mode), you will need to pass it the 32-bit flag to make it behave the Linux-compatible way. This is an unfortunate choice with Linux because you can typically emulate the 32-bit behaviour in userspace on top of a 64-bit primitive (spurious wakes are allowed), but not vice versa.

        1. 2

          That was likely it. Cosmo only uses futexes internally at the moment so it shouldn’t be a problem. There actually have been cases where we’ve used the FreeBSD APIs as our baseline, e.g. MAP_STACK, and we do try to make FreeBSD-only features available for use in Cosmo. I like FreeBSD a lot because it was the first unix operating system I installed in my home as a kid.

        2. 2

          While we’re giving feedback on specific syscalls, macOS’s libsystem has had native implementations of clock_gettime since 10.12, although I suppose it doesn’t map 1:1 to a syscall, so is that what the table reflects?

          (Both monotonic and realtime clocks are supported, although I suspect either CLOCK_MONOTONIC_RAW or CLOCK_UPTIME_RAW most closely match other platforms’ monotonic behaviour in edge cases, I don’t know how other platforms behave while the system is suspended though. If it’s speed you’re after at the expense of accuracy there are also _APPROX versions, though those definitely don’t map to syscalls as they are only updated on context switch.)
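
          For reference, a small sketch of reading a monotonic clock through the libc interface, preferring CLOCK_MONOTONIC_RAW where the headers define it. The function name monotonic_nanos is an assumption for this example, and returning 0 on failure is just a simplification.

          ```c
          #include <stdint.h>
          #include <time.h>

          /* Hypothetical helper: nanoseconds from a monotonic clock, or 0 on failure. */
          uint64_t monotonic_nanos(void) {
              struct timespec ts;
          #ifdef CLOCK_MONOTONIC_RAW
              clockid_t id = CLOCK_MONOTONIC_RAW; /* not slewed by time adjustments */
          #else
              clockid_t id = CLOCK_MONOTONIC;     /* fallback where RAW is not defined */
          #endif
              if (clock_gettime(id, &ts) != 0) return 0;
              return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
          }
          ```

          Two successive calls should never go backwards, though what happens across a system suspend still depends on the platform.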

          1. 3

            Apple only lets dynamically linked programs use its clock_gettime implementation and Cosmopolitan Libc uses static linking. It’s possible to reverse engineer the shared memory page where Apple puts the conversion numbers but Cosmopolitan Libc would rather use the public well-known clockgettime() interface to be safe.

            1. 4

              More importantly: they don’t regard that interface as stable. Go didn’t go via libSystem and a minor update to macOS broke every single Go program.

              I’m actually not sure if Apple regards any of their syscall interface as stable. If you don’t go via libSystem then you may end up with things breaking between revisions.

              1. 3

                Apple doesn’t even consider their own libraries stable, if you statically link them. https://developer.apple.com/library/archive/qa/qa1118/_index.html However static linking isn’t forbidden. You can read my stance on the matter here: https://github.com/jart/cosmopolitan/issues/426#issuecomment-1166277706

                1. 3

                  That’s more or less what I thought. I don’t agree with your claim here:

                  The binary interface for UNIX SYSCALLs is something Apple inherited from the System V codebase (e.g. 1 for exit, 3 for read, etc.). They shouldn’t consider something like that an internal API, when it’s shared by so many operating systems.

                  Apple didn’t get anything from SysV: XNU was originally NeXTSTEP, where it was a 4BSD single-server UNIX on top of Mach, the 4BSD parts were moved into the Mach kernel and replaced with FreeBSD 5.x code, which were then updated. FreeBSD inherits the early system call numbering from BSD, which inherits them from AT&T UNIX, but they’ve always been local conventions. POSIX explicitly does not specify anything below the libc layer, it is a per-OS decision how to provide these interfaces and there’s no requirement that any of them are implemented in the kernel or in userspace. The codebase that NeXT originally inherited implemented UNIX system calls as Mach Port messages to the 4BSD server, not as direct system calls.

                  Solaris (which does inherit from SysV) similarly makes no guarantees about stability at the system call level and requires that you dynamically link their libc. This is often useful. System calls such as stat have grown larger structures over time. Without a stable system call ABI, you can use symbol versioning in userspace to provide two stat libc interfaces and have the conversion code run in libc. With a stable system call ABI, you must add a compat implementation of stat. Looking at the FreeBSD system call table, I see 83 COMPAT system calls. That’s a non-trivial amount of code running in ring 0, adding to the kernel attack surface, when it could be in ring 3.

                  Looking down the list (which Apple inherited when they pulled in the FreeBSD 5.x code), the first system call number that is maintained purely for legacy compatibility is 8. In NetBSD, it’s 7. It’s a bit of a stretch to suggest that this numbering is anything other than an ad-hoc list of numbers.
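
                  To make the stat point concrete, here is a toy sketch of the kind of conversion that can live in libc behind an old symbol version instead of in a COMPAT system call. The struct layouts and names (stat_v1, stat_v2, stat_v2_to_v1) are invented for illustration and do not match any real OS.

                  ```c
                  #include <errno.h>
                  #include <stdint.h>

                  /* Hypothetical layouts: v1 is the old, narrower struct that old
                   * binaries expect; v2 is what the kernel returns today. */
                  struct stat_v1 { uint32_t ino; uint32_t size; };
                  struct stat_v2 { uint64_t ino; uint64_t size; };

                  /* Userspace narrowing that an old versioned stat symbol could run
                   * after calling the current system call. Returns 0 on success, or
                   * EOVERFLOW if a field does not fit in the old layout. */
                  int stat_v2_to_v1(const struct stat_v2 *in, struct stat_v1 *out) {
                      if (in->ino > UINT32_MAX || in->size > UINT32_MAX)
                          return EOVERFLOW;
                      out->ino = (uint32_t)in->ino;
                      out->size = (uint32_t)in->size;
                      return 0;
                  }
                  ```

                  With a stable system call ABI, the equivalent of this function has to sit in the kernel; without one, it runs in ring 3.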

                  1. 1

                    If anything, Linux (and Plan 9) is the odd one out. Everyone else strongly recommends against it or has kludges to make it keep working (FreeBSD, macOS, Solaris), or will intentionally make it difficult (Windows, OpenBSD).

              2. 1

                Ah, that makes sense! The official realtime clock implementation literally just calls gettimeofday itself, so obviously that’s the way to go.

                If you’re not using the commpage (I’m not sure it’s quite reverse engineering to use the header file from the source code…) how do you scale the value from the TSC to a known unit? As far as I’m aware those scale factors are also communicated via the commpage. Or is it a case of assuming they’ll be the same for all modern Intel Macs, so YOLO and hardcode them?

                1. 2

                  clock_gettime(CLOCK_REALTIME) won’t scale RDTSC. We only do that now with CLOCK_MONOTONIC: on a couple of platforms the wrapper turns ticks into nanos by dividing them by three, and it only does that when the processor says that RDTSC is invariant. It’s not great but it’s really the best we can do. Intel claims there’s a CPUID feature that’ll give us the conversion factor, but it’s unavailable on every CPU I’ve tested. The kernel has a way of getting this information in ring0, but there’s no reliable API for obtaining it from the kernel. Based on my testing, dividing by three comes very close on the Intel and AMD chips I’ve tested, but it may drift a bit from real time over time, even though it should maintain the monotonic invariant.
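
                  A sketch of what that scaling amounts to, not Cosmopolitan’s actual code: dividing by three is the same as assuming a 3 GHz tick rate, and a general numerator/denominator pair is what you would use if a real conversion factor were available. The function names are made up, and unsigned __int128 is a GCC/Clang extension that is not available on every target.

                  ```c
                  #include <assert.h>
                  #include <stdint.h>

                  /* Hypothetical helper: ticks * numer / denom, computed with a 128-bit
                   * intermediate so the multiplication cannot overflow. */
                  uint64_t ticks_to_nanos(uint64_t ticks, uint32_t numer, uint32_t denom) {
                      assert(denom != 0);
                      return (uint64_t)(((unsigned __int128)ticks * numer) / denom);
                  }

                  /* The divide-by-three shortcut: assumes roughly 3 ticks per nanosecond. */
                  uint64_t ticks_to_nanos_div3(uint64_t ticks) {
                      return ticks_to_nanos(ticks, 1, 3);
                  }
                  ```

                  Because the factor is a constant, the result never decreases as ticks increase, which is why the monotonic invariant holds even when the assumed rate is slightly off and the value drifts from real time.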

          2. 1

            On FreeBSD, memfd_create is spelled shm_open.

            No it’s not. memfd_create on FreeBSD is called memfd_create. Linux of course has shm_open, too, and memfd_create is different. In particular, shm_open has race conditions because every process uses the same namespace (in /dev/shm). memfd_create avoids this by letting every process have its own namespace.

            1. 3

              FreeBSD added anonymous shared memory objects support to shm_open some years before Linux got memfd_create. FreeBSD’s memfd_create is a userspace wrapper around the syscall. Your link is to the newer shm_open2, which adds flags for a Linux-compatible sealing mechanism. I consider any code that uses this feature to be deeply suspect because it relies on correct error-handling paths for security and if your security depends on the least tested (and hardest to test) parts of your code then it’s almost certainly broken.
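
              Roughly, the two spellings look like this. It is a sketch that only covers FreeBSD and Linux; the helper name anon_shm_fd is invented, and on Linux it goes through the raw system call only so the example does not depend on how the libc wrapper is declared.

              ```c
              #if defined(__linux__)
              #include <linux/memfd.h>   /* MFD_CLOEXEC */
              #include <sys/syscall.h>   /* SYS_memfd_create */
              #elif defined(__FreeBSD__)
              #include <fcntl.h>
              #include <sys/mman.h>      /* shm_open, SHM_ANON */
              #endif
              #include <sys/types.h>
              #include <unistd.h>

              /* Hypothetical helper: fd for an unnamed shared memory object of
               * `size` bytes, or -1 on failure. */
              int anon_shm_fd(size_t size) {
              #if defined(__FreeBSD__)
                  int fd = shm_open(SHM_ANON, O_RDWR, 0600);
              #elif defined(__linux__)
                  int fd = (int)syscall(SYS_memfd_create, "anon_shm", MFD_CLOEXEC);
              #else
                  int fd = -1;  /* other platforms are not covered by this sketch */
              #endif
                  if (fd < 0) return -1;
                  if (ftruncate(fd, (off_t)size) != 0) {
                      close(fd);
                      return -1;
                  }
                  return fd;
              }
              ```

              Either way the object never appears in a shared namespace, so there is no name to race on. The descriptor can then be passed to mmap or sent to another process.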

          3. 6

            Please excuse my ignorance, but what does Metal mean in this context (the columns are Linux, FreeBSD, OpenBSD, NetBSD, MacOS, Windows, Metal)?

            1. 8

              Running on bare metal, i.e. directly from a bootloader, in ring 0, with no OS.

              1. 2

                You can syscall read? That’s pretty cool, didn’t know this, among the others…!

                1. 1

                  I don’t know if there’s filesystem support, I would imagine the read syscall is for reading from the keyboard through stdin

                  1. 3

                    Cosmopolitan’s read() function on bare metal is able to read from the serial port. There is some file system support. If your executable is also a zip file, then you can open() and read() the assets you’ve stored inside it.

                    1. 3

                      That makes sense, I forgot about the ZIP file functionality! I’m impressed there’s filesystem support - which filesystems currently have some support?

                      1. 5

                        The zip filesystem :-) It’ll be a glorious day when we develop it into a read/write rather than read-only system.

            2. 1

              How is FSGSBASE broken on windows? Curious as I have at least one application where it provides an appreciable performance boost; tested only on linux so far.

              1. 2

                I wrote some FSGSBASE test code that worked on Linux and FreeBSD. When I ran it on Windows, it produced a different number, as though it had read data from some entirely different address. If you’re saying it’s supposed to work, then do you know where Microsoft documents that fact? Is there a WIN32 API I need to call to unlock this? Also could you share what your use case is for FSGSBASE? I think it’s so cool and I want to learn more about ideas for its practical application.

                1. 3

                  If you’re saying it’s supposed to work, then do you know where Microsoft documents that fact? Is there a WIN32 API I need to call to unlock this?

                  No idea—was just curious if you knew anything. I am very windows-ignorant, but would like to get my code running there at some point if possible.

                  could you share what your use case is for FSGSBASE? I think it’s so cool and I want to learn more about ideas for its practical application

                  A parallel search problem. Ideally, this would be limited by throughput, but in order to eliminate latency as a bottleneck, you need to have sufficiently large buffers (cf). In this case, the buffer was GPRs—and I ran out. (Two registers might not seem like a lot, but if it’s what makes the difference between pipelining 3 and 4 searches at a time, you pay attention.)

                  I think I also came up with an application for semispace gcs at one point, never implemented that & don’t remember the details, unfortunately.

                  A problem with use of such features is interoperability. For my search problem, it was worth it to set and restore the segment registers around the search logic; doing that around every call is less savoury. If one prohibits FFI entirely, as is done by SICL—or simply relegates it to its own threads, like GHC—there may be room for exploitation in a language runtime for accessing various pieces of dynamic context.

                  1. 2

                    Thank you for sharing! Based on my microbenchmarks, the instructions go about as fast as mov so I suspect they could even be exploited not just by language runtimes, but also by gcc/clang in general. Any complex function that isn’t calling external apis or using tls would have two additional base address registers. But since you need a 2012+ CPU with a 2021+ kernel, its usefulness would be limited to a pretty small audience who’d have to enable it explicitly using multiple flags. The hardest part about it though, that makes it so stymied, is the CPUID test for it doesn’t work. Kernels would need to introduce a system call to let portable programs know if it’s safe to use.
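
                    On Linux specifically, the kernel documentation describes an aux-vector bit (HWCAP2_FSGSBASE) that reports whether user-mode FSGSBASE is enabled, so a check along these lines may be enough there. This is a sketch: the function name fsgsbase_usable is made up, the fallback definition of the bit is copied from the kernel’s x86 documentation, and other operating systems are simply treated as unsupported.

                    ```c
                    #include <stdbool.h>
                    #if defined(__linux__)
                    #include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
                    #endif

                    #ifndef HWCAP2_FSGSBASE
                    #define HWCAP2_FSGSBASE (1UL << 1) /* value given in the Linux x86 docs */
                    #endif

                    /* Hypothetical helper: true only when Linux on x86-64 reports that
                     * the FSGSBASE instructions are enabled for user mode. */
                    bool fsgsbase_usable(void) {
                    #if defined(__linux__) && defined(__x86_64__)
                        return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
                    #else
                        return false; /* assumption: no equivalent check on other platforms */
                    #endif
                    }
                    ```

                    That still leaves the portability problem on every other kernel, where the CPUID bit alone says nothing about whether the OS has turned the feature on.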

                    1. 2

                      Kernels would need to introduce a system call to let portable programs know if it’s safe to use.

                      Kernels generally have interfaces to explicitly modify these registers (for the threading library to use), so any system with VDSO could just move the implementation into userspace. The CR4 bit was originally added for JVMs, which implement an N:M threading model with a potentially large number of userspace threads and the cost of the system call when toggling TLS for them was fairly noticeable. We got support pushed into the Linux kernel because it was useful with SGX enclaves (if you want to implement threads inside an enclave, you really don’t want a transition outside for the kernel to modify the register for you, and you need to do an exciting trampoline dance when it does because the register state set by the kernel is untrusted).

                      It would probably be a bad idea for GCC / clang to treat the TLS register as a GPR because signal handlers can get very confused if TLS doesn’t work.

                      1. 1

                        The CR4 bit was originally added for JVMs, which implement an N:M threading model with a potentially large number of userspace threads and the cost of the system call when toggling TLS for them was fairly noticeable

                        The way I heard it, JVMs that had this problem used the base of the stack as a TLS pointer (or as some kind of TLS key).

                      2. 1

                        Based on my microbenchmarks, the instructions go about as fast as mov

                        Huh, what hardware is this on? In a quick benchmark on skx, wr[fg]sbase seems to be considerably more expensive than mov, probably low-mid 10s of cycles.