1. 17
    1. 5

      Great overview, which would have been very useful to me two years ago.

      FreeBSD’s UMTX_OP_SEM2_WAIT operation on _umtx_op has a feature that I liked so much I copied it into every API that takes a timeout in CHERIoT RTOS: If it returns for any reason other than timeout, the timeout value is updated to contain the remaining time. This means that you can retry by simply passing the same timeout in a loop and will eventually exit with a timeout. I wish the other _umtx_op things had this.
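
      The write-back makes the retry loop trivial. A rough sketch, assuming the _umtx_op(2) convention of passing the timeout buffer in uaddr2 and its size in uaddr (the exact buffer layout for the copied-out remaining time is per the man page):

        #include <sys/types.h>
        #include <sys/umtx.h>
        #include <errno.h>
        #include <stdint.h>

        /* On any non-timeout return, the kernel updates the timeout with the
         * remaining time, so the same buffer can be reused unchanged. */
        int sem2_wait_timed(struct _usem2 *sem, struct _umtx_time *tmo,
                            size_t tmo_size)
        {
            for (;;) {
                if (_umtx_op(sem, UMTX_OP_SEM2_WAIT, 0,
                             (void *)(uintptr_t)tmo_size, tmo) == 0)
                    return 0;              /* semaphore acquired */
                if (errno == ETIMEDOUT)
                    return ETIMEDOUT;      /* the whole budget has elapsed */
                /* EINTR etc.: tmo now holds the remaining time; just loop. */
            }
        }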

      Use UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE for an intra-process futex; there are no 64-bit intra-process futexes.

      I’m not sure what this means, but I don’t think it’s right. By default, the FreeBSD APIs operate on uintptr_t, so they’re 64-bit in a 64-bit process and 32-bit in a 32-bit process. There’s a UMTX_OP__32BIT flag that lets you do a 32-bit operation, which is mostly there for memory segments shared between 32- and 64-bit processes, but is also useful for Linux-compatible code that wants a 32-bit raw futex. It’s not, unfortunately, possible to do 64-bit operations from 32-bit processes (I suspect because some supported 32-bit architectures don’t have 64-bit atomics).

      The FreeBSD futex implementation in the Linux ABI layer was recently rewritten to share its implementation with _umtx_op, which should make it possible to have a futex in shared memory that’s used by both Linux and FreeBSD processes, though I don’t think anyone has tried. I hope that FreeBSD will eventually expose the futex system call in the native system call table: I much prefer _umtx_op for code I’m writing, but not having to provide a futex-compat layer for code that was written on Linux makes porting easier.

      Windows has a relatively spare interface, though it also has the dubious honour of being the only OS to implement 8- and 16-bit futexes (these can trivially be implemented in userspace).

      Only on systems without fine-grained memory safety. You can’t do it on CheriBSD for pure-capability binaries, for example, because you can’t do a 32-bit CAS on an object for which you hold only a one-byte-length pointer.

      I believe NetBSD has the same futex implementation as OpenBSD.

      1. 4

        If it returns for any reason other than timeout, the timeout value is updated to contain the remaining time. This means that you can retry by simply passing the same timeout in a loop and will eventually exit with a timeout. I wish the other _umtx_op things had this.

        I’m not sure I’d want to rely on that. (I guess it’s fine on an RTOS?) If the process is interrupted after the function has determined the remaining time, but before the next call is placed, the remaining time can be basically arbitrarily wrong. I think a cleaner interface would be:

        • have the timeout be a target time, not a duration
        • have a helper function that returns “current time plus offset”.

        That way, you can just pass in the same value on successive calls in a loop.
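
        A minimal sketch of that design, with a hypothetical wait_until() standing in for any blocking call that accepts an absolute CLOCK_MONOTONIC deadline:

          #include <errno.h>
          #include <time.h>

          extern int wait_until(void *obj, const struct timespec *deadline);

          /* The helper: "current time plus offset". */
          static struct timespec deadline_after(long offset_ns)
          {
              struct timespec t;
              clock_gettime(CLOCK_MONOTONIC, &t);
              t.tv_nsec += offset_ns;
              t.tv_sec += t.tv_nsec / 1000000000L;
              t.tv_nsec %= 1000000000L;
              return t;
          }

          int wait_with_budget(void *obj)
          {
              struct timespec deadline = deadline_after(50 * 1000 * 1000); /* 50 ms */
              int rc;
              /* Retrying with the identical deadline cannot stretch the total
               * wait, however late the retry is issued. */
              while ((rc = wait_until(obj, &deadline)) == EINTR)
                  ;
              return rc;
          }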

      2. 2

        Use UMTX_OP_WAIT_UINT_PRIVATE and UMTX_OP_WAKE_PRIVATE for an intra-process futex; there are no 64-bit intra-process futexes.

        I’m not sure what this means

        I’m talking about 32- vs 64-bit futex values, not 32- vs 64-bit platforms; I’m assuming a 64-bit cpu/kernel/userland in either event.

        32-bit inter-process: UMTX_OP_WAIT_UINT
        32-bit intra-process: UMTX_OP_WAIT_UINT_PRIVATE
        64-bit inter-process: UMTX_OP_WAIT
        64-bit intra-process: nothing

        The commit that added intra-process futexes added them only for 32-bit values, even though it postdates the 64-bit operations; I have no idea why it didn’t add them for 64-bit values too.

        Using exclusively inter-process futexes is bad for performance, because the kernel can encounter contention between unrelated futexes from completely unrelated processes; with intra-process futexes, it can use a per-process table, reducing contention.
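
        For reference, the 32-bit intra-process pair from the table above looks roughly like this (a minimal sketch; no timeout, so the two optional arguments are NULL):

          #include <sys/types.h>
          #include <sys/umtx.h>

          void wait32(u_int *addr, u_int expected)
          {
              /* Sleeps only while *addr still equals expected. */
              _umtx_op(addr, UMTX_OP_WAIT_UINT_PRIVATE, expected, NULL, NULL);
          }

          void wake32(u_int *addr, int nwaiters)
          {
              _umtx_op(addr, UMTX_OP_WAKE_PRIVATE, nwaiters, NULL, NULL);
          }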

        FreeBSD’s UMTX_OP_SEM2_WAIT operation on _umtx_op has a feature that I liked so much I copied it into every API that takes a timeout in CHERIoT RTOS: If it returns for any reason other than timeout, the timeout value is updated to contain the remaining time

        I dislike this for the same reason as FeepingCreature. Better might be if you pass in a relative time initially and it returns an absolute time you can use thereafter. But nowadays we have cheap, high-precision user-space timers, so I don’t think it’s a big deal either way. These are low-level APIs anyway; ergonomics should be a low priority.

        1. 1

          I’m talking about 32- vs 64-bit futex values, not 32- vs 64-bit platforms; I’m assuming a 64-bit cpu/kernel/userland in either event.

          That’s what I assumed, and that’s what was confusing me. It turns out I misread the man page, thanks. I thought the pointer for UMTX_OP_WAIT was a uintptr_t, not a long, and that the UMTX_OP__32BIT flag changed it to uint32_t on 64-bit systems.

          I dislike this for the same reason as FeepingCreature. Better might be if you pass in a relative time initially, and it returns an absolute time you can use thereafter

          Using absolute times for timeouts has all sorts of problems that phk enumerates clearly. They’re okay only if the clock is the monotonic clock; for everything else they are made entirely out of footguns.

          1. 1

            They’re okay only if the clock is the monotonic clock, for everything else they are made entirely out of footguns.

            When would it make sense to (relative) wait on a non-monotonic clock source?

          2. 1

            Using absolute times for timeouts has all sorts of problems that phk enumerates clearly.

            Who or what is phk? Absolute time is reliable and usable only with a monotonic clock, that goes without saying. Relative time leads to drift, because the delay between obtaining the remaining time and using it is unaccounted for; that may or may not be acceptable depending on the use case. It’s kind of a potential TOCTOU issue, but with the time resource itself.

            1. 1

              PHK is Poul-Henning Kamp, a prominent FreeBSD developer:

              https://en.wikipedia.org/wiki/Poul-Henning_Kamp

              Absolute time is reliable and usable only with a monotonic clock, that goes without saying.

              This then leads to the question of whether you can have an always-reliable monotonic clock, something that is not certain once you take external factors like NTP into the equation.

              But in the end it likely depends on what kind of timing you need.

    2. 3

      The article starts with:

      Futexes are clearly the best low-level synchronisation mechanism.

      Can someone explain why?

      1. 2

        simple, expressive, and performant

        1. 2

          I don’t think this is true. Futexes require synchronizing threads on process-wide shared state and get slower the more active and unique addresses there are being waited on.

          A better API for thread blocking/unblocking would be lwp_park/lwp_unpark from NetBSD or NtWaitForAlertByThreadId/NtAlertThreadByThreadId on Windows 8+. These signal a specific thread and wait as that same thread, so there is much less shared internal synchronization. They also don’t do any thread queueing, which means that’s the job of userspace (I believe for better, not worse).

          Windows “futex” (WaitOnAddress) is actually implemented in userspace using the alert functions from prior on individual threads after queueing to a shared table in the PEB (ProcessEnvironmentBlock). Windows’ SRWLOCK/CONDITION_VARIABLE and NetBSD’s pthread_mutex_t/pthread_cond_t also queue & wake threads manually, but without having to go through a process-wide shared state.
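
          To illustrate the shape of the alternative, here is a sketch of a one-shot event over a hypothetical park()/unpark() pair, assumed (like the NT alert APIs) to carry a wake token so an unpark that races ahead of the matching park is not lost. All the waiter bookkeeping is a single word in userspace; no process-wide table is touched:

            #include <stdatomic.h>

            typedef unsigned long tid_t;       /* hypothetical thread id   */
            extern void park(void);            /* block the calling thread */
            extern void unpark(tid_t t);       /* wake exactly thread t    */

            typedef struct {
                _Atomic int signaled;
                _Atomic tid_t waiter;
            } event_t;

            void event_wait(event_t *e, tid_t self)
            {
                atomic_store(&e->waiter, self);     /* publish ourselves first */
                while (!atomic_load(&e->signaled))
                    park();   /* the wake token covers the check-then-park gap */
            }

            void event_signal(event_t *e)
            {
                atomic_store(&e->signaled, 1);
                tid_t t = atomic_load(&e->waiter);
                if (t != 0)
                    unpark(t);
            }

          (With the default sequentially consistent atomics, the waiter and the signaler cannot both miss each other’s stores, so the wake is never lost.)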

          1. 1

            Futexes require synchronizing threads on a process-wide shared state and get slower the more active & unique addresses there are being waited on

            Futexes are usually implemented using a hash table; the hash table can be sized dynamically according to the number of active futexes in the process, so realised contention should be mostly true contention, with only a little extra from collisions and false sharing.

            1. 1

              Most seem to avoid dynamic allocation as it involves even more synchronization to concurrently grow or reclaim memory. For reference:

              • Linux futex at startup allocates 256 * ncpu buckets and uses chaining (O(n)) for collision resolution, so more colliding threads may mean more bucket-lock contention and longer queueing operations for threads waiting on unrelated resources.
              • macOS ulock does similar, but with more conservative bucket sizing.
              • Golang sema (its own futex) uses a fixed 251 buckets, but with a treap (O(log n)) for collision resolution.
              • Rust parking_lot does indeed grow a hash table, but that requires rehashing (locking all buckets), and it never frees / shrinks buckets either.
              1. 1

                it involves even more synchronization to concurrently grow or reclaim memory

                NonBlockingHashMap demonstrated concurrent resizing without the need for synchronisation in 2007. I can’t speak for the implementations, but the point is that there is nothing wrong with the policy.

                1. 1

                  If this is a reference to that of Java’s, NonBlockingHashMap relies on the JVM garbage collector for concurrent memory reclamation. This isn’t available in systems programming languages so they often use algs like EBR/HazardPointers/QSBR/etc. or just not freeing at all like in parking_lot’s case.

                  1. 1

                    Ideally we would simply not use languages without pervasive GCs, but yes, there are other mechanisms for concurrent memory reclamation. I haven’t used any of the other things you mention, but hazard pointers are pretty fast (especially when compared with the cost of a system call). The difficult part with a hash table is maintaining the table semantics while resizing, without the need to synchronise (if two threads both want to perform a resize, they need to synchronise—fine), which is what NonBlockingHashMap solves and which is what I assumed you were referring to.

                    1. 1

                      The synchronization overhead from resizing mostly refers to making sure the old hashmap memory isn’t invalidated after a resize while others are still using it, not to registering the new memory. To achieve this, the reclamation algs either add overhead to the readers of the table memory (EBR/HazardPtr) or require a safepoint all threads have to go through (QSBR). Both could leave the old map memory live for longer than necessary when used in a futex-like API, and the former slows down normal accesses. That is why growing, or at least freeing, futex memory is avoided in practice.

                      But coming back to the original point: All of this hashing, memory tracking, and shared locking inherent to futex APIs could be avoided with primitive-specific queues by leaving that to userspace and the OS only providing a thread park()/unpark() API.

                      1. 1

                        add overhead to the readers of the table memory (EBR/HazardPtr)

                        The hazard pointers I know (https://pvk.ca/Blog/2020/07/07/flatter-wait-free-hazard-pointers/) have only minimal overhead for readers—much less than a system call.

                        All of this hashing, memory tracking, and shared locking

                        My point is that I think the performance impact is not significant. Hash tables certainly are fast; address translation is a bit irksome (futexes are keyed by physical address, not virtual address, though this is only practically significant for inter-process futexes), but not too bad. I think the only issue that might be worth considering is whether futexes force undue contention that could be avoided with another strategy, and I think I’ve shown that they don’t.

                        1. 1

                          minimal overhead for readers

                          TIL about this method of relying on OS preemption to synchronize with readers.

                          I think the performance impact is not significant.

                          It can be when it comes to designing high perf sync primitives like channels, event counts/listeners, thread pools, and even just mutexes. There, you end up managing thread/task queues yourself for efficient scheduling. A futex per thread would then boil down to a worse version of park/unpark.

                          Hash tables certainly are fast;

                          Yes, but avoiding the hashmap altogether with primitive-specific queues would be faster (and simpler for the kernel to implement too).

                          whether futexes force undue contention that could be avoided with another strategy, and I think I’ve shown that they don’t.

                          It’s been shown that using a hashmap for a futex impl is as viable as fixed-size buckets, as long as thread counts are known and there’s a background thread or regular interval for cleanup. What I don’t think has been shown yet is how futexes are as efficient as thread park/unpark APIs, which can be alternatives.

                          There’s still the intermediary locking of the buckets (I think this could be avoided, but no futex impls do so, or may be able to, as of yet), the lack of control over wake policy (you can’t keep certain threads waiting on the same futex ptr while waking others), and the lack of control over wait verification (generally only integer comparisons or bitwise ops are allowed, and only on the ptr being used to wait).

                          If there’s ways to get around those, that would be ideal.

    3. 1

      I’m not really familiar with futexes, but I know the name comes from “Fast User-space muTEX”. So I’m confused that the paper linked to on this page says there is a “kernel interface defined by this system call…”, because if it makes a system call, why is it Fast or User-space?

      (Also, I guess I don’t need to worry about this, since if I use C++ std::atomic it manages the futexes for me. At least on Darwin?)

      1. 3

        Never mind, I should’ve known better than to ask before checking Wikipedia. The answer is:

        A futex consists of a kernel-space wait queue that is attached to an atomic integer in userspace. Multiple processes or threads operate on the integer entirely in userspace (using atomic operations to avoid interfering with one another), and only resort to relatively expensive[citation needed] system calls to request operations on the wait queue (for example to wake up waiting processes, or to put the current process on the wait queue). A properly programmed futex-based lock will not use system calls except when the lock has contended; since most operations do not require arbitration between processes, this will not happen in most cases.

        So it’s a low-level primitive used as part of a fast mutex. Sounds like the kind of thing almost no one should be calling directly, rather using some well-tested wrapper that adds the user space parts. Such as std::atomic.

        1. 2

          So it’s a low-level primitive used as part of a fast mutex. Sounds like the kind of thing almost no one should be calling directly

          Not entirely. As I explained in the other futex thread, a futex is basically a compare-and-wait primitive. It’s useful in any data structure that’s using atomics directly but wants to sleep in cases of contention. 99% of users will just use a mutex, but there are a lot of possible abstractions that you can build on top of it that you can’t easily do on top of other abstractions. The Linux version is designed to allow a 32-bit word to represent a set of 32 locks and give you a primitive to (efficiently) atomically acquire and release any subset of them, which is something I’ve never wanted to do and have never seen in a library, but which apparently some people found useful.
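
          For the Linux flavour, compare-and-wait is a single call; a minimal sketch:

            #include <stdatomic.h>
            #include <stdint.h>
            #include <linux/futex.h>
            #include <sys/syscall.h>
            #include <unistd.h>

            /* Sleep only while *addr still holds `expected`. The kernel
             * rechecks the value under its own lock, so there is no
             * lost-wakeup window between the check and the sleep. */
            static void wait_while_equal(_Atomic uint32_t *addr, uint32_t expected)
            {
                while (atomic_load(addr) == expected)
                    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                            NULL, NULL, 0);
            }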

          1. 1

            That 32-locks thing sounds incredibly cool, even though I can’t think of a use for it…

            1. 1

              The problem is contention: acquiring (and/or releasing) the locks requires a CAS loop, so if you’re actually using the locks a lot then you may fail to make forward progress. It’s not clear to me in what situation that would be useful. It would have to be something where the ability to acquire multiple locks simultaneously (rather than sequentially with a well-defined total ordering) was important, or something where space was at a premium and contention was rare.

      2. 1

        Also, I guess I don’t need to worry about this since if I use C++ std::atomic it manages the futexes. For me, at least on Darwin?

        Yes and no. On Windows, with the Visual Studio standard library, you can use std::atomic<T>, where T is an 8-, 16-, 32-, or 64-bit integer type, to wrap the Win32 futex-like APIs. On Darwin, you are probably using LLVM’s libc++, which made some incredibly bad ABI decisions on atomics. The only type that will use the platform’s futex-like abstraction is __cxx_contention_t. It’s possible that types of the same size will as well, but I’m not 100% sure. This means that, on Darwin, std::atomic<int64_t> will use the native futex-like APIs, but std::atomic<int32_t> will not (I really hope std::atomic<uint64_t> will, but I’m not 100% sure). Such code won’t then be usefully portable: on Linux, std::atomic<int32_t> uses futex but std::atomic<int64_t> does not. For types other than __cxx_contention_t, libc++ uses a semaphore from a table for the sleep.

        I really hoped to fix this nonsense before FreeBSD 14, but haven’t had time yet.

        Oh, and the C++ spec here is impossible to usefully implement. Anything like this has a choice between having spurious wakes or having ABA problems. All kernel primitives that I’m aware of permit spurious wakes, but the C++ spec does not, so the C++ standard library has to do a check that guarantees that it will suffer from ABA issues. I have no idea why they decided to do the obviously wrong thing here.

        1. 1

          Sigh. Thanks for the reality check. Me, I’ve never actually used any of the blocking methods of std::atomic.

          Are futexes used in the implementation of std::mutex and std::condition_variable? Those I do use.

          1. 2

            Are futexes used in the implementation of std::mutex and std::condition_variable? Those I do use.

            They should be. I believe that, in libc++, both of those delegate to pthreads. Darwin now uses an implementation of those that uses its futex-like calls. There was a slightly embarrassing time around 10.6 when Apple was loudly telling everyone ‘look how amazing the libdispatch semaphores are’, and people pointed out that they weren’t really that fast compared to the Linux / FreeBSD mutex and semaphore implementations; but all of these were much faster than the Mach-based pthread mutexes on OS X, which didn’t have a userspace fast path at all.

            At some point, I want to add implementations of std::mutex and std::condition_variable that use the futex directly so that their fast paths can be inlined. Both have an API to return a native handle; a small amount of code requires that to be a pthread type, but a large amount of code doesn’t care. I wish they’d added a template parameter or something to say ‘a mutex that wraps this host system type’, so that std::host_mutex<pthread_mutex_t> and std::host_mutex<struct umtx> could both exist and the most efficient one could be exposed as std::mutex for people who didn’t care about the underlying type. As it is, I at least want to be able to statically link a version that’s fast.

            The uncontended lock path for a mutex should be 3-4 instructions and it drives me slightly crazy that this ends up being a cross-library function call.
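
            That fast path is essentially a single compare-and-swap; a sketch, with a hypothetical mutex_lock_slow() standing in for the out-of-line futex path:

              #include <stdatomic.h>

              extern void mutex_lock_slow(_Atomic int *m);  /* contended path */

              /* 0 = unlocked, 1 = locked. Inlined, this is the handful of
               * instructions described above; only a failed CAS leaves it. */
              static inline void mutex_lock(_Atomic int *m)
              {
                  int expected = 0;
                  if (!atomic_compare_exchange_strong(m, &expected, 1))
                      mutex_lock_slow(m);
              }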

    4. 1

      How is this different to a condition (edit: monitor) variable?

      1. 1

        Futexes are the thing that the implementation of your condition variable / monitor variable uses to park the thread when the condition isn’t satisfied.

        1. 1

          I see. Condition and monitor vars are already extremely low-level synchronization primitives, and are notoriously difficult to “get right” in application code. A primitive at an even lower level of abstraction doesn’t strike me as something that can be effectively deployed by applications. But, to each their own!

          1. 1

            You absolutely should not be using futex in application code. It is a kernel interface for implementing low-level locking (and a few other) primitives in standard libraries and concurrency libraries. The idea behind a futex is to allow a pure userspace fast path. If you acquire an uncontended mutex or release a mutex with no waiters, this can be a single atomic op that doesn’t use futex at all. When you are contended, futex gives you a way of blocking until a memory word changes or waking threads that have blocked on a specific word.

            As the article describes, you can use it to implement a mutex. You can also use it for counting semaphores or for condition variables. It’s also flexible enough that you can use it for other things that don’t look quite like locks. For example, if you have a fixed-size lockless ring buffer with producer and consumer counters, then you ideally want to sleep in the producer thread when the ring is full and in the consumer thread when it’s empty. You can use a futex wait operation on the consumer and producer counters for this, respectively. This is much simpler than building the same mechanism out of a mutex and a condition variable. That kind of data structure will typically live in a concurrency library, though; you probably don’t want to implement it yourself.
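
            A sketch of that ring buffer’s blocking edge on Linux (single producer, single consumer; the field names are illustrative):

              #include <stdatomic.h>
              #include <stdint.h>
              #include <linux/futex.h>
              #include <sys/syscall.h>
              #include <unistd.h>

              struct ring { _Atomic uint32_t head, tail; /* slots elided */ };

              /* Consumer: sleep while the ring looks empty (tail == head). */
              static void consumer_wait_nonempty(struct ring *r)
              {
                  uint32_t h = atomic_load(&r->head);
                  while (atomic_load(&r->tail) == h)
                      syscall(SYS_futex, &r->tail, FUTEX_WAIT_PRIVATE, h,
                              NULL, NULL, 0);
              }

              /* Producer: publish an item, then wake a sleeping consumer. */
              static void producer_publish(struct ring *r)
              {
                  atomic_fetch_add(&r->tail, 1);
                  syscall(SYS_futex, &r->tail, FUTEX_WAKE_PRIVATE, 1,
                          NULL, NULL, 0);
              }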

          2. 1

            They’re at approximately the same level of abstraction imo, when you’re looking at application design without diving deep into performance, but futex is the one that’s a system call.

            Here’s a concrete example of using it to implement a mutex: https://github.com/rust-lang/rust/blob/9881574/library/std/src/sys/unix/locks/futex_mutex.rs#L25-L61

            Note that lines 56 and 94 are the only ones that actually use the futex syscall - the rest are all atomic operations.

    5. 1

      Linux and OpenBSD both support a “requeue” operation, which helps avoid the thundering-herd problem in condition-variable broadcasts. I think FreeBSD doesn’t, and I don’t know about the other platforms.
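
      On Linux the operation is FUTEX_CMP_REQUEUE; a sketch of how a condition-variable broadcast might use it (the variable names are illustrative):

        #include <limits.h>
        #include <stdint.h>
        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* Wake one waiter and move the rest from the condvar word onto the
         * mutex word, so they queue for the lock instead of stampeding.
         * `seq` is the condvar value the caller last observed. */
        static long cond_broadcast_requeue(uint32_t *cond, uint32_t *mutex,
                                           uint32_t seq)
        {
            return syscall(SYS_futex, cond, FUTEX_CMP_REQUEUE_PRIVATE,
                           1,                              /* wake one     */
                           (void *)(unsigned long)INT_MAX, /* requeue rest */
                           mutex, seq);
        }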

      1. 2

        If I understand this correctly, it doesn’t apply on FreeBSD, because the userspace condition variable is implemented using the umtx condition-variable operations, not the ones that operate on a single word. In particular, this call that blocks on a condition variable takes the mutex as a second argument, so the kernel can avoid waking multiple threads from the broadcast that would then immediately sleep trying to reacquire the mutex. This is a problem on Linux because the kernel doesn’t know about the relationship between the mutex and the condition variable, and libc has to handle them as entirely separate variables.