My second guess is that these futexes aren’t implemented efficiently in the standard library yet
Windows has futexes of 1, 2, 4, and 8 bytes. Linux currently only has 4-byte futexes. So I don’t find it surprising that the pointer-sized (8-byte) and 1-byte mutexes performed badly on the latter.
uint64_t old = pointer.load(std::memory_order_relaxed);
while (!pointer.compare_exchange_weak(old, (old & both_bits) | as_int, std::memory_order_relaxed))
This seems rather odd to me. Of course, it was never explicitly stated what guarantees the data structure should provide. But I would expect that, if I take a lock on a pointer, no one can change that pointer while I still hold the lock.
So if we only need two bits for a mutex, can we store four mutexes in a single byte?
That sounds like it would lead to unheard-of rates of false sharing. EDIT: hayley points out, much to my chagrin, that since the mutex only has 3 states, you can actually pack in 5 of them per byte.
I haven’t looked at the futex wrapper for libstdc++, but the libc++ is implemented in a very silly way that make supporting the variable sized APIs on Windows an ABI break. I intend to fix it at some point.
Unfortunately, the C++ standard is also specified in a very silly way. It prohibits spurious wakes, which means that it must be vulnerable to ABA issues. There is no possible way of implementing such a system that is not vulnerable to at least one of these and, since the underlying kernel APIs on all platforms favour spurious wakes you get the worst of both worlds.
On Linux, there are actually futex APIs that let you wake a set of sleepers based on the bits in a futex word. I haven’t seen them anywhere else. They’re useful for situations where you need to be able to atomically acquire and release a set of them.
The thing that this analysis missed is false sharing. Often, mutexes are rounded up to cache line sized (or put in the same cache line as the thing that they’re protecting) so that locking one mutex doesn’t involve contention on another. This is not possible with sub-word mutexes.
Often you don’t need to actually lock access to any combination of members of a table. If you only need to lock access to n rows at once, you can get away with a list of n mutexes, each of which also stores the key that it locks.
Windows has futexes of 1, 2, 4, and 8 bytes. Linux currently only has 4-byte futexes. So I don’t find it surprising that the pointer-sized (8-byte) and 1-byte mutexes performed badly on the latter.
This seems rather odd to me. Of course, it was never explicitly stated what guarantees the data structure should provide. But I would expect that, if I take a lock on a pointer, no one can change that pointer while I still hold the lock.
That sounds like it would lead to unheard-of rates of false sharing. EDIT: hayley points out, much to my chagrin, that since the mutex only has 3 states, you can actually pack in 5 of them per byte.
I haven’t looked at the futex wrapper for libstdc++, but the libc++ is implemented in a very silly way that make supporting the variable sized APIs on Windows an ABI break. I intend to fix it at some point.
Unfortunately, the C++ standard is also specified in a very silly way. It prohibits spurious wakes, which means that it must be vulnerable to ABA issues. There is no possible way of implementing such a system that is not vulnerable to at least one of these and, since the underlying kernel APIs on all platforms favour spurious wakes you get the worst of both worlds.
On Linux, there are actually futex APIs that let you wake a set of sleepers based on the bits in a futex word. I haven’t seen them anywhere else. They’re useful for situations where you need to be able to atomically acquire and release a set of them.
The thing that this analysis missed is false sharing. Often, mutexes are rounded up to cache line sized (or put in the same cache line as the thing that they’re protecting) so that locking one mutex doesn’t involve contention on another. This is not possible with sub-word mutexes.
Often you don’t need to actually lock access to any combination of members of a table. If you only need to lock access to n rows at once, you can get away with a list of n mutexes, each of which also stores the key that it locks.