It occurs to me: one of the bottlenecks the author points out is that the CPU has to acquire locks on cache lines in order to do memory operations. It seems like a smartly designed CPU could expose this and obviate the whole spinlock implementation by just letting you use the cache line lock directly.
It’s possible I’m missing something obvious, with this being somewhat lower level than my areas of expertise, but this seems like a symptom of modern CPUs trying to expose the interface of a fast PDP-7. Ah, legacy.

Intel has been working on that: Transactional Synchronization Extensions (TSX) (https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions). So far it has been plagued by hardware bugs.
I’m pretty sure that the ttas_lock implementation is wrong. The relaxed memory atomic store does not establish a happens-before relationship with any other thread and so unless something else does then there is no requirement that the unlock will ever be observed if it occurs in another thread. The only requirement that a relaxed load provides is that there will be no tearing: if there is another store then the load is guaranteed to see either the ‘before’ or ‘after’ version, but not any intermediate value. It is completely fine for it to always see the before value. A compiler would be permitted to optimise this to a single load and then an infinite loop.
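For concreteness, a minimal sketch of the loop in question and the transformation being claimed (the locked flag name is mine, not from the article):

    #include <atomic>

    std::atomic<bool> locked{false};

    void spin_as_written() {
        while (locked.load(std::memory_order_relaxed)) { }
    }

    // Under the reading above, the compiler may hoist the load and produce
    // the equivalent of:
    void spin_after_permitted_transformation() {
        if (locked.load(std::memory_order_relaxed)) {
            for (;;) { }  // the unlocking store is never observed
        }
    }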
The pause instruction is not particularly related to hyperthreading (though it does impact hyperthreading), it is to do with speculative execution. Typically, if the lock is held, the branch predictor will predict that the second time around the loop, the loop won’t exit. It will then make the same prediction for all subsequent iterations. You end up filling the pipeline with loads and have to flush them all when you exit the loop. The pause instruction tells the CPU to (try to) not speculatively execute the load until it sees a store to that cache line from another core.
Note that the above is true for Arm and x86. The RISC-V pause instruction, which is being rushed through the standards process, specifies that the HART should stop issuing instructions for an implementation-defined number of cycles and so is useless for spinlocks.
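For reference, a sketch of where the hint sits in a spin-wait loop on x86, using the _mm_pause intrinsic from <immintrin.h> (on Arm the analogous hint is yield):

    #include <atomic>
    #include <immintrin.h>  // _mm_pause, x86 only

    void spin_until_free(std::atomic<bool>& locked) {
        while (locked.load(std::memory_order_relaxed)) {
            _mm_pause();  // hint: don't speculatively issue further loads
        }
    }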
It is completely fine for it to always see the before value. A compiler would be permitted to optimise this to a single load and then an infinite loop.
It’s fundamental that atomic loads and stores can never be elided in the C++11 memory model: “Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.” http://eel.is/c++draft/atomics.order#11
The pause instruction is not particularly related to hyperthreading (though it does impact hyperthreading), it is to do with speculative execution.
This might have been true in the past; sometimes even Intel’s manuals are wrong. The pause instruction is related to hyperthreading in that it prevents a spinlock from stuffing the pipeline full of loads and starving the other hyperthread of load/store resources. If you don’t have HT enabled, a spinlock without the pause instruction will have lower latency in my benchmarks, so avoiding the pipeline flush is of no advantage in that case.
As a disclaimer I have to state that user-space spinlocks are only useful for specific applications with a thread-per-core architecture.
It’s fundamental that atomic loads and stores can never be elided in the C++11 memory model
I’d refer you to Peter Sewell and his group’s work (particularly Mark Batty’s CPPMem) in this space. There are cases where the compiler may elide atomics. Two sequential stores to the same address with no operations in the middle that establish a happens-before relationship may be elided because the visible outcome from other threads is the same. Other threads must see either the before value from both stores, the after value, or the one in the middle, but there is no requirement that any of them ever see the middle value. As such, the middle value can be elided.
_Atomic is not the same as volatile. An atomic operation has a well-defined memory model and the compiler is free to elide anything that does not violate that memory model. A volatile store may not be elided because it is assumed to be a write to a memory-mapped I/O space that is explicitly outside the C/C++ abstract machine.
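A short sketch of both behaviours (variable names are illustrative):

    #include <atomic>

    std::atomic<int> a{0};
    volatile int v = 0;

    void example() {
        // No operation between these stores establishes happens-before, so no
        // other thread is ever guaranteed to observe the value 1; the first
        // store may legally be elided.
        a.store(1, std::memory_order_relaxed);
        a.store(2, std::memory_order_relaxed);

        // Both volatile stores must be emitted: volatile accesses are treated
        // as observable behaviour, outside the optimiser's reach.
        v = 1;
        v = 2;
    }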
Most compilers don’t perform these optimisations, but the model allows them. Some Arm implementations are a lot more aggressive in their use of the weak memory model and really won’t make these things visible until you execute some kind of barrier instruction. In these cases, you’ll probably spin until the next context switch (and then get the notification because the context switch code includes some memory barriers).
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time
And yet the spec doesn’t specify what ‘reasonable’ means. This can be a surprisingly long time even with a naïve lowering in the compiler on a single socket machine. If you consider distributed shared memory systems, reasonable may be 10 seconds. Note also the ‘should’: This is a recommendation to implementers, it is not required by the model.
If you don’t have HT enabled, a spinlock without the pause instruction will have lower latency in my benchmarks, so avoiding the pipeline flush is of no advantage in that case.
New Intel CPUs may have special microcode to detect spinlocks, so this may no longer be the case:
You are correct if you consider only very local properties. If the pipeline is packed then the CPU remains in a high-power state. This means that you’re more likely to enter a thermal throttling state. If you’re telling the CPU explicitly to pause then you’re dissipating less power as heat while waiting and so more likely to be able to run at full frequency when you resume.
Two sequential stores to the same address with no operations in the middle that establish a happens-before relationship may be elided because the visible outcome from other threads is the same.
Fair enough, but it doesn’t change the fact that the last stored value needs to be committed to memory at some point.
And yet the spec doesn’t specify what ‘reasonable’ means. This can be a surprisingly long time even with a naïve lowering in the compiler on a single socket machine. If you consider distributed shared memory systems, reasonable may be 10 seconds. Note also the ‘should’: This is a recommendation to implementers, it is not required by the model.
The time doesn’t matter; it only means that .load(std::m_o_r) must eventually observe the store. If you interpret this ‘should’ as ‘may’, the memory model becomes useless, since different threads would never see each other’s memory updates. It seems to me you are deflecting instead of arguing in good faith. I do think the memory model could be worded less ambiguously in the spec.
You are correct if you consider only very local properties. If the pipeline is packed then the CPU remains in a high-power state. This means that you’re more likely to enter a thermal throttling state. If you’re telling the CPU explicitly to pause then you’re dissipating less power as heat while waiting and so more likely to be able to run at full frequency when you resume.
Yeah, that’s exactly why the latency is worse when using pause: wake-up from low-power states. In my application we set the SSE, AVX, and Turbo frequency multipliers to the same values and run our servers at a constant overclock to avoid frequency-transition jitter. You always need to measure and tune for your application. And don’t always trust what the Intel manual says.
The time doesn’t matter; it only means that .load(std::m_o_r) must eventually observe the store. If you interpret this ‘should’ as ‘may’, the memory model becomes useless, since different threads would never see each other’s memory updates. It seems to me you are deflecting instead of arguing in good faith. I do think the memory model could be worded less ambiguously in the spec.
The memory model is not ambiguous in this regard: A relaxed load or store does not establish a happens-before relationship between the thread doing this operation and any other thread. Without a happens-before relationship, because the C/C++ abstract machine has no intrinsic notion of time for operations, there is no guarantee of forward progress.
A spinlock is intrinsically about establishing a happens-before relationship between two or more threads: The things that happen in one thread with the lock held must be sequenced with respect to things that happen in any other thread with the lock held. The formal models of the spec that have been blessed by WG21 are explicit about this: you cannot use a relaxed-consistency load for any kind of lock unless you add an explicit thread fence as well.
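A sketch of what that looks like, assuming the usual TTAS shape: the atomic operations stay relaxed and an explicit fence supplies the acquire ordering.

    #include <atomic>

    std::atomic<bool> locked{false};

    void lock() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_relaxed)) {
                // Explicit fence upgrades the relaxed RMW to acquire semantics.
                std::atomic_thread_fence(std::memory_order_acquire);
                return;
            }
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
        }
    }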
The fence-free version may work on a given architecture / compiler. It will almost certainly work on a TSO architecture such as x86, but that doesn’t mean that it’s correct with respect to the C++11 memory model. On x86, the same operation without any atomics will also work: if you transform the load from an atomic load into a non-atomic load with a signal fence in the loop, then it will also work correctly on x86. The signal fence prevents the compiler from eliding (or reordering) the load, because the variable may be modified in a signal handler, and the fact that the underlying hardware is TSO means that your non-atomic load is establishing a happens-before relationship at the ISA level and so must see other loads and stores. On x86, acquire and relaxed memory order loads compile to exactly the same instruction; on other architectures this is not the case (for example, on Arm you will see ldar vs ldr, and within the Arm memory model, if a core doesn’t do any barriers or acquire / release operations then there is no guarantee that it will ever observe memory modifications from another core).
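A sketch of that (deliberately non-portable) transformation:

    #include <atomic>

    bool locked_raw;  // written by the unlocking thread: a data race in ISO
                      // C++, tolerable here only because x86 is TSO

    void spin_x86_only() {
        while (locked_raw) {
            // Compiler barrier only, no CPU fence: stops the compiler from
            // hoisting the load out of the loop.
            std::atomic_signal_fence(std::memory_order_seq_cst);
        }
    }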
The memory model is not ambiguous in this regard: A relaxed load or store does not establish a happens-before relationship between the thread doing this operation and any other thread. Without a happens-before relationship, because the C/C++ abstract machine has no intrinsic notion of time for operations, there is no guarantee of forward progress.
This is a complete digression. My example TTAS implementation uses .exchange(m_o_acquire) and .store(m_o_release) to establish a happens-before relationship; that’s not ambiguous at all. Your original claim was that .load(m_o_relaxed) may never see any store. That’s wrong according to the spec (http://eel.is/c++draft/atomics.order#11): stores must be made visible to loads eventually, otherwise your shared memory machine is pretty crappy. Furthermore, I worked through and applied the definitions at http://eel.is/c++draft/intro.multithread to the TTAS lock and confirmed that the relaxed load cannot be reordered with respect to the .exchange(m_o_acquire) and .store(m_o_release).
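For reference, a sketch matching that description:

    #include <atomic>

    struct ttas_lock {
        std::atomic<bool> locked{false};

        void lock() {
            for (;;) {
                if (!locked.exchange(true, std::memory_order_acquire))
                    return;
                // Wait read-only on a relaxed load until the lock looks free.
                while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            }
        }

        void unlock() {
            locked.store(false, std::memory_order_release);
        }
    };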
31.4.11 is definitely ambiguous, since we interpret it differently. Anyway, current compilers all follow my interpretation, so in practice this works for now…
What’s interesting is that to implement a ticket lock efficiently on amd64 you want to use a 32-bit fetch_add that acquires a ticket and grabs the currently-served ticket as a single op (the fast path for the uncontended case), but a 16-bit store to release the lock. This way we can use a weaker non-RMW op to release the lock. I don’t think this is possible under the C++11 memory model. Even with atomic_ref you cannot mix operations of different sizes to the same address.
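A sketch of that layout, assuming little-endian amd64; the union-based mixed-size access is exactly the part ISO C++ doesn’t sanction:

    #include <atomic>
    #include <cstdint>

    struct ticket_lock {
        union {
            std::atomic<std::uint32_t> whole;  // [ next : 16 | owner : 16 ]
            struct {
                std::atomic<std::uint16_t> owner;  // low half: now serving
                std::atomic<std::uint16_t> next;   // high half: next ticket
            } half;  // reading through the inactive member is UB in ISO C++
        };
        ticket_lock() : whole(0) {}

        void lock() {
            // One 32-bit RMW takes a ticket (high half) and reads the ticket
            // currently being served (low half); overflow of the high half
            // carries harmlessly out of the 32-bit word.
            std::uint32_t t = whole.fetch_add(1u << 16, std::memory_order_acquire);
            std::uint16_t my = static_cast<std::uint16_t>(t >> 16);
            if (static_cast<std::uint16_t>(t) == my)
                return;  // uncontended fast path
            while (half.owner.load(std::memory_order_acquire) != my) { /* spin */ }
        }

        void unlock() {
            // The weaker non-RMW release: a plain 16-bit store to the low half.
            // Only the lock holder writes owner, so load-then-store is safe.
            half.owner.store(half.owner.load(std::memory_order_relaxed) + 1,
                             std::memory_order_release);
        }
    };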
The relaxed memory atomic store does not establish a happens-before relationship with any other thread and so unless something else does then there is no requirement that the unlock will ever be observed if it occurs in another thread.
Hm, regardless of whether the conclusion is true, I don’t think this specific logic works. It seems like it would also prove that the TAS lock lacks liveness.
You only get happens-before when you actually read a value written by another thread. You can’t use release/acquire on the atomic itself to guarantee that you’ll eventually read a new value — that would be circular. I think something else in the C memory model guarantees forward progress in such cases. Would love to know what that something else is though!
Yes: http://eel.is/c++draft/atomics.order#11

Acquire/release, and the ordering of non-atomic loads/stores with respect to them, is defined for the abstract C++ machine. If you read the spec you’ll see that it implies that a compiler needs to emit loads, stores, and barriers in a specific order depending on the CPU architecture.
I liked the article; I’d never thought before about read-only being faster even though it’s more racy. The reddit thread is also pretty interesting.