It is very annoying that such things are not solved yet. Basically, we still have no proper C and C++ standards: we still don’t have a proper model for parallel execution. And the problems mentioned in the paper cause actual miscompilations on Apple M1 (see https://github.com/llvm/llvm-project/issues/64188 )
It is annoying! And I think this is in part a failure of the approach to standardization that C and C++ chose.
Their standards are written in prose. It’s very dense and technical prose, but it’s still not nearly formal enough to actually prove things about. Surprising interpretations (such as out-of-thin-air reads and their weirder cousin, out-of-never-taken-branch reads) of this prose are still being found.
Some projects that want formal, mechanized proofs about C code, such as seL4, define their own mechanized semantics for C and then prove that the resulting asm implements those semantics.
Yuck. I don’t understand why #64188 is a bug. There’s no ordering requirement on the stores to *p or X so surely the CPU can reorder them as it wishes. (Normally no TSO on ARM!) And I thought acquire/release ops are ordered wrt each other, but not necessarily anything else; I thought you need a memory barrier to impose an ordering on non-atomic and relaxed ops. So if the CPU can reorder surely it’s OK for the compiler to reorder.
And I thought acquire/release ops are ordered wrt each other, but not necessarily anything else; I thought you need a memory barrier to impose an ordering on non-atomic and relaxed ops.
No, that’s not right. You can write a correct mutex that guards a non-atomic critical section using only acquire/release ops. That’s their classic use-case and also how they got their name. Specifically: if you have
and the acquire load on thread 2 reads-from the release store on thread 1, then the non-atomic stuff on Thread 1 happens-before the non-atomic stuff on Thread 2.
(Edit: for mutual exclusion, you’d need not just loads/stores but atomic RMW ops like compare-and-swap. However, the orderings induced on non-atomics work the same way.)
In fact, the opposite is true: By themselves, barriers can never impose ordering on non-atomic ops - if you have a program with only barriers and no atomic ops, the barriers are useless. You need two atomic ops that synchronize with each other to establish an inter-thread-happens-before relationship at some point. Then, barriers sequenced-before or -after those atomics on their respective threads can have an effect.
Applying this to the code from the bug report, we have thread 1 doing
*p = i;
foo(i, a);
where foo does a store-release through a and thread 2 doing
while (a.load(memory_order_acquire) < N);
which eventually reads-from that release store and thus establishes the inter-thread-happens-before relationship.
The non-atomic *p = i; on thread 1 is sequenced-before the release store, which synchronizes-with the acquire load on thread 2, which is sequenced before the relaxed load of p. Therefore, the non-atomic store inter-thread happens before that relaxed load, and the load must read either i or another value stored to *p at a later point in its modification order. Since there is no such other store, it must read i.
Thanks. Amazing how easy it is to forget these details when I haven’t used them for a while!
Yes, and in your defence: the C/C++ fences are quite different from barriers in most instruction sets. The latter can in many cases actually be used to provide synchronization by themselves.
Ack! I get paid the big bux to code in C++, but I got lost really quickly here.
At runtime, T1’s CPU can reorder the store to ptr_to_int_2 in front of the rest of t1
How can that happen? That store copies the value of p, which doesn’t get assigned a value until the call to f() on the 2nd line. This reordering would write an uninitialized value to ptr_to_int_2 … right?
Their wording is a bit unfortunate here. I believe they mean “the CPU reorders the visible effect (as seen by the other thread) to before the rest”. The code may execute perfectly sequentially on the local CPU, but t2 can have a different view due to weak memory.
Specifically, the CPU optimization at work here is called “load buffering”, where loads can be delayed in time until their value is actually needed. It is mind-warping, causality-wrecking and highly surprising, but implemented by some ARMv8 CPUs and permitted by the standard for relaxed atomics.
It works if the new int is also reordered, like:
int* q = ptr_to_int_1.load(relaxed); // maybe this is really slow
// so the cpu runs ahead out of order
// inside f()
int* p = new int;
*p = 123;
*q = 456;
some_extern_func(*p); // this function is small
// so the cpu can keep running past it
// f() returns
ptr_to_int_2.store(p, relaxed); // nothing stops this completing early
The write through *q depends on the value loaded into q, so it can’t run before the load completes, but the relaxed store of p has no such dependency and can complete first.