LLVM will now remove stores to constant memory (since this is a contradiction) under the assumption the code in question must be dead.
I don’t understand why* it would detect and prune stores to constant memory instead of simply refusing to compile them.
This “user made a mistake (invoking undefined behaviour)” => “compiler can do what it wants” behaviour is bizarre to me, as a C and C++ outsider, and it’s interesting to see LLVM also following this mantra. Surely it will force other LLVM-based projects to take C and C++ semantics into consideration when they otherwise wouldn’t need to.
*Edit: Okay, I do understand why from an optimisation view, but it’s a decision that pushes responsibility onto the frontend (you have to get constness right) while simultaneously removing control (you don’t get to decide what happens if you get it wrong).
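For concreteness, the kind of store under discussion looks something like this (a minimal sketch; the names are made up):

    /* A const-qualified object, typically placed in read-only memory. */
    const int table[4] = {1, 2, 3, 4};

    void tweak(void) {
        int *p = (int *)table; /* casting away const */
        p[0] = 42;             /* undefined behavior: a store to constant
                                  memory, which LLVM may now delete as dead */
    }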
It’s an architectural problem with compilers. The front-end sees the program at the syntax level (or a lightly processed version of it), not the actual data flow. So the front-end can catch only the most obvious violations, not violations hidden behind indirection and conditionals.
The optimizer performs more in-depth analysis, and its passes simplify and remove code in ways that uncover more complex cases of UB. But at that point it’s unclear whether the UB came from the program, or was created as a side effect of other transformations, simplifications, and optimization passes.
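A minimal sketch (hypothetical code) of a violation hidden behind indirection, which a syntax-level front-end cannot catch:

    const int limit = 100;

    void set(int *p, int v) { *p = v; } /* innocent in isolation */

    void update(void) {
        set((int *)&limit, 0); /* the front-end sees only a cast; the store
                                  to const memory happens inside set() and
                                  only becomes visible once the optimizer
                                  inlines and tracks where the pointer
                                  points */
    }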
While assignment to const is probably detectable directly, C has a lot of UB that isn’t detectable in any useful way, e.g. a + b is UB if the operation is signed and overflows. The compiler can’t warn about every use of +, but it does handle UB for every such case, and treating it as UB is critical to performance of indexing by int in loops.
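For instance (a minimal sketch), the overflow-is-UB assumption licenses simplifications like this:

    /* Because signed overflow is undefined, a compiler is allowed to fold
       this to `return 1;`: it may assume x + 1 never wraps past INT_MAX. */
    int always_true(int x) {
        return x + 1 > x;
    }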
Hmm, is there really no way for it to feed this kind of information back to the front-end? Otherwise every language using LLVM would inherit UB.
But at that point it’s unclear whether that UB came from the program, or has been created as side-effect
Yikes. Should we expect that to happen often?
With the integer overflow example, I’d appreciate if the compiler optimised loops where the index can be shown to be within bounds (not hard with a regular for loop), and complained or added bounds checking in other cases (where the code probably needs cleaning up anyway). No UB required, I think.
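The easy case meant here might look like this (a sketch), where the bounds are statically known:

    /* i provably stays in [0, 100); neither i++ nor the indexing can
       overflow, so this loop can be optimized without appealing to UB. */
    void clear(int buf[100]) {
        for (int i = 0; i < 100; i++)
            buf[i] = 0;
    }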
Otherwise every language using LLVM would inherit UB.
Yes! And for example, Rust does inherit UB from LLVM (safe Rust tries very hard not to emit any constructs that LLVM could consider UB, but bugs in that area have happened).
In C, void foo(item *arr, int length) { for (int i = 0; i < length; i++) arr[i] = 0; } cannot be proven to be in bounds, or free of integer overflow if sizeof(item) > 1. And that’s a textbook loop example.
Er, there’s no integer overflow there. The compiler had best figure out an offset that works.

arr[i] is reading address arr + sizeof(item) * i, and the address computation can overflow. https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759de5a7

I don’t believe kornel claimed that overflow would occur; they claim that the compiler is not able to prove that it cannot occur.
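To make that concrete, here is a rough sketch (hand-written, with a made-up item type; the arr[i] = 0 above adapted for the struct) of the strength reduction the optimizer wants to perform on that loop. Treating wraparound in the index arithmetic as impossible is what makes the two forms equivalent:

    typedef struct { char bytes[8]; } item; /* made-up stand-in type */

    void foo(item *arr, int length) {
        for (int i = 0; i < length; i++)
            arr[i] = (item){{0}};
    }

    /* Roughly what the optimizer would like to emit instead: one pointer
       increment per iteration, with no per-iteration multiply by
       sizeof(item) and no re-extension of i to pointer width. */
    void foo_reduced(item *arr, int length) {
        if (length <= 0) return; /* guard the pointer arithmetic */
        for (item *p = arr, *end = arr + length; p != end; p++)
            *p = (item){{0}};
    }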
This is interesting, in a RISKS Digest sort of way:

LLVM will now remove stores to constant memory (since this is a contradiction) under the assumption the code in question must be dead. This has proven to be problematic for some C/C++ code bases which expect to be able to cast away ‘const’. This is (and has always been) undefined behavior, but up until now had not been actively utilized for optimization purposes in this exact way. For more information, please see: bug 42763 and post commit discussion.

The post-commit discussion links to code in BusyBox’s ash which would fail if built by a clang based on this llvm version. I suppose the RISK is a build breaking with a newer clang, people cursing and going back to the older clang instead of fixing their code, and missing out on improved bug reporting or similar.

Interesting indeed. To be honest, I’m not totally sure of the value of an optimization like that. Is there any case that would actually improve performance and not just break things?
Less code → room for other code in L1i cache
But why would any correct code be writing to const memory? It doesn’t make sense to optimize incorrect code anyway.