https://lists.llvm.org/pipermail/llvm-dev/2021-June/151199.html <- link to the actual article
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf <- link to the paper describing the memory model
We’re linking to twitter “threads” now? :(
I was agonizing about what to link here. The interesting thing is the email, but it starts in the middle. The missing “intro paragraph” is this tweet. My fear was that, should I link the email directly, folks would scan just the first screen full of text and bail out, not getting to the “Okay, so let me try to restate and summarize all this” line.
What would have been the better way to share the content in this case?
Twitter content is on topic - or rather, on-topic content on Twitter is not disallowed simply because it’s on Twitter.
You did the right thing here.
I don’t know, but twitter uses so much malware that I can’t even view the site in Firefox. And I don’t have a crazy amount of anti-tracking stuff running.
Soon we’ll be linking to an Instagram post of multiple screenshots of a twitter “thread”.
It really is a shame that so many technology discussions happen on Twitter of all places. I get FOMO about it sometimes but I just can’t be bothered to deal with the dysfunctional UI. I once followed a Columbia CS professor to follow his research, but I had no choice but to also see him posting photos of spoiled food asking “is this avocado with brown spots still safe to eat?”
More explanations of the problem:
I have been following this saga for years, and it is useful to revisit 2015 paper where it all started, as far as I can tell. As far as I know, this is where the idea “ptrtoint has side-effects and can’t be removed even if unused” originated.
Am I the only one who feels like the problem is LLVM is using a flawed abstraction here, and really shouldn’t be trying to dictate how C should work?
Not sure, it seems that the core question is “which optimizations are allowed?”, and that seems independent from any particular backend. The example is quite clear:
int x = 0;
int *p = guess_address_of_x();
*p = 15;
printf(“%d\n”, x); // provably 0?
If we want to be able to optimize x to 0, we need a language-level legalism explaining why is it valid. And “no, such optimizations are not valid” could be a reasonable outcome.
The core problem is that C can’t decide if it’s BCPL or a modern language.
In BCPL, every variable was a word and some words could be either used directly as values or treated as addresses for loads and stores. In C, pointers are a distinct type from integers. C does not actually specify what happens when you cast a pointer to an integer at all in C99 or C11. It specifies that an intptr_t type may exist (implementation defined) and that if it does exist then a pointer cast to an intptr_t and back again must be usable in exactly the same situations as the original pointer. Any other operation on the intptr_t holding a pointer is unspecified (and any cast of a pointer to, for example, a long is also unspecified).
In practice; however, C code expects to cast pointers to intptr_t (or long or size_t in older codebases), do some arithmetic, and cast back. LLVM has to handle this.
Alias analysis is one of the core building blocks for most compiler optimisations. If you can prove that two pointers don’t alias, then operations on them are independent. They can be reordered, stores can be forwarded to loads, and so on. This is a huge win for C over BCPL. C places some restrictions on pointers by having a notion of object-based memory. Every C pointer points to one object. If two pointers point to different objects then alias analysis may assume that no arithmetic on one can ever end up with a pointer to the other. Casts from integers to pointers provide an escape hatch here: LLVM assumes that any inttoptr instruction may point to any object and so is very conservative with alias analysis.
There are some complex C idioms that make this painful. For example, an XOR linked list works by having a single pointer in each node that is the XOR of the prior and next nodes’ addresses. This lets you traverse the list in either direction, like a double linked list, but with a single word of state for the next/previous pointer in each node. Standard C is silent on whether this is actually allowed, but it’s a thing that C compilers generally accept and so it’s part of the de-facto C standard. This should be a tighter restriction than the ‘can-point-to-anything’ assumption: The pointer that is made by XORing two pointers can (in a multi-provenance model, as proposed for C2x) be converted back to either pointer, but not to a pointer to any other object in the system.
This gets a lot more interesting with unusual hardware. CHERI systems, such as Arm’s Morello, enforce a single-provenance semantics. The hardware provides a type that can be used to represent a pointer and enforces guarantees that they:
This means that pointer-to-integer casts are fine, integer-to-pointer casts are not. We represent intptr_t in this system as a pointer type in LLVM IR, so preserve the provenance ‘for free’.