LLVM IR is also a binary format (sometimes called “bitcode”), although we will be working exclusively with its text format (which uses the .ll extension).
I would express this slightly differently. LLVM IR is an abstract representation of an instruction set (an unlimited register machine in SSA form) and there are three common serialisations of it:
An in-memory format expressed as C++ classes.
A binary serialisation (bitcode) used to communicate between different components of the toolchain.
A text serialisation, primarily used for debugging.
The key thing is that these are all concrete representations of the same abstract IR and you can losslessly convert between any of them. Only the bitcode format is backwards compatible.
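For a concrete sense of the text serialisation, here is a minimal sketch of a .ll module (the function and file names are made up for illustration). The llvm-as tool converts text to bitcode and llvm-dis converts bitcode back to text, with nothing lost in the round trip:

    ; add.ll — the text serialisation of a trivial module
    define i32 @add(i32 %a, i32 %b) {
    entry:
      %sum = add i32 %a, %b        ; a typed SSA instruction
      ret i32 %sum
    }
    ; llvm-as add.ll -o add.bc     ; text -> bitcode
    ; llvm-dis add.bc -o add.ll    ; bitcode -> text, losslessly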
LLVM IR is strongly typed
Kind of. LLVM IR instructions are strongly typed, LLVM memory is untyped.
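A minimal sketch of that distinction (hypothetical function, modern opaque-pointer syntax): each instruction carries a type, but the alloca'd memory does not, so the same four bytes can be stored as an i32 and loaded back as a float.

    define float @reinterpret(i32 %x) {
    entry:
      %slot = alloca i32               ; the memory has a size, not a type
      store i32 %x, ptr %slot          ; this store is typed as i32
      %f = load float, ptr %slot       ; this load is typed as float; both are legal
      ret float %f
    }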
This is an important distinction between LLVM IR and an assembly language: some operations are explicitly left undefined to leave room for potential optimizations.
In-bounds GetElementPtr (GEP) instructions are possibly a better example here. They assert that the result of pointer arithmetic points to the same object as the original pointer, and so the compiler is free to assume that stores through the results of two GEPs on different objects don’t interfere. There may not be enough information in the IR to infer that, but there is in the source language for a safe language, and so the source-language compiler informs LLVM that it’s safe to assume this. If the source language is something unsafe like C (or the unsafe subset of Rust) then the programmer, in turn, informs the compiler that it’s safe to make this assumption.
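A rough sketch of what that buys the optimiser (hypothetical function): because both GEPs are inbounds on two distinct allocas, LLVM may assume the two stores cannot interfere and fold the final load to the constant 1.

    define i32 @distinct_objects() {
    entry:
      %x = alloca [4 x i32]
      %y = alloca [4 x i32]
      ; 'inbounds' asserts each result stays within its base object
      %px = getelementptr inbounds [4 x i32], ptr %x, i64 0, i64 1
      %py = getelementptr inbounds [4 x i32], ptr %y, i64 0, i64 1
      store i32 1, ptr %px
      store i32 2, ptr %py
      %v = load i32, ptr %px           ; may be folded to 1: the stores can't alias
      ret i32 %v
    }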
LLVM sometimes calls them registers; in a sense, LLVM IR is assembly for an abstract machine with an infinite number of registers
This is really important because it highlights what a compiler really is: a program that converts from one universal model of computation to another. If you did an undergraduate degree in computer science, you probably encountered unlimited register machines in a theory of computation course. All of this theory is actually useful to working on compilers (and I find it immensely frustrating how few courses tie it in with the practical bits)!
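To make the “unlimited register machine in SSA form” concrete, here is a hypothetical function: every value gets a fresh virtual register, each register is assigned exactly once, and a phi node selects a value according to which predecessor block ran.

    define i32 @abs(i32 %x) {
    entry:
      %isneg = icmp slt i32 %x, 0
      br i1 %isneg, label %neg, label %done
    neg:
      %negx = sub i32 0, %x
      br label %done
    done:
      ; SSA: %r is defined exactly once; phi picks a value per incoming edge
      %r = phi i32 [ %negx, %neg ], [ %x, %entry ]
      ret i32 %r
    }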
These days, SSA is extremely fashionable for optimizing imperative code.
20 years ago SSA was fashionable (Pro64 used it a decade earlier, GCC adopted it around then). Now, it’s one of only two standard ways of expressing an IR (CPS being the other; anything in SSA form can be trivially converted to CPS, so the choice here depends on the source language).
volatile can be combined with atomic operations (e.g. load atomic volatile), although most languages don’t provide access to these (except older C++ versions).
If I remember correctly, volatile is there to support the Java memory model, which is quite different to the C++11 memory model. The LLVM memory model was designed to be able to support both and to support programs written in a mixture of both. If you don’t know what that means, consider yourself lucky and stick to the C++11 memory model.
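As a sketch of the syntax the quoted line refers to (hypothetical function): atomic and volatile are orthogonal qualifiers on the same memory access, and atomic accesses must also carry an ordering and an alignment.

    define i32 @volatile_atomic(ptr %p, i32 %v) {
    entry:
      store atomic volatile i32 %v, ptr %p release, align 4
      %old = load atomic volatile i32, ptr %p acquire, align 4
      ret i32 %old
    }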
I quietly judge LLVM for having instructions named inttoptr when int2ptr just reads so much nicer.
Sure, if you’re a native English speaker. If you’re a native French speaker you mentally read this as ‘int deux ptr’ and wonder WTF the writer is on about.
This is a great intro to LLVM IR.
PicoLisp has been written in raw LLVM IR.