Directly related to this: I have just about wrapped up a patchset introducing the concept of “CacheMode” into the Zig compiler. In summary, it enables the following use case:
The programmer is doing many edit/compile/test cycles quickly, in Debug mode, using the native (no LLVM!) x86 machine code backend. The backend detects which functions are outdated and directly updates their machine code in the output binary, just as this article suggests doing.
However, the machine code relies on function calls to language runtime functions such as memcpy, or perhaps __divtf3 (128-bit floating point division). For this, Zig provides compiler-rt by distributing source code and compiling it lazily for the selected target and options (perhaps you choose -OReleaseSmall rather than -OReleaseFast!).

For producing a compiler-rt object, we want to use full LLVM optimizations. We also do not expect the developer to be regularly editing the source code of that component. So instead of using the fancy incremental mechanism, which reduces the runtime performance of the generated code, to build the compiler-rt functions directly into the compiler-rt object file, there is another kind of cache mode, which is whole rather than incremental.

This way we have the best of both worlds: incremental compilation for the main application, while dependency components that rarely change can be fully optimized with a slow backend such as LLVM and then cached as a unit.

This is all behind-the-scenes machinery that automatically does the best thing by default. From a user’s perspective, they just run zig build and think, “wow, that was fast!”

You can expect to play with this functionality with the release of Zig 0.10.0, which is scheduled for May 2022.
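To make the compiler-rt dependency concrete, here is a minimal C++ sketch (the comment above is about Zig, but the runtime symbols involved are the same; the exact lowering depends on target and flags). Neither function mentions a library, yet on x86-64 with GCC or Clang the generated code typically calls memcpy and __divtf3.

```cpp
// Hypothetical example, not from the comment above: two functions whose machine
// code depends on runtime-library calls the programmer never wrote explicitly.
// __float128 is a GCC/Clang extension on x86-64.

struct Big { char bytes[4096]; };

void copy(Big& dst, const Big& src) {
    dst = src;            // large block copy: usually lowered to a call to memcpy
}

__float128 divide(__float128 a, __float128 b) {
    return a / b;         // 128-bit float division: usually a call to __divtf3
}

int main() {
    Big a{}, b{};
    copy(b, a);
    return divide(10, 4) > 2 ? 0 : 1;   // 2.5 > 2, so this returns 0
}
```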
Good, if brief, summary of the pros and cons. Reminds me of how Zig is working on incremental compilation with binary patching: https://kristoff.it/blog/zig-new-relationship-llvm/
From the perspective of compilers of the Turbo Pascal era, compilers on Unix looked slow: they would call assemblers and linkers as separate processes, whereas Turbo Pascal would create an executable directly. Certainly it is possible to create binaries more directly, but other desirable features like FFI might still depend on doing this in steps.
You will always need a linker to call external functions from system libraries, but you could use this approach internally for almost everything, and then run a linker to link in libc.so and friends. I’m not sure how much link time would be saved, but my intuition is that it would be a lot.
Unless I misunderstood your point (i.e. we do need a dynamic linker at runtime for shared libraries), Go does not use a conventional linker step to link against a symbol in libc or any other shared library (see the //go:cgo_import_dynamic directive), nor is the presence of the target library required (in fact, it is not accessed at all).

Older Windows/Mac IDEs like Turbo, Lightspeed/THINK, and Metrowerks used the same compile-then-link architecture; they just had the compiler and linker built into the app instead of being separate binaries. I definitely recall waiting for THINK C++’s link phase to finish.
Prefer image-based development with thunks for redefinition and JIT to make up the performance difference.
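For readers unfamiliar with the term, here is a minimal C++ sketch of the thunk idea (the names are invented for illustration; image-based systems do this inside the runtime rather than in user code): every call goes through a small writable indirection, so redefining a function just means repointing it.

```cpp
#include <cstdio>

// Two versions of a function, standing in for "before" and "after" a redefinition.
static void greet_v1() { std::puts("hello"); }
static void greet_v2() { std::puts("hello, redefined"); }

// The thunk: a writable slot that every call site goes through.
static void (*greet)() = greet_v1;

int main() {
    greet();           // dispatches to greet_v1
    greet = greet_v2;  // "redefinition": repoint the thunk, nothing is relinked
    greet();           // dispatches to greet_v2
}
```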
The author just described dynamic linking, which we already have, and pretty much works as described.
That was my first reaction, though it’s not quite right. You could create a single program from the compilation unit containing main and emit every .o file as a .so and get something similar, but not quite the same. The main difference is COMDAT handling. In C++, for example, every inline function is emitted into every .o file that references it and the linker then discards duplicates. Old linkers didn’t support this and the original CFront C++ compiler worked differently, reading the linker errors for missing definitions and then regenerating object files to add them as required. With static linking, COMDATs are just discarded (well, all except one copy). With dynamic linking, they are all there and are often turned into copy relocations so that a single canonical copy exists, appended to the main program’s globals, and everything references this.

The problem with this approach is that you’re not avoiding work, you’re just doing it at program launch time instead of at link time. If you run the program twice, you’re doing more work. The author is proposing something of a hybrid, closer to how incremental linking works (MSVC’s LINK.EXE can do this, for example), where you keep the relocations around even after you’ve processed them, so that if you redefine a function you can append it to the binary and then reapply all of the relocations that referenced it. This lets you do it a bit more eagerly than the author proposes (on first use, which is how PLT stubs are resolved). It doesn’t always work if you have fixed layout requirements. Incremental linking adds padding to all globals so that they don’t require relinking if they are replaced with larger ones (this happens a lot with C++ where you add a field to a class that has a global or static instance), but that breaks things like the Objective-C metadata sections that rely on instances of a fixed-layout structure being merged into a particular section.

Binary formats typically require individual sections to be contiguous, so you’d most likely need to either preallocate each section at a large size (which is fine on most filesystems, where holes are more or less free, so you can allocate each section on a 4GiB boundary and just leave the unused space unallocated), or define a new format that puts the different sections in different files (or allows multiple instances of the same section that the loader will map contiguously).
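A minimal C++ illustration of the COMDAT point (the file names are invented; this is a sketch of the mechanism, not of any particular linker): the same inline definition is emitted into every object file that uses it, and the link step is what collapses the duplicates.

```cpp
// widget.h: an inline function defined in a header.
#pragma once
inline int widget_id() { return 42; }

// a.cpp
#include "widget.h"
int a() { return widget_id(); }

// b.cpp
#include "widget.h"
int b() { return widget_id() + 1; }

// Compiling a.cpp and b.cpp separately puts a definition of widget_id() into each
// object file, marked as a COMDAT/weak symbol. A static link keeps one copy and
// discards the rest. If each .o is instead turned into its own .so, every .so keeps
// its copy and the dynamic linker makes all references resolve to one of them at
// load time (and, for data, may use the copy relocations described above).
```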
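And a small POSIX sketch of the “holes are more or less free” claim from the last paragraph (file name and offsets invented): preallocating a section at a huge logical size costs almost nothing on a filesystem with sparse-file support, because unwritten ranges consume no disk blocks.

```cpp
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("sections.img", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    const off_t section_stride = 4LL << 30;       // place each section on a 4 GiB boundary
    ftruncate(fd, 2 * section_stride);            // logical size 8 GiB, no blocks allocated yet

    pwrite(fd, "text...", 7, 0);                  // contents of section 0
    pwrite(fd, "data...", 7, section_stride);     // contents of section 1; the gap stays a hole

    close(fd);
    return 0;
}
```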
Some folks at Sony (I think) are working on a thing that performs caching at multiple levels (clang AST, LLVM IR, object code) so that the communication between different parts of the toolchain isn’t just forward data flow. If you require a C++ template definition a second time, their idea is that clang can just pull in the AST if it needs it for anything (generally, it just needs to query if it exists), then the ThinLTO step can pull in the cached IR if it might be useful for inlining, and if not then everything just gets the cached object code for the function. That’s the direction I’d like to see new toolchains pursue further because it should scale up to doing massive cloud builds, with whole-program analysis, without losing incremental build speed.
There are a lot more subtleties than that - even if you resolve symbols dynamically, you still need to link, and that entails questions like paring down the resulting binary, relocation, etc.
“when A is used for the first time, write a stub for A to the final binary and call that”

I think this could be even simpler. Is there anything that would prevent ELF relocation sections from solving this directly, without stubs?
With the whole approach I think you’d only lose two types of features: the ability to easily strip unused functions (and other LTO things), and the ability to do PGO-style rearranging of code for better cache usage.
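As a concrete reference for what an ELF relocation record actually stores, here is a small sketch using the standard structures from <elf.h> on Linux (the specific entry is made up): each record says “patch this offset, using this symbol, this relocation type, and this addend”, which is exactly the bookkeeping you would keep around to re-patch call sites when a definition is replaced.

```cpp
#include <elf.h>
#include <cstdio>

// Decode one .rela-style relocation entry into its pieces of information.
static void describe(const Elf64_Rela& r) {
    std::printf("patch offset 0x%llx: symbol %llu, type %llu, addend %lld\n",
                (unsigned long long)r.r_offset,
                (unsigned long long)ELF64_R_SYM(r.r_info),
                (unsigned long long)ELF64_R_TYPE(r.r_info),
                (long long)r.r_addend);
}

int main() {
    // A made-up entry roughly like the one a compiler emits for `call A`:
    // "at offset 0x1234, insert the address of symbol #7, PC-relative, addend -4".
    Elf64_Rela r{};
    r.r_offset = 0x1234;
    r.r_info   = ELF64_R_INFO(7, R_X86_64_PLT32);
    r.r_addend = -4;
    describe(r);
    return 0;
}
```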
Some early versions of Xcode had a “zero-link” feature that did basically this.
I don’t remember when it was quietly removed, or why; maybe because it got too expensive to maintain in parallel with the regular linker?