The OCaml analogy is nice! It has a bytecode VM and a native compiler, both of which link against the same runtime.
FWIW I was also confused by this at first. Here is how I explained it to myself …
Rust and GCC both use a previous version of the compiler to compile the current one. But it’s “your job” to find that binary; it’s not in the repo. This makes bootstrapping inherently a bit flaky.
Zig wants to have the old version of the compiler checked into the repo. On top of that, they don’t want to force devs to work on x86 or ARM, or to commit binaries for every arch. It should be architecture-agnostic.
The simplest possible solution would be to check in an x86-64 binary and provide a script to run it under QEMU. You can literally use x86-64 as the VM.
But that causes some problems around I/O – how does the compiler read its input and write its output? Do you attach a disk in QEMU, or use SSH?
WASI solves that problem. You just write some little stubs to connect the pure WASM VM to the host. That’s what the 4K lines of hand-written C are for.
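For flavor, here’s a rough sketch of what one of those host stubs might look like. The names and details are hypothetical, not Zig’s actual bootstrap code, but the shape is the idea: a WASI call implemented against the guest’s linear memory using ordinary libc I/O.

    /* Hypothetical sketch of a WASI-style host stub, NOT Zig's actual code.
     * The wasm module sees one flat linear memory; the host maps a WASI call
     * like fd_write onto ordinary libc I/O. Assumes a little-endian host and
     * only stdout/stderr, for brevity. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    extern uint8_t *wasm_memory;   /* guest linear memory, set up by the loader */

    /* A WASI ciovec in guest memory is { buf_ptr: u32, buf_len: u32 } = 8 bytes. */
    uint32_t wasi_fd_write(uint32_t fd, uint32_t iovs_ptr, uint32_t iovs_len,
                           uint32_t nwritten_ptr)
    {
        uint32_t total = 0;
        for (uint32_t i = 0; i < iovs_len; i++) {
            uint32_t buf_ptr, buf_len;
            memcpy(&buf_ptr, wasm_memory + iovs_ptr + i * 8 + 0, 4);
            memcpy(&buf_len, wasm_memory + iovs_ptr + i * 8 + 4, 4);
            fwrite(wasm_memory + buf_ptr, 1, buf_len, fd == 2 ? stderr : stdout);
            total += buf_len;
        }
        memcpy(wasm_memory + nwritten_ptr, &total, 4);   /* report bytes written */
        return 0;                                        /* WASI errno: success */
    }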
That explains WASM, but doesn’t quite explain the translation to C part, which is done for performance … Does it really have to be fast? Why not keep it slow and simple, if it’s only for bootstrapping? But I guess my curiosity is satisfied for now
It does seem a little elaborate, but viewed this way, I see the benefit. The 4K lines of C and all the scripting is a cost to pay, but probably more reliable than QEMU.
(Although back in the day I think it was very simple to bring up a (slow) QEMU on any machine – it was extremely portable C. A funny thing is that the defunct Aboriginal Linux project used QEMU to emulate other arches instead of bothering with cross-compiling GCC, which I think is “only for experts”.)
Contributors will have to go through the wasm bootstrap procedure whenever a new language feature is being developed and introduced into the compiler codebase, so it’s a matter of developer experience.
Additionally, the CI also starts by bootstrapping the compiler to then run the tests. Faster bootstrap means faster CI runs and/or fewer machines we have to pay for.
qemu-user should make this transparent. But qemu-user is platform sensitive and a lot more fiddly than a smaller WebAssembly VM.
Yeah that makes sense, I remember QEMU is modular and can emulate the CPU without the kernel. But yeah, it has to know about syscalls and is then less portable.
WASM does seem to make sense for this reason
This is an insanely cool approach to bootstrapping and a great use of WebAssembly. Compiling the WASM binary to C was a smart idea. They also have regression tests for commits breaking the update process for the WASM binary. Dropping the peak RSS from 10gb to 3gb is impressive, and makes me want to try contributing again.
They seem still to translate it to C, so I asked myself why the intermediate WASM is needed at all.
Because we have a “C backend” which produces C code instead of machine code and hooks into the compiler pipeline at a point where all comptime evaluation has already happened. It would require a lot of work (if it’s even possible at all) to change this and make it able to produce C code with macros that mirror comptime code. Targeting WASM lets us generate a de facto platform-independent executable by leveraging existing components (LLVM produces wasm, and we support WASI in the stdlib).
Why not simply integrate an existing lean VM like Lua or LuaJIT to handle comptime calculations?
That sounds more complicated than simply targeting wasm, which llvm already supports. We only care about this problem from the perspective of bootstrapping.
The six step bootstrap process itself looks pretty complicated, but maybe I have just a completely different conception of a bootstrap process.
This is about code reuse. I agree it is kind of an insane way to build the whole system, but Zig is emphatically not building the whole system; in fact, avoiding building the whole system is the entire point. This way Zig reuses LLVM’s target-independent optimizer, LLVM’s WebAssembly target, and WASI support in the standard library (which they want anyway, independent of bootstrapping). Different choices would have been made if there was no code to reuse.
If the post doesn’t click and you still wonder what’s the point of doing this, I made a video more focused on the resulting experience from the perspective of the people who contribute to Zig:
https://youtu.be/MCfD7aIl-_E
My friend said WASM should just be called Universal ASM at this point.
Outstandingly Reusable Go anywhere ASseMbly.
Hmmm… 😂
This is such an amazing use of WASM / WASI, and I can’t stop smiling for some reason.
I find that a lot of so-called novel use cases of WASM tend to be places where a simpler solution would have sufficed. In this case it would have been simpler to use a C version of the “Zig -> C” compiler, instead of a WASM version. They actually compile their WASM version to a C version during the bootstrap but I don’t understand why they don’t just do this ahead of time, removing the need for WASM during the bootstrap process.
They claim it’s because the C version is ~80MB but I suspect that is the size of the C version that is generated directly by the Zig compiler, not the Zig->Wasm->C version which is first compiled from Zig to WASM by LLVM, then from WASM to C by their custom wasm2c tool. That version should be smaller due to LLVM optimizations on the WASM binary first.
Even in that case, why use WASM at all? Why not create a custom LLVM bitcode -> C tool instead of their custom wasm2c tool? WASM remains an unnecessary translation step.
The Zig compiler has a C backend, but it cannot produce platform-independent C code from platform-independent Zig code, since it takes in data that results from semantic analysis (i.e. after comptime is done running). This means that we would have to produce and distribute an ARCH x OS number of versions. Targeting a VM solves this problem since the Zig code only needs to be compiled once, and then the platform-specific glue is provided by the VM itself.
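To illustrate with a made-up example (not actual C-backend output): by the time the C backend runs, target-dependent comptime results are plain constants, so the emitted C necessarily differs per target.

    /* Hypothetical illustration, not real Zig C-backend output.
     * Zig source along the lines of
     *     const n = @sizeOf(usize);   // comptime-known
     * reaches the C backend only after that value is resolved,
     * so the generated C has the answer baked in. */
    #include <stdint.h>

    /* what might be emitted for a 64-bit target: */
    static uint64_t ptr_size(void) { return 8; }

    /* the same Zig source emitted for a 32-bit target would instead contain: */
    /* static uint32_t ptr_size(void) { return 4; } */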
The C file generated by wasm2c is 181 MB on my machine. Your assumption is incorrect.
Just pick an ARCH x OS that is “portable” by convention for the comptime stuff and then compile that into portable C (assuming the bare minimum C specifications). No need to have any platform-specific glue for the I/O functions since you can use stdio (fread(), etc.).
That’s about 2 orders of magnitude larger than the source WASM (2.6MB), which signals that something is clearly broken with wasm2c. Expecting system C compilers to compile this large a C file during the bootstrap process seems error-prone. Bootstrap processes should be bullet-proof. As a ballpark, minified + zipped it should not be more than 10x the source file.
Even if it was just 10x, the difference between 1.8MB and 18MB is pretty significant as far as source control goes. I mean, we’re largely talking about a binary blob that must be updated every time the compiler requires a new feature from the bootstrap compiler.
LLVM bitcode is platform-specific. WebAssembly here is basically used as a platform-independent alternative to LLVM bitcode.
In what specific way is LLVM bitcode “platform-specific”? Remember that all Turing complete languages can be compiled into each other.
For example, it doesn’t abstract the platform’s data types, so the code depends on specific data type sizes and alignment. It also hardcodes specific calling conventions of the target platform, which may not make sense on another platform.
Remember that all machine instructions can be compiled into each other, but that doesn’t make x86 binaries not “platform-specific” (even though Rosetta runs them on ARM, and Wine runs them on another OS).
Pointer size is fixed in LLVM IR, so once you’ve lowered to LLVM you cannot move between systems with different pointer sizes. The size of C types such as long are also fixed in LLVM to known bit widths, so anything that depends on the platform ABI (which is almost always defined in terms of C types) is non-portable. The different back ends have different conventions for lowering calls. For example, on FreeBSD or Darwin on i386 a union of a pointer and an integer that is returned from a function will be lowered to an i32 return, whereas on Linux it will be a void function that takes an i32* as an sret parameter. Anything involving structure layouts will be lowered to a fixed memory layout embedding the target ABI’s rules.
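A tiny C example of the kind of construct being described (the comments just restate the cases above; illustrative, not exhaustive):

    /* Illustrative only: a value type whose lowering differs per target ABI. */
    #include <stdint.h>

    union PtrOrInt {
        void    *p;
        uint32_t i;
    };

    /* Returned by value, so the target ABI decides how: on i386 FreeBSD/Darwin
     * this can be lowered to a plain i32 return, while on i386 Linux it becomes
     * a void function taking a hidden sret pointer. Once the LLVM IR exists,
     * that choice (plus the union's size and alignment) is already baked in. */
    union PtrOrInt make_value(void) {
        union PtrOrInt u;
        u.i = 42;
        return u;
    }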
In a way that it won’t be bit-for-bit identical across architectures and operating systems.
Could you elaborate on that? At what stage of the compilation from LLVM bc to portable C does bit-for-bit equality play a key role?
I guess you could target LLVM 15 x86_64-linux-gnu, write llvm15x8664linuxgnu2c (llvm2c for short), and that’s good enough, except LLVM 15 x86_64-linux-gnu bitcode isn’t intended for reimplementation and is relatively hostile to it, and you’d need to care about what changes in LLVM 16, and at that point I don’t see an advantage over WebAssembly.
You could be a bit more generic than that but yes that’s the general idea. For instance, Linux doesn’t have to play a role.
If both are done ahead of time, before the bootstrap process, then there is no obvious advantage outside of the difference in impedance mismatch between C and WASM versus C and LLVM BC. I’d expect it to be minor though.
If you don’t see advantages I consider LLVM bitcode’s instability a large disadvantage. You definitely need to care about it. LLVM 15 introduced large changes like opaque pointers.
The LLVM bitcode is platform-specific insofar as it calls platform-specific APIs and has platform-specific behaviors for particular LLVM instructions. WASM+WASI is being used here to abstract away all these differences: there is one system interface that the compiler needs to use. This also lets Zig not be locked into LLVM, which is useful for platforms LLVM doesn’t support.
Agree; I also don’t see why they think the generated C cannot be made platform independent and sufficiently small.
LLVM does not have a C backend and Zig’s optimization capabilities are nowhere near LLVM’s capabilities (yet?).
Wasn’t there an LLVM IR to C backend some years ago? EDIT: just found this one: https://github.com/JuliaHubOSS/llvm-cbe
It’s on LLVM 10 when LLVM 15 is the latest.
It’s not a good idea to depend on third party, unmaintained software.
Looks like it still gets regular commits.
It probably can. I think this was just less effort.
Yea, I don’t understand this blog post and the buzz around it. Seems like there’s something simple I’m missing.
Let me try to explain. I apologize in advance for any mistakes.
The way I think about it is they put an old version of the compiler in their repo (as a wasm blob) and use it to run a new version of the compiler, stored as Zig source code. So now you can checkout any commit of the repo and run the compiler, even if you don’t have any version of Zig installed. This is what they are after.
All the complexity is for size and speed optimization.
Instead of interpreting wasm code they turn it into C code that’s fed to the system C compiler (for speed). The old wasm compiler doesn’t have all the different backends built in, just C output (for size). So you need the system C compiler again to turn your new Zig compiler into a binary you can actually run (which also helps speed). Finally, they compile the new compiler’s Zig code again, but with their own code generation (for speed, I suppose).
I hope this helped. Also, apparently filenames such as “zig-wasm2.c” should be read “zig wasm to C” out loud. Perhaps a scheme too cute for its own good.
Ah I see, thank you that helped.
Thanks for your explanation. So basically this is to avoid requiring a Zig compiler to be already installed to compile Zig? Which is that bootstrap 0 concept from what I understand. But it still requires a C compiler in any case, so not quite just a “magnetized needle and a steady hand”.
Really? Zig still looks like it’s in C++ to me. I did open an issue on this a while back though…
(sorry, couldn’t resist).
for great justice!
I like the idea of using wasm for bootstrapping on new systems but this particular approach seems a little over-elaborate. Couldn’t one have the same compiler with a wasm backend and a C backend and use one to bootstrap the other?
On the other hand, it’s not a terrible idea. Have a small portable “kernel” of some kind that is very easy to build and port, and is updated very seldom. Will be interesting to see how it works out in practice though!