I don’t think I understand what this means. They seem to have a path that’s depending on some binary somewhere but then builds a chain of C compilers that builds them a working system? I’m not really sure what this buys them. I can bootstrap a FreeBSD system (kernel, userland, and packages) if I have a moderately recent C++ toolchain that is capable of building a modern Clang (which then builds the rest of the system), bmake, and a couple of other tools. The extra steps in bootstrapping the compiler look like more places for malicious code to hide, but maybe each step is verifiable in some way?
The kernel on the host system that’s doing the builds could itself be malicious, so if you’re worried about not being able to trust the compiler, I hope you got the compiler and kernel from different providers.
It means it is possible to have a binary root of trust only 510 bytes in size.
No need to trust any operating system kernel or anything else.
It can all be built from source code alone.
We proved this out with live-bootstrap and builder-hex0.
How does that 510-byte binary write things to files without trusting the kernel that it’s running on?
That 510-byte binary is an operating system kernel written to the MBR of a hard drive or floppy disk and started by the BIOS; it is hardwired to build a 4KB POSIX kernel on power-up.
So there is no kernel to trust. But perhaps you mean the BIOS bootstrap trust problem, which we have not yet solved.
Aha, that’s the bit of the story I was missing. I thought it was a userspace binary that ran on a host kernel.
AIUI what they want is to have the initial binary[0] as simple as possible, i.e. its disassembly is understandable by someone with passing familiarity with ASM. The rest is supposed to be shell and Guile scripts, and starting with those ingredients, TinyCC is built, with a path towards full GCC.
[0] https://github.com/oriansj/bootstrap-seeds/blob/master/POSIX/x86/hex0-seed
It is rather moot if it’s all running on a pre-existing POSIX environment since the kernel could compromise anything. I’m still confused about this project’s aims.
Full bootstrapping needs to begin with a computer with all non-volatile storage devices, including firmware flash chips, fully erased, and everything built from source, including firmware. You would need to begin with manually programming a hex monitor similar to that proposed by this project into something. Probably by building a PCB out of fixed-function logic devices that allows you to manually generate SPI transactions (to program an SPI flash) by flipping a switch on and off to manually input binary. The harder part is probably building Linux without a POSIX environment.
Thanks, that makes sense. I’m not sure what this buys you that simply compiling your bootstrap tools with two different toolchains doesn’t, though. For example, I can build the FreeBSD bootstrap tools with Clang or GCC (on a FreeBSD or Linux host) and then check that the binaries they build are the same. With something like Guix, I’d expect them to be much more able to lean into a package-transparency model, where many people can do the same bootstrap starting from different host platforms, publish the hashes that they get, and see whether anyone is getting a different output.
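As a rough illustration of the hash-pooling idea in the comment above, here is a minimal Python sketch. Everything in it is hypothetical: the function name `find_disagreements`, the builder labels, and the shape of the reports are assumptions made for the example and are not taken from Guix or FreeBSD tooling. It only shows the comparison step: collect the digests each independent builder reports and flag any artifact where they do not all agree.

```python
from collections import defaultdict


def find_disagreements(reports):
    """Return artifacts for which independent builders reported different digests.

    `reports` maps a builder label to a {artifact_name: digest} dict.
    Both the labels and the layout are hypothetical, chosen for illustration.
    """
    # artifact -> digest -> list of builders that reported that digest
    by_artifact = defaultdict(lambda: defaultdict(list))
    for builder, digests in reports.items():
        for artifact, digest in digests.items():
            by_artifact[artifact][digest].append(builder)

    # Keep only artifacts where more than one distinct digest was seen.
    return {
        artifact: {digest: sorted(builders) for digest, builders in seen.items()}
        for artifact, seen in by_artifact.items()
        if len(seen) > 1
    }
```

An empty result means every builder that reported an artifact agreed on its digest; a non-empty result tells you which builders to look at, not which of them is wrong.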
This is at least partially a response to the trusting trust attack.
This Bootstrappable Builds project, together with the Reproducible Builds project, has completed one of the only practical examples of “Diverse Double-Compiling” as described by David A. Wheeler.
https://reproducible-builds.org/news/2019/12/21/reproducible-bootstrap-of-mes-c-compiler/
It is very much a response to the Trusting Trust attack.
Why would you expect two different compiler pipelines to produce identical binaries? That sounds unlikely to be the case.
I would expect them to produce functionally equivalent binaries. This is how GCC’s trusting trust defence works. You first build GCC with the system compiler, then with the newly built GCC, then with that compiler. The second and third binaries are both produced by the same version of GCC, compiled with different compilers, and so should be identical. Clang also tries very hard to produce identical output independent of how the compiler was built (and most of the test suite depends on this property).
All of the tools needed to build FreeBSD are in the source tree and are compiled once with the host compiler during the bootstrap phase. They are then used to build everything that ends up being installed. The result is that the final binary should not depend on the compiler used to build the bootstrap tools.
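The "should be identical" check described above boils down to a byte-for-byte comparison of two build outputs. A minimal Python sketch of that comparison follows; the helper names (`tree_digests`, `differing_files`) and the directory layout are assumptions for illustration, not how GCC's or FreeBSD's build systems actually implement it.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def differing_files(tree_a, tree_b):
    """Relative paths that are missing from one tree or differ in content."""
    a, b = tree_digests(Path(tree_a)), tree_digests(Path(tree_b))
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))
```

Pointing `differing_files` at two output directories (for example, hypothetical `stage2/` and `stage3/` trees, or the results of a Clang-hosted and a GCC-hosted bootstrap) returns an empty list only when every file matches exactly. Note this only works if the builds are reproducible in the first place, e.g. no embedded timestamps or build paths.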
OK, I think you’re saying compile your desired build compiler with two different compilers and then compare the output of the resultant candidates, presumably against unpredictable input, which should be identical. I didn’t quite get that from the initial comment.
Practically, I can see how this is a useful verification method, even if it doesn’t seem to be completely equivalent.
It depends a bit on the threat model. I assume that the kernel is in scope if you’re worried about supply-chain vulnerabilities, because it would be trivial to have a kernel patch that spots something that looks like a compiler, watches for specific patterns in its output, and replaces them with something different. If you are using a *NIX distro as your build environment, precisely the same people have access to introduce trojans into the kernel as into the compiler, so removing the compiler from the TCB doesn’t buy you much. You can kind of work around this if you build in an attested environment (i.e. have a valid secure boot chain to a known-good kernel), but that depends on having a trusted environment, and if you actually trust the environment then you don’t need any of these mitigations. If you assume that an attacker can compromise your supply chain then diversity is a better defense. If I build the bootstrap tools on FreeBSD with Clang and on Ubuntu with GCC then it’s very hard for someone to inject a trojan into both. If I then compare the outputs and I get the same thing then I have a lot of confidence that any malware in my final system image was present in my source tree.
What is this hex0 program that they are talking about? I don’t understand how that is the starting point, could someone expand?
The program is here: https://github.com/oriansj/bootstrap-seeds/blob/master/POSIX/x86/hex0_x86.hex0
It’s a program that reads ASCII hex bytes from one file and outputs their binary form to the second file.
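To make that description concrete, here is a minimal Python sketch of what a hex0-style translator does. It is an approximation, not the actual seed: it assumes that `;` and `#` start comments running to the end of the line and that any character that is not a hex digit is simply skipped. The function name `hex0` and the string-in, bytes-out interface are choices made for this example; the real program reads one file and writes another.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def hex0(text):
    """Translate commented ASCII hex text into the bytes it spells out."""
    out = bytearray()
    high = None          # pending high nibble, if we have seen half a byte
    in_comment = False

    for ch in text:
        if in_comment:
            # Comments run to the end of the line.
            if ch == "\n":
                in_comment = False
            continue
        if ch in "#;":
            in_comment = True
        elif ch in HEX_DIGITS:
            nibble = int(ch, 16)
            if high is None:
                high = nibble
            else:
                out.append((high << 4) | nibble)
                high = None
        # Anything else (whitespace, stray characters) is ignored.

    return bytes(out)
```

For instance, `hex0("7F 45 4C 46 # ELF magic\n")` yields the four bytes `b"\x7fELF"`. The point of the format is that the "source" is just the binary written out in hex with comments, so the translator can be tiny.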
Yeah, I think this is pretty confusing unless you’re already very Guix-savvy; it claims to be fully bootstrapped from source, but then in the middle of the article it says:
“There are still some daunting tasks ahead. For example, what about the Linux kernel?”
So what is it that was bootstrapped if it doesn’t include Linux? Is this a feature that only works for like … Hurd users or something?
They bootstrapped the userspace only, and with the caveat that the bootstrap is driven by Guix itself, which requires a Guile binary much larger than the bootstrap seeds, and there are still many escape hatches used for stuff like GHC.
reading the hex0 thing, it looks like this means that if you are on a Linux system, then you could build all of your packages with this bootstrapped thing, and you … basically just need to show up with an assembler for this hex0 file?
One thing about this is that hex0 calls out to a syscall to open() a file. Ultimately in a bootstrappable system you still likely have some sort of spec around file reading/writing that needs to be conformed to, and likely drivers to do it. There’s no magic to cross the gap of system drivers IMO
Hex0 is a language specification (like brainf#ck but more useful)
No, you don’t even need an assembler.
hex0.hex0 is an example of a self-hosting hex0 implementation.
hex0 can be approximated with: sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_file
There are versions written in C, assembly, and various shells, and as it is only 255 bytes it is something that can be hand-toggled into memory or created directly in several text editors or even via BootOS.
It exists for POSIX, UEFI, DOS, BIOS and bare metal.
I have no existing insight, but it looks like https://bootstrapping.miraheze.org/wiki/Stage0 at least tries to shed some light on this :)