From 2022, but I don’t think it’s been discussed here before.

I just found this madness in https://github.com/evmar/n2/blob/main/doc/development.md :

See the longer discussion of Unicode in general in the design notes.
Concretely, we currently use Rust String for all paths and file contents, but internally interpret them as bytes (not UTF-8), including sometimes using unsafe to convert.
Based on my superficial understanding of how safety relates to UTF-8 in Rust strings, it’s probably harmless given that we never treat strings as Unicode, but it’s also possible some code outside of our control relies on this. But it does mean there’s a bunch of kind of needless unsafes in the code, and some of them are possibly actually doing something bad.
We could fix this by switching to using a bag of bytes type, like https://crates.io/crates/bstr. But it is pretty invasive. We would need to use that not only for paths but also console output, error messages, etc. And it’s not clear (again, see above design notes discussion) that using bags of bytes is the desired end state, so it’s probably not worth doing.
I think this is totally unacceptable. The project should be considered not-production-ready until they switch to bstr or something similar.

What is the actual problem you’re concerned about?

The immediate effects are:
Some part of the Rust community will feel strongly and vocally about misusing unsafe (this is a serious misuse of unsafe according to the current culture).
There are probably some bugs where invalid input makes n2 crash rather than emit an error. Eyeballing, I think this truncate could cause a crash? It asserts that the truncation point is a UTF-8 char boundary, which is not something n2 enforces (see the sketch below).
It is unlikely, but possible that there’s some sort of really bad bug where untrusted input would cause an RCE or something.
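To illustrate the crash mode mentioned above: String::truncate panics if the cut point does not lie on a char boundary, so byte-offset arithmetic in a bytes-in-String scheme can abort the process. A minimal, hypothetical sketch (not n2’s actual code):

```rust
fn main() {
    // A multi-byte character stands in for arbitrary non-ASCII path bytes.
    let mut s = String::from("aé"); // bytes: 61 C3 A9

    // Bytes-oriented code that computes a cut point from raw byte offsets,
    // without checking char boundaries:
    let cut = 2; // lands inside the two-byte 'é'

    s.truncate(cut); // panics: `cut` is not a char boundary
}
```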
But, more generally, it feels like this whole thing is really just fighting the language? There’s a very specific, intentional dichotomy: UTF-8 maps to String, everything else to Vec<u8>. It’s not about “why don’t you do something different”, but rather about “why do you do this”. The default action here would be to use Vec<u8> / [u8] for everything. I don’t think you need bstr. It will make certain things nicer, but a vector of bytes is a quite capable and convenient type in its own right.
That is, if you want to use the raw-bytes model. There’s an argument that you can’t escape encoding (because, on Windows, the build.ninja will be in UTF-8/ASCII while the paths are always UTF-16, so something will need to encode even in the ASCII case). From this perspective, it makes sense to use String for build.ninja, OsString or Vec<u8> for parts that represent paths/arguments (I think OsString is the morally correct choice here, but its internal representation is insufficiently transparent to do things like path simplification easily, so you might want to roll your own on top of Vec<u8>), and Vec<u8> for everything that might be encoded (like compiler output à la /showIncludes), with a clean layer that decodes Vec<u8> into OsString.
(For completeness, there’s also a school of thought that says that, instead of going out of their way to represent non-UTF-8 inputs, build tooling should just mandate that everything is UTF-8 and error early for everything that’s not, forcing everyone to make their builds non-weird. That is, you’ll still parse the WTF-16 you get from msvc, but you error if it can’t be decoded into valid UTF-8. https://docs.rs/camino/latest/camino/ is a library for that.)
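A minimal sketch of that decode layer under the raw-bytes model, plus the strict camino-school variant (Unix-only for brevity; the function names are hypothetical):

```rust
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt; // Unix-only; Windows needs a UTF-16/WTF-8 story

// Raw bytes as captured from a tool's output (e.g. a /showIncludes-style dep list).
fn decode_path(raw: Vec<u8>) -> OsString {
    OsString::from_vec(raw)
}

// The strict school instead mandates UTF-8 and errors early:
fn decode_path_strict(raw: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(raw)
}
```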
(I know you know this but) when getting into these weeds it’s important to note that UTF-16 doesn’t exist: every system that uses 16-bit code units is actually WTF-16 because they all allow unpaired surrogates.
that there’s some sort of really bad bug where untrusted input would cause an RCE or something
There are a lot of untrusted-input problems where the build file says things like command = rm -rf / and deletes all your files, which is why I was curious about the specific worries in this. All of the responses here have been of the form “it makes people upset” without any of the underlying reasons.

In all, this thread feels very silly to me, especially given that it’s mostly tedious to change and doing so would not provide any practical benefit. (Meanwhile, had it been written in pretty much any other language, it could be making the same mistake and nobody would even notice.)
While it is silly in the context of your particular use-case, I think it is a rational reaction (in direction, not necessarily in degree) in the context of the broader Rust ecosystem.
Your context is that you don’t care too much about UB: while .truncate and .parse are technically UB the way they are used in n2, and can even cause a minor bug in practice (a crash rather than a clean exit with an error), it’s no big deal, as n2 assumes trusted input.

The broader Rust context is that untrusted input is the norm, though. This sort of code in some random HTTP middleware would have been a huge problem. The practical safety of Rust hinges not only on the statically-checkable rules of the safe subset, but also on the community culture of using unsafe correctly. Keeping the language constant, the actual safety of the ecosystem could be much, much lower if the equilibrium point is one where everyone uses unsafe willy-nilly, without thinking about or upholding safety invariants.

As this is a culture question, it is really hard to make one-off exceptions for special cases, and there’s usually push-back both from senior members of the community, who try to deliberately steer the culture, and from less senior members, who I think mostly fall for the us-vs-them thing, with memory safety being a dividing line.
There are Path and PathBuf types in the standard library for this, and it’s not clear why they’ve chosen something more complicated and error-prone.

People don’t write code like this. It will scare away potential contributors. If I were looking for a Rust job and I knew that in my potential employer’s codebase &str is not necessarily UTF-8, I would probably reject that job.
Also, you lose the benefits of Rust’s strong type system. Ideally, types should represent sets of allowed values as accurately as possible. If I were writing something like n2, I would use a different type for each kind of string: one for proper UTF-8, one for arbitrary bytestrings, one for WTF-8, etc.; this would help me write correct code. See also https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/ .
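A hedged sketch of that idea (the types and names here are mine, not n2’s):

```rust
// One newtype per encoding, so the type system tracks what is known.
struct Utf8Text(String);    // guaranteed valid UTF-8
struct ByteString(Vec<u8>); // arbitrary bytes, no encoding claim
struct Wtf8Text(Vec<u8>);   // WTF-8: UTF-8 plus unpaired surrogates

impl ByteString {
    // Parse, don't validate: promotion to the richer type happens once,
    // and the result carries the proof in its type.
    fn into_utf8(self) -> Result<Utf8Text, ByteString> {
        match String::from_utf8(self.0) {
            Ok(s) => Ok(Utf8Text(s)),
            Err(e) => Err(ByteString(e.into_bytes())),
        }
    }
}
```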
(Fun fact: when writing this comment, I thought that putting invalid UTF-8 into str is undefined behavior, and thus the compiler is allowed to do anything. I dug into the Rust reference in an attempt to find proof. And… it turned out that this is not undefined behavior. :) So, yes, today I became slightly smarter. :) Thank you. :) But I still think that nobody should store arbitrary bytestrings in str/String. Yes, this is not immediate undefined behavior, but the standard library docs still warn ( https://doc.rust-lang.org/std/primitive.str.html#invariant ) that a lot of code, both in the standard library and outside of it, assumes that str is valid UTF-8.)

Also, fundamental tools that sit very deep in the dependency chain, such as n2, should not be written in Rust. I love Rust. Today I write everything in Rust. But if I wrote something like n2, I would choose C. (And if I did choose Rust, then I would explicitly scare people away from using the tool for core Linux packages.)

The reason is simple: Rust’s compiler is big and complex, and it is written in Rust itself. This complicates building Rust programs for unusual architectures, even if the Rust compiler has already been ported to that architecture.
Here is a post which proves my point: https://drewdevault.com/2022/01/15/2022-01-15-The-RISC-V-experience.html . The author built Alpine Linux for his personal RISC-V computer, and he had to exclude Firefox because parts of it are written in Rust:

I was able to successfully set up almost all of my standard workstation loadout on it [RISC-V box], with some notable exceptions. Firefox is the most painful omission — bootstrapping Rust is an utter nightmare and no one has managed to do it for Alpine Linux riscv64 yet (despite many attempts and lots of hours wasted), so anything which depends on it does not work. librsvg is problematic for the same reason; I had to patch a number of things to be configured without it.
Then, in a footnote:
Incidentally, my new language [Hare] can be fully bootstrapped on this machine in 272 seconds, including building and running the test suite. For comparison, it takes about 10 seconds on my x86_64 workstation. Building LLVM on this machine, let alone Rust, takes upwards of 12+ hours. You can cross-compile it, but this is difficult and it still takes ages, and it’s so complicated and brittle that you’re going to waste a huge amount of time troubleshooting between every attempt
As far as I understand, the Rust compiler had already been ported to riscv at that point, but building rustc (and everything that depends on it, such as Firefox) still turned out to be too difficult for Drew.

If (let’s imagine this) n2 becomes a widespread tool and many core projects (sitting deep in the dependency chain) start to use it, then building Alpine with Firefox for riscv, etc., becomes even more difficult.
Here is another piece of writing by Drew: https://drewdevault.com/2019/03/25/Rust-is-not-a-good-C-replacement.html . One notable fragment:

C is the most portable programming language. Rust actually has a pretty admirable selection of supported targets for a new language (thanks mostly to LLVM), but it pales in comparison to C, which runs on almost everything. A new CPU architecture or operating system can barely be considered to exist until it has a C compiler. And once it does, it unlocks access to a vast repository of software written in C. Many other programming languages, such as Ruby and Python, are implemented in C and you get those for free too.
Please, make no mistake: I love Rust! I write everything in it. And I disagree with many things in this article of Drew’s (“Rust is not a good C replacement”); I think Rust is a good C replacement. But I still agree with Drew on one thing: Rust currently is not as portable as C. And writing core tools, such as build systems, in Rust is unacceptable, because it creates a lot of problems for system creators, system programmers, etc.

Look at this list: https://bootstrap.debian.net/essential.html . This is the list of transitive build dependencies of the Debian package “bash” (or “coreutils”, or “libc”; the list will be the same no matter where you start). If you port Debian to a new architecture, then you should eventually port all packages in the list, otherwise the port cannot be considered “complete”. Every tool in that list transitively depends on every other! This means that the author of any tool in that list can easily perform a supply chain attack (remember https://en.wikipedia.org/wiki/XZ_Utils_backdoor ) on all Debian packages. I think this list should be kept as small as possible, both to ease porting of Debian to new architectures and to prevent supply chain attacks. Alas, the list grows beyond any control and becomes bigger, not smaller. (Other Linux distros, of course, have similar lists, though they don’t always actually generate them and publish them on some web site.) Unfortunately, the list contains rust, too. This is very unfortunate. I strongly believe that rustc should be removed from that list. And yes, this is possible (if someone makes some heroic efforts). But if your n2 becomes widespread and people start to use it for core Linux/Debian packages, then removing rustc from that list will become harder.

So, again: I think it is totally OK to write n2 in Rust, as long as you add a warning: “Please, don’t use this build system for core Linux packages sitting deep in the dependency chain”.

(And all of this is just advice, of course; I don’t order you to do anything.)
And yes, I think ninja and n2 are both small and nice tools. Thanks for creating them!
I’m not very convinced by this argument. The complicated parts of Rust are all in the frontend and don’t need any changes to port them. If you can port GCC and Clang to your niche architecture you can port Rust too.
The total complexity of the compiler is only important if you’re doing a bootstrap from assembly which is a rare thing to do. Most source distros will start with a binary or cross compiler for C, why not have a Rust one as well? There’s mrustc for the case where you truly want to do a native bootstrap anyway.
I wouldn’t necessarily go to Drew for unbiased takes on Rust.
If the build system uses the compatible subset of n2 and ninja then you can substantially reduce bootstrapping worries by using the C implementation, samurai.
After having used Ninja at my last job, I might adjust the plan in the blog post in the following ways:
Use an mtime-like sequence number rather than raw mtimes or content-hashes.
I was surprised at work when we found that (custom) content-hash computation was a non-trivial time expenditure; however, it’s possible that we used “deep” traces where we could have used “shallow” traces and not had as much of a performance impact (for the reason of concurrent hash computation as mentioned in the article; it would also enable early cut-off).
I suspect you can resolve the listed problematic scenarios with a kind of lexicographical ordering involving mixing in a unique monotonically-increasing build sequence number, but I haven’t verified.
Alternatively, there may be some clever data structure for building and querying a partial ordering efficiently that doesn’t involve literal numbers. In some sense, you’re trying to order unrelated events in a “distributed” system (concurrent build tasks), but using a single global clock may have fundamental issues in which it ascribes “happens-before” relationships and inferences in places where no such relationship actually exists.
Lean much more into the article’s “manifests” and symbolic representations of nodes in the build graph, rather than “files” as the inherent type of node.
It’s good for explainability if you store more data, as mentioned in the article, especially if you do early-cutoff.
It seems useful at the fringes of the build graph where you want to represent side effects rather than real files; I recall a lot of weird behavior/bugs with Ninja’s “phony” rules in particular, which can be used to represent those.
I recall having to contort the build graph uncomfortably to make it fit into the file-per-node model, and that just doing I/O on all the files representing build nodes was a non-trivial cost, even when they didn’t have meaningful content.
In principle, a build rule may be able to optimize incremental builds in a domain-specific way if you give it the old and new manifests for comparison, although that’s an involved optimization.
Lean much more into the article’s “manifests” and symbolic representations of nodes in the build graph, rather than “files” as the inherent type of node.
FWIW, in build2 we have added the notion of a target type. So instead of (using make syntax):
libfoo.a: foo.c
We write:
liba{foo}: c{foo}
Besides solving other problems (like varying extensions for same file types across platforms), this allowed us to have non-mtime-based or even non-file-based targets. Specifically, we have alias (which is similar to the phony targets) and group (which represents multiple targets). You can see the beginning of the target type hierarchy in Targets and Target Types.
While we received (and keep receiving) a lot of flak for the “strange” syntax, looking back, it was one of the best design decisions that we’ve made, IMO.
Representing and operating on the structured data for the build graph nodes feels a lot more natural and extensible than operating primarily on files like make/ninja. I missed having access to Bazel/buck2-style “providers” for arbitrary structured metadata, and being able to process the metadata directly in the build language, rather than having to shell out.
[tangential] An interesting difference between build2 targets and Bazel/buck2-style providers seems to be that the former are inheritance-/subtyping-based, while the latter are more interface-/protocol-based.
Of course, one question is: when does Ninja cease to be an “assembler” rather than a “compiler”, as per its design description?
On one hand, the build metadata doesn’t really ‘need’ to live in the Ninja build graph, rather than the generator.
On the other hand, there are meaningful performance concessions when you have to represent every node as a file (and possibly correctness implications too, when you consider mtime issues, filesystem issues, etc.).
I wonder if there’s a minimally-invasive design change that could lift n2 to operate on more abstract nodes?
Of course, it may turn out that existing Ninja files wouldn’t be able to take meaningful advantage of it without a rewrite.
But maybe it would address certain issues around mtimes or phony rules?
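For instance (a purely hypothetical sketch, not n2’s design), the node type could be widened so that only file nodes ever consult the filesystem:

```rust
use std::path::PathBuf;

// Hypothetical widened node type: files stay the common case, while
// phony-style targets and side effects become first-class symbolic nodes
// whose dirtiness lives only in the build log, never in the filesystem.
enum Node {
    File(PathBuf),    // up-to-date-ness derived from mtime/hash
    Symbolic(String), // up-to-date-ness derived from recorded build state
}
```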
Some additional context for those unfamiliar with Ninja:
Ninja has rules which can be parametrized, but perhaps the main difference is that the data flows in only one direction?
The instantiation of a rule can pass in information (limited to string key-value pairs), but a rule can’t easily ask for information about its inputs.
Since Ninja is intended to be generated, it doesn’t matter from an expressiveness standpoint, because you can arrange for rules to be instantiated with all the information they would need.
IIRC I still found the lack of structure inconvenient when writing/maintaining the generator program.
There was a lot of dumping and parsing important build metadata into files as inputs to build rules, and then the rule implementation would have to call back into some script at runtime to interpret the input blob.
You might need to forward metadata across many different kinds of rules. I think this means that all rules potentially need to know about all other rules, or you somehow otherwise need to account for this in your generator.
It was somewhat painful to implement and reason about the information flow at different evaluation times, although this is just a general compiler problem.
An interesting difference between build2 targets and Bazel/buck2-style providers seems to be that the former are inheritance-/subtyping-based, while the latter are more interface-/protocol-based.
We actually have both: target type inheritance (which is handy for representing fundamental “types” of targets) and less rigid metadata that can be associated with targets using target-specific variables. We use the latter all over the place: rules can use it (e.g., the compile and link rules for C/C++ communicate some information like this), users can use it (for example, to communicate information about threads in CHERIoT builds: declaration and use), etc.
Of course, one question is: when does Ninja cease to be an “assembler” rather than a “compiler”, as per its design description?
Perhaps a more interesting question is whether the “assembler” approach, and the meta-build system/project generator that it implies, is still the right design. Meta-build systems have a fundamental flaw (in short, they partition the build graph into two disconnected sub-graphs which leads to the same problems as “why recursive make is considered harmful”).
Can you say more about how you used ninja with content hashes? I have been thinking about adding them to n2 and I think I have a plan where they will be relatively cheap, but I am curious to read real world experience about it.
I think this portion in “mtime comparison considered harmful” (linked from the original article) held true in practice:

Checksumming every input file before building is very slow. If you’re considering whether to rebuild foo.a, and foo.a depends on *.o, and each .o depends on each .c, then you have to checksum the full content of every .c file every time you consider making an incremental build of foo. In large projects, this could be thousands, or tens of thousands of files, each of which we have to open(), read(), checksum, and close(), possibly over a network filesystem. For small projects this is fine, but for large projects, this sucks a lot.
blaze/bazel come from a world where source files are stored in a virtual filesystem, which happens to have the ability to tell you a precalculated checksum for every source file (except the ones you’ve changed locally). If you only have to checksum your locally-changed files on each build, that’ll be very fast. But you need filesystem support to make this possible, and we can’t assume that everywhere.
You would expect hash computation/comparison to be strictly slower than mtime computation/comparison even in the best case:
In the case of multi-GB artifacts, hash computation could be noticeably slow even on an individual basis.
On some ontological basis, you could argue that big artifacts are most likely to be on a critical path, under the hypothesis that smaller artifacts earlier in the build generally combine to become bigger artifacts later in the build.
If true, doing hash computation feels painful under the above circumstances because 1) it’s on the critical path and 2) it’s for a bigger artifact and linear in said artifact’s size.
It’s extra expensive if you have to use a networked filesystem to read the file contents.
For a large enough project, you might even consider implementing a cache from file metadata to content hash, like mentioned in the “redo: mtime dependencies done right” section of the linked article (see the sketch after this list).
It probably gets messy if you want to cache networked file hashes in the same way. I don’t know what kind of metadata is reliable in networked contexts.
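A minimal sketch of such a metadata-to-hash cache (the names are assumptions; a real one would persist across builds and would likely key on inode/ctime as well):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::time::SystemTime;

// Cheap-to-read metadata used to decide whether a cached hash is still valid.
#[derive(Clone, PartialEq)]
struct Stamp {
    mtime: SystemTime,
    len: u64,
}

#[derive(Default)]
struct HashCache {
    cache: HashMap<String, (Stamp, u64)>,
}

impl HashCache {
    // Re-hash a file only when its metadata stamp changed; otherwise reuse
    // the cached hash without reading the file contents at all.
    fn content_hash(&mut self, path: &str) -> io::Result<u64> {
        let md = fs::metadata(path)?;
        let stamp = Stamp { mtime: md.modified()?, len: md.len() };
        if let Some((old, hash)) = self.cache.get(path) {
            if *old == stamp {
                return Ok(*hash);
            }
        }
        let mut hasher = DefaultHasher::new();
        fs::read(path)?.hash(&mut hasher);
        let hash = hasher.finish();
        self.cache.insert(path.to_string(), (stamp, hash));
        Ok(hash)
    }
}
```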
Also, I see now that they mention a “sequence number” for targets, which is perhaps similar to my original comment of adding a sequence number instead of using mtimes directly.
I suspect that, due to the Ninja architecture, the build and hash computation happened via deep constructive traces, rather than shallow ones:
Therefore, you couldn’t start the build before computing all the hashes. This was already discussed under “Single pass” in the original article.
But, even if you had shallow traces, you would still have to hash large artifacts and many artifacts as per above.
Hash comparison is also fundamentally more expensive than numeric mtime comparison, which eventually starts to add up.
I expect this is more of a data locality issue than a computation issue. Do you store the hashes directly in the core “node” data structure, or somewhere else? If you store them directly, then you can have fewer nodes in the cache; if you store them indirectly, then you have to traverse the indirection when invalidating the graph.
It’s also just higher peak memory usage, which may or may not be substantial in practice.
I missed this response, sorry for the late reply!

My small ideas in this space are:

You can view “hash of file contents” as a kind of output that depends on file mtime/inode/size/etc. and then treat its computation much like other build steps, where it is computed incrementally, parallelized, and cached.
For large files, you could even imagine skipping the actual hashing and falling back on just assuming they never return to a previous state (which only causes overbuilding, not incorrectness). This is why in the n2 post I was thinking about how the hashing approach used there can be pluggable per target.
One last trick is you could borrow the content hashes from git’s index, which maintains its own cache of content hashes and platform-specific tricks to keep that up to date. That only covers files Git tracks though.
Re hash comparison cost itself, note that all you really care about is minimizing the chance that hash(t1)==hash(t2) for each individual file – you’re not comparing hashes of unrelated files – which means the birthday paradox problem you normally worry about with content hashing is different. I think this may mean you can get away with using some subset of a standard hash. A full sha1 is 20 bytes, while the int64 timestamps we use are 8 bytes already, so the data size might be comparable. (Now that I’m typing this, I realize: in particular note that if you’re not doing a full content-addressed storage of file content, you typically only have the last build hanging around, so you only are drawing exactly two entries for the set of your possible birthday paradox.)
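For instance (a hypothetical sketch; the full SHA-1 is assumed to be computed elsewhere), the stored comparison key could be just the first 8 bytes of the hash, matching the storage cost of an int64 timestamp:

```rust
// Truncate a full 20-byte SHA-1 to 8 bytes. Since each comparison is only
// "previous build's hash" vs "current file's hash", a collision requires
// those two specific values to agree, not any pair out of a large set.
fn short_hash(full_sha1: [u8; 20]) -> u64 {
    u64::from_le_bytes(full_sha1[..8].try_into().unwrap())
}
```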
I don’t know if this thread will auto-close eventually; I’m happy to talk more by email (me@waleedkhan.name) or GitHub (@arxanas).
I think conceptualizing the metadata -> hash mapping as an incremental build task is the right way of thinking about it. But there end up being complications like bloated dependency graphs, dynamic dependencies, remote execution, etc. that can justify specializing it, rather than literally using the build system itself.
the hashing approach used there can be pluggable per target.
This sounds technically feasible, but strikes me as wrong for a reason I’m having difficulty articulating. It’s something like: if you trust the metadata as an input to produce the content hash, then you should be able to use the metadata directly in all cases, so why should you ever hash the file contents? But I haven’t worked through it. My reasoning might be predicated on an invalid assumption anyways, like that all nodes should be cacheable in an ideal world.
you could borrow the content hashes from git’s index
I specifically looked at this in the context of incrementalizing certain builds. It didn’t seem too useful in practice:
You’re now serialized on recomputing git status, which can take hundreds of milliseconds in large repos.
The computed hash isn’t raw SHA1, but rather prefixed with the blob header, so you can’t directly interoperate with systems that might use SHA1 as the content hash directly.
Potential race conditions. Modified files are not usually stored in the object database, so it’s possible to read an inconsistent file from disk, and you have to prepare for this eventuality.
Like you mentioned, you can’t hash ignored and/or untracked files in this way.
However, deeper integration with source control to enable builds seems useful overall. You might be interested in my talk, which involves incrementalizing builds without daemonizing the build itself, similarly to Ninja. There’s also discussion of keying certain build artifacts from source control hash near the end (“Distributing persistent state”), which I did while at Meta.
Also, I see now that they mention a “sequence number” for targets, which is perhaps similar to my original comment of adding a sequence number instead of using mtimes directly.
A bit late in the discussion, but do you think you could elaborate on the sequence number idea? Specifically, which problems with mtime you think adding it will help address?
To me it seems a persistent sequence number would only be useful if the build system cannot otherwise remember which targets it has updated during any given run (which would be helpful to know if the resulting mtime hasn’t changed due to, for example, low mtime resolution). But maybe I am missing something?
Specifically, I was referring to the four problematic example scenarios listed in the original article.
Indeed, they’re mostly situations where the build system doesn’t “know about” or “remember” a certain rebuild.
In the case of non-file artifacts or side-effects, a literal mtime isn’t available at all.
In the case of early-exit, you also need to support kind of the opposite of the typical scenario, in which you need to mark a node as logically modified even when the underlying file physically hasn’t been touched.
The point of keeping a weird sequence number at all is to enable a very efficient check for whether a node is “up to date” (see the sketch after this list):
That is, even if that information could be determined by processing a historical build log, querying a database, etc.
Specifically, I was remarking that content-hashing is reliable, but not free in comparison to mtime (integer) comparison.
You could consider it a hypothetical optimization for efficiently querying and updating the underlying partial order structure.
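A hedged sketch of what that check could look like (the names are hypothetical):

```rust
// `changed_seq` is set to the current build's sequence number whenever a
// node is observed to change (by mtime, hash, or having been rebuilt);
// `built_seq` records the sequence number at which the node was last built.
struct Node {
    changed_seq: u64,
    built_seq: u64,
    inputs: Vec<usize>, // indices of input nodes in the graph
}

// A node is up to date iff nothing it depends on changed after it was
// built: one integer comparison per input, with no filesystem access.
fn is_up_to_date(graph: &[Node], id: usize) -> bool {
    graph[id]
        .inputs
        .iter()
        .all(|&i| graph[i].changed_seq <= graph[id].built_seq)
}
```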