1. 45
  1. 8

    Since the 1.3.0 release, the project started maintaining an alternative build farm at https://bordeaux.guix.gnu.org. It’s independent from the build farm at ci.guix.gnu.org (donated and hosted by the Max Delbrück Center for Molecular Medicine in Berlin, Germany), which has two benefits: it lets us challenge substitutes produced by each system, and it provides redundancy should one of these two build farms go down. Guix is now configured by default to fetch substitutes from any of these two build farms.

    This is really cool. I recently tried building NixOS from bootstrap up without any substituters, and found a few bugs in nixpkgs (mismatched hashes, broken derivations, 404s). Having two independent build farms may have caught some of these, so I think it’s a great step for Guix to take.

    1. 6

      I recently tried building NixOS from bootstrap up without any substituters, and found a few bugs in nixpkgs (mismatched hashes, broken derivations, 404s).

      I have been doing this for years. In my experience, these failures most commonly happen because of the following reasons:

      1. NixOS has a tarball cache, but when you build things yourself, you usually get the tarballs from upstream directly, downloading them from the URL specified in the package derivation. But of course, since some time has passed between when a package was last updated and when you try to download it, these source tarballs may no longer exist or may have been moved to another URL, hence the failures. Or perhaps you are just unlucky and the web server happens to be experiencing an outage when you are trying to download the tarball.

      2. A similar but distinct issue is when upstream has changed a tarball or a release in place (as opposed to releasing a new version), so that it has different contents than it originally had. This causes hash mismatches when downloading the tarball or git repository. These failures can be good because they also prevent tampering and man-in-the-middle attacks, but it takes investigation to find out exactly what happened. That said, I’ve never encountered an actual case of tampering. It has always been either a web server returning unexpected contents (e.g. some error message without the appropriate HTTP error code), which is very uncommon, or upstream modifying releases after they had already been released (on purpose or accidentally), which is more common.

      3. Someone updates a package in nixpkgs but forgets to update the hash [1]. This leads to a mismatch between the hash of the tarball specified in the download URL and the hash specified in the package derivation. The NixOS build farm doesn’t detect this issue because when the hash doesn’t change, it simply reuses the cached tarball that had been previously downloaded (this is done to prevent a continuous mass-redownloading of tarballs).

      4. Test failures. Lots of test failures, in package tests. Almost all of these failures occur in tests that are not deterministic. Especially those that have timeouts as a failure condition.

      Although a tiny minority of tests have caught genuine bugs that either I or the upstream authors actually fixed, most of these test failures end up being ignored by package authors because they are hard to reproduce. For timeout-related failures, upstream authors almost always just suggest increasing the timeout, which is a workaround and doesn’t actually make the tests reliable. Since this is so extremely common, I don’t even bother to increase the timeouts anymore; I just disable the tests for those packages in my private copy of nixpkgs.

      5. There’s also a specific class of build failures related to non-deterministic builds, or more specifically, building packages in parallel (e.g. make -j64). This happens when file A depends on file B, but due to missing dependencies in the Makefiles, file A may get built before file B, so the build sometimes fails because B is missing. This is a race condition in Makefiles (it may also happen with other build tools, but it’s usually Makefiles) which may only cause a build failure in rare circumstances (e.g. when the build machine has a high load). I’ve encountered a bunch of these issues before (even in gcc, just recently), although they are getting rarer due to many-core CPUs being more prevalent. An easy workaround is to change enableParallelBuilding = true to enableParallelBuilding = false in the package derivation.
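
      A toy sketch of the Makefile race above, in case it helps. This is not how make works internally; it just compares the dependencies a Makefile declares with the ones the build really has, and lists the pairs that nothing orders. The target names and both dictionaries are made up for illustration.

      ```python
      def reachable(declared, start):
          """Return every target that `start` transitively depends on, per the declared edges."""
          seen = set()
          stack = [start]
          while stack:
              node = stack.pop()
              for dep in declared.get(node, ()):
                  if dep not in seen:
                      seen.add(dep)
                      stack.append(dep)
          return seen


      def racy_pairs(declared, actual):
          """List (target, dep) pairs the build really needs but the declared edges do not order."""
          pairs = []
          for target, deps in actual.items():
              ordered_before = reachable(declared, target)
              for dep in deps:
                  if dep not in ordered_before:
                      pairs.append((target, dep))
          return sorted(pairs)


      # Hypothetical project: a.o really needs gen.h, but the Makefile forgot to say so.
      # A serial build happens to make gen.h first (it is listed first under app),
      # while a parallel build is free to start a.o before gen.h exists.
      declared = {"app": ["gen.h", "a.o"], "a.o": ["a.c"], "gen.h": ["gen.sh"]}
      actual = {"app": ["gen.h", "a.o"], "a.o": ["a.c", "gen.h"], "gen.h": ["gen.sh"]}

      print(racy_pairs(declared, actual))
      ```

      Here the only pair reported is a.o needing gen.h. Once that edge is declared, the list comes back empty and the build no longer depends on scheduling luck.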

      In summary, I don’t think having another build farm would necessarily decrease the occurrence of the above issues, because they are not issues that get detected when you have a second build farm building packages at mostly the same time as the first one (i.e. right after packages are updated).

      There would likely be a minor increase in detecting failures in non-deterministic tests and package builds, but on the other hand, this also happens with just one build farm (because packages get rebuilt very often) and most of these sporadic failures are simply ignored.

      Although yes, it’s always good to have redundancy, I wouldn’t argue that point :-)

      [1] A few days ago I discussed this issue in NixOS’s security-related Matrix channels and someone suggested a technical improvement in ofborg to detect when a package is being updated but the hash isn’t, whenever someone submits a PR, but I suspect it hasn’t been implemented yet.
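
      For illustration only, a check along the lines of [1] could compare the source attributes before and after a PR and flag any package whose URL changed while its hash did not. This is a rough sketch and not ofborg code; the Source type, the field names, and the sample data are all assumptions.

      ```python
      from typing import NamedTuple


      class Source(NamedTuple):
          url: str
          sha256: str  # placeholder strings below, not real hashes


      def stale_hashes(before, after):
          """Return names of packages whose URL changed while the hash stayed the same."""
          suspicious = []
          for name, new in after.items():
              old = before.get(name)
              if old is None:
                  continue  # newly added package, nothing to compare against
              if old.url != new.url and old.sha256 == new.sha256:
                  suspicious.append(name)
          return sorted(suspicious)


      before = {
          "foo": Source("https://example.org/foo-1.0.tar.gz", "hash-of-foo-1.0"),
          "bar": Source("https://example.org/bar-2.0.tar.gz", "hash-of-bar-2.0"),
      }
      after = {
          # URL bumped, hash forgotten
          "foo": Source("https://example.org/foo-1.1.tar.gz", "hash-of-foo-1.0"),
          # URL and hash both updated
          "bar": Source("https://example.org/bar-2.1.tar.gz", "hash-of-bar-2.1"),
      }

      print(stale_hashes(before, after))
      ```

      With this sample data only foo is flagged. A URL can legitimately change while the contents stay identical (e.g. a mirror move), so a hit is a prompt for review rather than proof of a mistake.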

      Edit: added failure class number 5.

      1. 5

        Yeah, I got one of your case #3, one of your case #2, and a bunch of failures because font packages used a weird hack of fetchzip that silently broke when fetchzip’s interface changed but nobody rebuilt the fixed-output derivations.
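
        A toy model of why this kind of breakage stays hidden, assuming a cache that is keyed only by the expected hash (a simplification of how fixed-output results get reused; ToyStore and both fetcher functions are hypothetical):

        ```python
        import hashlib


        class ToyStore:
            """Toy cache for fixed-output fetches, keyed only by the expected hash."""

            def __init__(self):
                self.by_hash = {}
                self.fetch_count = 0

            def fetch(self, expected_sha256, fetcher):
                # Cache hit: the fetcher is never run, so a broken fetcher goes unnoticed.
                if expected_sha256 in self.by_hash:
                    return self.by_hash[expected_sha256]
                self.fetch_count += 1
                data = fetcher()
                actual = hashlib.sha256(data).hexdigest()
                if actual != expected_sha256:
                    raise ValueError(f"hash mismatch: expected {expected_sha256}, got {actual}")
                self.by_hash[expected_sha256] = data
                return data


        def old_fetcher():
            return b"font files"


        def broken_fetcher():
            # Stands in for a fetcher whose behaviour changed after an interface change.
            return b"something else entirely"


        expected = hashlib.sha256(b"font files").hexdigest()

        farm = ToyStore()
        farm.fetch(expected, old_fetcher)     # first build populates the cache
        farm.fetch(expected, broken_fetcher)  # cache hit, the breakage stays hidden

        fresh = ToyStore()                    # someone building with no cache at all
        try:
            fresh.fetch(expected, broken_fetcher)
            result = "ok"
        except ValueError:
            result = "hash mismatch"

        print(farm.fetch_count, result)
        ```

        The store with a warm cache never runs the broken fetcher, while the store that starts empty hits the mismatch right away, which matches what you see when building without substituters.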

    2. 1

      They nod at it in the post, but for the newcomers just a reminder that Guix releases are ceremonial only and no one is intended to use the release versions (you run guix pull and always run the latest) or read the release documentation.