FWIW, I believe Nix solves this by hashing the contents of the archive, and not that archive itself. That depends on having some format to turn the contents (multiple files and directories) into a single stream you can hash, and for Nix that’s the NAR format, which is simple enough.
It boils down to repacking the archive in a deterministic format (no timestamps, consistent ordering of files, etc.) and hashing that instead. But if the hash is all you care about, you can just stream the NAR into your hasher, and not actually write anything to disk or keep it in memory. That’s important for dealing with large archives, which I assume is also something Bazel cares about.
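To make the idea concrete, here is a minimal Python sketch of hashing what an archive contains rather than the archive bytes. This is not the real NAR format: the record layout ("file", "dir", "symlink" lines) and the name content_hash are made up for illustration. It also assumes a seekable tar file, because the entries are sorted before hashing.

```python
import hashlib
import io
import tarfile


def content_hash(fileobj) -> str:
    """Hash an archive's contents, ignoring timestamps, entry order and compression.

    Illustrative record layout only; this is NOT Nix's NAR serialisation.
    """
    hasher = hashlib.sha256()
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        # Sort by path so the order of entries inside the archive is irrelevant.
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if member.isdir():
                hasher.update(f"dir {member.name}\n".encode())
            elif member.issym():
                hasher.update(f"symlink {member.name} -> {member.linkname}\n".encode())
            elif member.isfile():
                # Keep only the executable bit; mtime, uid and gid are dropped.
                exec_bit = "x" if member.mode & 0o100 else "-"
                hasher.update(f"file {exec_bit} {member.size} {member.name}\n".encode())
                # Feed file data to the hasher in chunks; nothing is unpacked to disk.
                stream = tar.extractfile(member)
                for chunk in iter(lambda: stream.read(65536), b""):
                    hasher.update(chunk)
    return hasher.hexdigest()
```

With something like this, a tarball that gets regenerated with a different gzip implementation, different timestamps, or a different entry order still produces the same digest, as long as the files themselves are unchanged.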
Unzipping it first risks getting bitten if there’s some security bug in the unzipper.
True. It would be best if Nix stored the content size alongside the hash.
Which would also get broken if the archiver changes.
I mean the unpacked content size. If that changes then you have other problems :)
There are exploits in unarchiving that are more dangerous than filling all the space.
It makes sense for Nix, but this is a weird problem to have with Git repos specifically, since Git already has deterministic hashes of everything stored in it. Big https://xkcd.com/2021/ energy.
Doing the hashing git’s way (i.e. recursively) is also just nice, because it lets you reuse work.
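For anyone who hasn’t looked at how that works, here is a rough Python sketch of Git’s blob and tree object IDs. It assumes the default SHA-1 object format and treats every file as a non-executable regular file (mode 100644); the nested-dict tree representation and the function names are just for illustration.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> bytes:
    """Object ID = SHA-1 over the header "<kind> <size>\\0" followed by the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).digest()


def tree_id(tree: dict) -> bytes:
    """tree maps a name to bytes (a file) or to a nested dict (a directory)."""

    def sort_key(item):
        name, child = item
        # Git orders tree entries by name, comparing directories as if
        # their name ended with "/".
        return name.encode() + (b"/" if isinstance(child, dict) else b"")

    body = b""
    for name, child in sorted(tree.items(), key=sort_key):
        if isinstance(child, dict):
            mode, oid = b"40000", tree_id(child)  # recurse into the subdirectory
        else:
            mode, oid = b"100644", git_object_id("blob", child)
        # Each entry is "<mode> <name>\0" followed by the raw 20-byte child ID.
        body += mode + b" " + name.encode() + b"\0" + oid
    return git_object_id("tree", body)
```

Because a directory’s ID depends only on the IDs of its children, changing one file changes the IDs on the path up to the root and nothing else. Every untouched subtree keeps its ID, so anything cached under that ID can be reused.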
I don’t understand the lack of discussion, by the original author or here, of Bazel’s built-in Git integration ( https://bazel.build/rules/lib/repo/git ), which lets you use Git’s underlying hashing (e.g. of commits) to verify contents instead. I feel that is the canonical way to consume Git repositories, for exactly this reason. We use it and did not experience any build issues when this happened.
There are of course trade-offs against HTTP archives in using it, which is why I say discussion, but stability was a point in the Git integration’s favor, and it is why we chose it.
It’s worth noting that the list of BIG enterprises using Bazel has been growing rapidly in recent years: Apple, Tesla, SpaceX, Twitter, Netflix, Uber, Airbnb, Stripe, etc. So this outage did affect a large chunk of GitHub’s biggest enterprise accounts.
They have every incentive to keep things as reliable as possible for their customers, to reduce churn. If there is a change, it would need to be rolled out slowly, with proper communication months prior (not after, like what happened here).
Counterpoint: Bazel builds were always broken because they assumed an authority that wasn’t there. Matching checksums is a good practice, but you gotta trust the source!
The solution I’d most like to see to this class of problems is for systems to encourage you to set up your own mirrors. And setting one up and using it needs to be easy, cheap and secure.
Requiring that they themselves have a full tree of all the needed tarballs is one thing that distributions like Debian and FreeBSD Ports do that is extremely right.
I’m not sure I understand what you mean. Who is “you” in this scenario? Individual users? People who write software? You mention Debian and FreeBSD, but who else do you have in mind who should set up (and run?) their own mirrors?
Software vendors.