1. 21
  1.  

  2. 3

    Nixpkgs / NixOS has run into this problem as well, with the same strictness of hash checking. We’ve instead developed tools to normalize the results from GitHub, and compare the hash after normalization, not prior. This works for us since the contents of the archive are what we care about, and not the implementation details of the archiving process.
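
    A rough sketch of the general idea (hypothetical, not the actual Nixpkgs tooling): extract the archive, then hash the sorted file paths and contents, so that archive-level details like member order, timestamps, and compression settings no longer affect the result.

    ```python
    import hashlib
    import os
    import tarfile
    import tempfile

    def content_hash(archive_path: str) -> str:
        """Hash an archive's contents, ignoring archive-level metadata."""
        digest = hashlib.sha256()
        with tempfile.TemporaryDirectory() as tmp:
            with tarfile.open(archive_path) as tar:
                tar.extractall(tmp)  # only reasonable for trusted or sandboxed input
            # Walk the extracted tree in a deterministic, sorted order.
            for root, dirs, files in os.walk(tmp):
                dirs.sort()
                for name in sorted(files):
                    path = os.path.join(root, name)
                    digest.update(os.path.relpath(path, tmp).encode())
                    with open(path, "rb") as f:
                        digest.update(f.read())
        return digest.hexdigest()
    ```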

    1. 2

      Does this mean archives are being extracted before they’re verified?

      1. 2

        Yes

        1. 2

          This opens up an attack vector through your tar & compression implementations - a bug in them could lead to code execution via a maliciously crafted archive.

          1. 3

            Indeed. Only in specific cases are the archives extracted prior to verification, and GitHub is one of the few. However: Nix with sandboxing turned on (and everybody should have sandboxing turned on) will extract the contents in a very limited sandbox, with read access to limited paths, and write access to a single directory. The code could execute, but couldn’t do very much to the host itself. Other potential concerns involve access to a limited set of environment variables, and possibly the nscd socket. There is the chance of sandbox escapes and kernel vulnerabilities, yes, but we’ve found this to be an acceptable trade off in the few cases we’ve needed it.

            1. 2

              I applaud sandboxing. However I believe we (as the general community of people packaging software for various OSes) should coordinate on fixing the root issue. Get people to upload release tarballs, stop them from silently moving tags, etc. Many people are just not aware; asking them nicely may be enough to solve it, one by one. You won’t get all of them to switch, but many more will if approached, instead of working around the issue.

              1. 3

                I completely agree, and as a community we advocate for efforts in reproducible builds and good packaging practices. I’m sure we’ve asked people to make good releases in the past. I applaud OpenBSD’s efforts to push here as well.

            2. 2

              I mean, gunzip + untar is a lot less complicated than, say, TLS 1.2 + HTTP2 + gzip, which we pass untrusted data through all the time.

              Perhaps this fear is really an issue with the raw C implementation in gnutar.

              1. 3

                Reducing attack surface anywhere in the chain is valuable.

      2. 1

        We had a similar problem at RhodeCode when generating archives from git/hg repos. In the beginning we got reports about the hashes of generated archive files changing, but we found that the cause was the archiver adding some metadata files with the current date inside. We then fixed that and used logic similar to Mercurial’s to create archives from all files inside a repository at a particular commit. This results in stable archives with the same hash every time.

        You have to use a consistent pointer, i.e. a specific commit, to guarantee that, e.g. http://server.com/repo_name/archive/a9611b8be70e3dfafdd91d3e64c6636945277636.zip

        Our implementation is slower because it requires creating zip files via Python code and iterating over all files in the repo, but combined with archive caches it’s good enough.
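
        Roughly the kind of approach involved (a hypothetical sketch, not RhodeCode’s actual code, and reading from a plain directory rather than from a repository at a specific commit): write zip entries in sorted order with a fixed timestamp, so the same input always produces byte-identical output.

        ```python
        import os
        import zipfile

        # Fixed per-entry timestamp: the earliest a zip file can represent.
        FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)

        def deterministic_zip(src_dir: str, out_path: str) -> None:
            """Zip a directory so identical inputs yield byte-identical output."""
            paths = []
            for root, dirs, files in os.walk(src_dir):
                for name in files:
                    paths.append(os.path.join(root, name))
            with zipfile.ZipFile(out_path, "w") as zf:
                for path in sorted(paths):
                    info = zipfile.ZipInfo(os.path.relpath(path, src_dir),
                                           date_time=FIXED_DATE_TIME)
                    info.compress_type = zipfile.ZIP_DEFLATED
                    with open(path, "rb") as f:
                        zf.writestr(info, f.read())
        ```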

        1. 1

          I’m not fully understanding what issue is being described here. Is it that the archive URLs are unreliable, i.e. the “Source code (zip / tar.gz)” URL?

          1. 2

            The hash of the auto-generated tar files is not stable. I assume the compression level or the tar implementation used to create them changes.

            1. 1

              And what about the zip files?

              1. 3

                Same problem with zip files.

                The OpenBSD ports tree stores checksums of release artifacts to ensure authenticity of code that is being compiled into packages.

                Github’s source code links create a new artifact on demand (using git-archive, I believe). When they upgrade the tooling which creates these artifacts the output for existing artifacts can change, e.g. because the order of paths inside the tarball or zip changes, or compression level settings have changed, etc.

                Which means that trying to verify the authenticity of a github source link download against a known hash is no better than downloading a tarball and comparing its hash against the hash of another distinct tarball created from the same set of input files. Hashes of two distinct tarballs or zip files are not guaranteed to match even if the set of input files used to create them is the same.
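
                To make that concrete, here is a small hypothetical demonstration: two tarballs built from identical file contents, differing only in the stored mtime (member order or compression settings would have the same effect), hash differently.

                ```python
                import hashlib
                import io
                import tarfile

                def make_tar(mtime: int) -> bytes:
                    """Build a one-file tarball, varying only the stored mtime."""
                    buf = io.BytesIO()
                    data = b"hello world\n"
                    with tarfile.open(fileobj=buf, mode="w") as tar:
                        info = tarfile.TarInfo(name="hello.txt")
                        info.size = len(data)
                        info.mtime = mtime  # archive metadata, not file content
                        tar.addfile(info, io.BytesIO(data))
                    return buf.getvalue()

                a = make_tar(mtime=0)
                b = make_tar(mtime=1700000000)
                # Same extracted contents, different archive bytes, different hashes.
                print(hashlib.sha256(a).hexdigest())
                print(hashlib.sha256(b).hexdigest())
                ```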

                1. 1

                  Thank you for the detailed response! I understand the issue now.

                  There are likely tradeoffs from GitHub’s perspective on this issue, which is why they create a new artifact on demand. They maintain a massive number of repositories on their website, so they probably can’t just store all those artifacts for long periods of time as one repository could potentially be gigantic. There are a number of other reasons I can think of off the top of my head.

                  Why not have the checksum run against the file contents rather than the tarball or zip?

                  1. 3

                    Why not have the checksum run against the file contents rather than the tarball or zip?

                    One reason is that this approach couldn’t scale. It would be insane to store and check potentially thousands of checksums for large projects.

                    It is also harder to keep secure because an untrusted archive would need to be unpacked before verification, see https://lobste.rs/s/jdm7vy/github_auto_generated_tarballs_vs#c_4px8id

                    I’d rather turn your argument around and ask why software projects hosted on github have stopped doing releases properly. The answer seems to be that github features a button on the web site and these projects have misunderstood the purpose of this button. Meanwhile, some other projects which understand the issue actively try to steer people away from the generated links by creating marker files in large friendly letters: https://github.com/irssi/irssi/releases

                    I’d rather blame the problem on a UI design flaw on github’s part than blame the best practices that software integrators in the Linux and BSD ecosystems have followed for ages.

                    1. 2

                      Some more specifics on non-reproducible archives: https://reproducible-builds.org/docs/archives/.
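
                      That page covers the usual ingredients; as a hypothetical sketch (made-up file names), applying them with Python’s tarfile and gzip modules might look like this: normalize per-member metadata, then compress with a fixed gzip header timestamp.

                      ```python
                      import gzip
                      import shutil
                      import tarfile

                      def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
                          """Strip the metadata that typically makes tarballs non-reproducible."""
                          info.mtime = 0
                          info.uid = info.gid = 0
                          info.uname = info.gname = ""
                          return info

                      # Write an uncompressed tar with normalized member metadata...
                      with tarfile.open("source.tar", "w") as tar:
                          tar.add("source/", filter=normalize)

                      # ...then gzip it with a fixed header timestamp, since the gzip
                      # container embeds an mtime of its own.
                      with open("source.tar", "rb") as src:
                          with gzip.GzipFile("source.tar.gz", "wb", compresslevel=9, mtime=0) as dst:
                              shutil.copyfileobj(src, dst)
                      ```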

                      Why not have the checksum run against the file contents rather than the tarball or zip?

                      Guix can do something like that. While it’s preferred to make packages out of whatever is considered a “release” by upstream, it is also possible to make a package based directly on source by checking it out of git. Here’s what that looks like.