1. 48
  1. 22

    It’s sad that most package tools (not just for javascript) are currently stuck in an artificial design dichotomy: either let the library maintainer run arbitrary code at build time with the full privileges of the user, or provide no build-time programmability at all. Most tools seem to have chosen the former, which is a security problem. A notable exception is Go, which does not provide any kind of hook to run code when the library is fetched & built by a dependent. But as a consequence, it’s become normal for people to just commit generated code to the source repository, which is pretty gross.

    I say it’s sad because there’s no reason you really need to choose; javascript in particular is already designed to be sandboxed, and everyone runs tons of untrusted, unaudited, dodgy, arbitrary javascript on their machines every single day via their browser. Just run the scripts without access to the usual OS APIs, and instead give them the ability to put stuff in the build directory and nothing else: no other filesystem access, and no network.
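
    A minimal sketch of what that restricted interface could look like, in TypeScript. Everything here is hypothetical (the BuildContext type and emitFile name are not any existing package manager’s API); the point is just that the script only ever sees a narrow handle onto its own build directory:

    ```ts
    // Hypothetical restricted API: the package manager, not the package,
    // holds the real filesystem handle and scopes writes to the build dir.
    interface BuildContext {
      // Write a file somewhere under this package's build directory.
      // No other filesystem, network, or process access is exposed.
      emitFile(relativePath: string, contents: string): void;
    }

    // What a package's build script would look like under this model.
    export default function build(ctx: BuildContext): void {
      const source = `export const BUILT_AT = ${JSON.stringify(new Date().toISOString())};\n`;
      ctx.emitFile("generated/build-info.js", source);
    }
    ```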

    1. 4

      What’s wrong with committing generated code?

      1. 11

        I think it’s generally unfortunate to have a bunch of files in a repository that are effectively the result of a build step. Usually you’re generating code from some input source form that represents the truth; e.g., an OpenAPI document or a schema description or something like that.

        When we compile C code we don’t check the result of the preprocessor phase into the repository. It’s ephemeral and discarded. Generated code is not really any different: it’s a dependent artefact that doesn’t itself represent truth, just additional noise in the repository. Like “vendoring”, it makes it harder to see in a diff what’s actually being changed, and it increases the size of the repository – often by quite a lot.

        Sometimes, especially with pedestrian build tools, it’s the only option you have. But that doesn’t make it a great option.

        1. 10

          Yes, and in addition, if those files are not part of a proper build process, it’s easy to get into a situation where they’re actually hard for folks to reproduce. Usually when I get stuck committing generated code, I’ll add a step to CI that deletes & rebuilds it, then does a git diff --exit-code to make sure what’s committed is actually what gets built. But it’s a hazard.
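
          As a concrete sketch (assuming a Node toolchain; “npm run generate” stands in for whatever actually produces the files), the CI step can be as small as:

          ```ts
          // ci-check-generated.ts: regenerate committed artefacts and fail the
          // build if they no longer match what's in the repository.
          import { execSync } from "node:child_process";

          // Placeholder: replace with the project's real generation command.
          execSync("npm run generate", { stdio: "inherit" });

          try {
            execSync("git diff --exit-code", { stdio: "inherit" });
          } catch {
            console.error("Committed generated code is out of date; re-run the generator and commit the result.");
            process.exit(1);
          }
          ```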

          1. 3

            In addition to that, it also takes us back to downloading executables and running them, without any chance to verify that they were generated from the actual source code you thought you were getting.

          2. 1

            Usually you’re generating code from some input source form that represents the truth; e.g., an OpenAPI document or a schema description or something like that.

            Right, but if you’re not checking in the generated code, and are generating it at checkout/build time instead, you’re back to the original problem: you’ve now escaped the source code of the project itself. This isn’t a problem for generated code that is derived from other non-code artifacts in the source repository, such as checking in a JSON file and then generating code off of that JSON file, since that’s still part of the source code. That seems very OK and non-problematic.

            But once you’re doing external I/O as part of the generation step, haven’t you escaped your build/compile sandbox? Even if you can’t execute arbitrary code, accepting arbitrary data may have the same effect and lead to similar dangers of non-reproducibility.

            That doesn’t really seem like a particularly worthwhile tradeoff to me, especially when the problem at hand is “it’s generally unfortunate to have a bunch of files”. That seems like a purely aesthetic argument, but we view our code through tools; we don’t look at the electrons with our eyeballs, so it’s only a question of how the code is viewed. I don’t think you’re making the argument that the problem is that storing the generated files is prohibitively costly from a resource standpoint, but more that you just don’t like looking at them. That’s not to trivialize that point; I don’t like looking at them either.

            Probably a better middle ground is simply to check in the generated code, but for our tooling to do a better job of hiding generated code from our view when working, so that we don’t have to mentally distinguish between written and generated code ourselves.

            1. 1

              But once you’re doing external I/O as part of the generation step, haven’t you escaped your build/compile sandbox? Even if you can’t execute arbitrary code, accepting arbitrary data may have the same effect and lead to similar dangers of non-reproducibility.

              I think it depends on how you do that I/O, and what you do with the result. For instance, it seems alright for cargo to collect dependencies from the crate repository. Your Cargo.lock file has hashes of everything that was obtained last time, so either it gets the same dependency files back or the build fails. Unless you’re going to check the build tools into the repository as well (and bootstrap them yourself during your build), you’re already depending on things that aren’t strictly in your source code today.
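
              As a simplified sketch of how that kind of pinning works (illustrative only, not Cargo’s actual implementation; real lockfiles record a checksum per dependency and the tool refuses to proceed on a mismatch):

              ```ts
              // Verify a fetched dependency archive against the hash recorded in a lockfile.
              import { createHash } from "node:crypto";
              import { readFile } from "node:fs/promises";

              interface LockEntry {
                name: string;
                archivePath: string; // hypothetical: where the fetched archive was written
                sha256: string;      // hash recorded when the dependency was last resolved
              }

              async function verify(entry: LockEntry): Promise<void> {
                const data = await readFile(entry.archivePath);
                const actual = createHash("sha256").update(data).digest("hex");
                if (actual !== entry.sha256) {
                  throw new Error(`${entry.name}: checksum mismatch, refusing to build`);
                }
              }
              ```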

              In the case of an API definition document, I think it is probably wise to check that JSON file in and run the generation step at build time. Or to use a git submodule, or to get it from a packaging system (like cargo) that will do the requisite hash checking.

              It seems likely that code generation tools could also be invoked as sub-processes with critical privileges dropped, and allowed only mediated access to the build tree through some interface provided by the build tools.
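
              A low-tech sketch of the mediation half of that idea (it does not actually drop privileges, which would need OS-level sandboxing such as bwrap or seccomp; “my-codegen” and the paths are hypothetical): the generator writes to stdout, and only the build tool ever touches the build tree.

              ```ts
              // The build tool runs the generator as a child process and is the only
              // party that writes into the build tree; the generator never gets a path.
              import { execFile } from "node:child_process";
              import { promisify } from "node:util";
              import { mkdir, writeFile } from "node:fs/promises";
              import { dirname, join } from "node:path";

              const run = promisify(execFile);

              async function generate(buildDir: string): Promise<void> {
                // Hypothetical generator that emits the generated source on stdout.
                const { stdout } = await run("my-codegen", ["--input", "api.json"]);
                const target = join(buildDir, "generated", "client.ts");
                await mkdir(dirname(target), { recursive: true });
                await writeFile(target, stdout);
              }
              ```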

              None of this protects you against the thornier problem, though: that your software supply chain includes software that you didn’t write, and which you have probably not audited line by line. Cargo (or any of the other contemporary package managers for other languages) is perfectly happy to include code that contains subtle exploits within your software if you ask it to. This is, in general, a social problem as much as or more than it is a technical one – that we use software we have not personally vetted. I’m not sure what the solution is, really. No amount of sandboxing during the build is going to prevent the injection, from a dependency, of code nobody has yet spotted that will sometimes erase all the files it can see at runtime later.

              I don’t think you’re making the argument that the problem is that storing the generated files is prohibitively costly from a resource standpoint, but more that you just don’t like looking at them.

              I’m making both arguments, and though there is an aesthetic component I don’t believe it’s purely aesthetic. I also think “aesthetic” is a loaded term that gets thrown around in these sorts of arguments in order to minimise another person’s perspective and expectations. One person’s important and objective requirement is another’s mere aesthetic preference, and so on.

              On the subject of size: git repositories only grow in size over time. While shallow clones are possible, they’re not perfect and they can’t be used for everything. While many of us are fortunate to have fast and reliable internet access and large storage devices in our workstations and laptops, not everybody does – and nor should they be required to! Keeping repositories small where we can has quality of life benefits, like reducing the time, bandwidth, and storage space required for clones.

              On the subject of tools, and looking at the repository: I agree that we use tools to examine things, but I also think things are better when we don’t require tools to understand things – when we don’t merely hide the complexity, but rather try to eliminate it. If I have to choose between a repository where each commit is something I can understand by looking at it with git diff, versus a repository where I need a tool to hide all the fluff created by checking in generated artefacts and vendoring things, I’ll always choose the former.

            2. 1

              We don’t check in the result of the preprocessor because that’s part of the C language, so no additional tooling is required. A better analogy would be using an API client that uses gRPC under the hood. I should not need to install the gRPC toolchain locally to use that library.

              1. 1

                I don’t think it’s really a better analogy, though – after all, you had to install the C compiler too!

          3. 4

            The isolation can also be achieved at the system level, but it requires a lot of effort. SELinux can provide a nice solution for per-project setups that can’t own you via build hooks. But writing those policies for myself was hard, and I don’t know how I’d ever generalise that for other people.

            And then if you’re not on Linux, there’s no good solution short of creating full per-project accounts and switching into them as needed.

            1. 3

              Are you blogging it somewhere? If yes, I would be interested in reading about such a setup.

              1. 3

                No… But maybe I should.

                1. 2

                  Please, please consider doing so. There’s countless resources out there telling people how to disable SELinux when it gets in the way, and almost none telling people how to write or correct a policy.

            2. 2

              OPAM uses sandboxing by default, using bwrap on Linux.

            3. 9

              Also from TFA:

              This attack vector isn’t unique to npm. Other package managers like pip and RubyGems allow for the same thing.

              1. 7

                Yup, Cargo as well. From rust-secure-code/cargo-sandbox#3:

                tl;dr: build-time attacks are stealthier than trojans in build targets, and permit lateral movement between projects when attacking a build system. The threat of a build-time trojan, versus a source code trojan, is an attack that does not leave behind forensic evidence and is therefore harder to investigate. Attacking a build system also potentially permits lateral movement between build targets.

                I don’t know what the state of the working group is, but there was definitely some interest in fixing this for Cargo.

                1. 2

                  Worth noting that the Python world at least has been trying to move away from an install-time “build” step/executable package manifest. The first big chunk of that was the introduction, years ago, of the wheel (.whl) package format, which contains everything – including compiled extensions – already built and organized, so that the install step can just consist of unpacking it and putting files in the correct locations.

                  There also has been support for many years for declarative package manifests; originally as setup.cfg, and now with the package-related APIs being standardized and genericized, coalescing around the pyproject.toml file.

                  1. 1

                    I don’t think wheels opt you out of post-install steps, right? A Python package can ship with precompiled extensions and still invoke a post-install hook. So I don’t think wheels solve the underlying security issue here; rather, they avoid forcing everyone to install gfortran and make/gcc and sit through long compilation times.

                    FWIW RubyGems integrates with Diffend, a service which continuously inspects uploaded packages and does automated security auditing.

                    1. 3

                      I know of no hooks in the wheel format (spec is here) which would allow a wheel package to execute custom code on the destination machine, whether as pre-, during-, or post-install.

                      Again, a wheel is literally a zip file containing everything pre-built; installing a wheel consists of unzipping it and moving the files to their destinations (which are statically determined). You can read the full process spec, but I do not see any mention of a post-install scripting hook in there, and strongly suspect you are either misremembering or misinformed.

                      1. 1

                        I’m not super familiar with the spec, but I thought it was possible to release a package with both a wheel and a setup.py script, where the post-install hook would be defined. I’ll read the links.

                  2. 1

                    Is Maven similar? I’m pretty certain it’s doing the same, but I’m not sure.

                    Although most people use some private repository, they still initialize it with some public code.

                  3. 6

                    What would an ideal JavaScript dependency management system look like?

                    1. 6

                      It’s a good question. I’m not sure that npm is all that different from most other dependency managers. My feeling is that it’s more cultural than anything – why do JS developers like to create such small packages, and why do they use so many of them? The install script problem is exacerbated because of this, but really the same issue applies to RubyGems, PyPI, etc.

                      There are some interesting statistics in Veracode’s State of Software Security - Open Source Edition report (PDF link). Especially the chart on page 15!

                      Deno’s use of permissions looks very interesting too, but I haven’t tried it myself.

                      1. 9

                        I’m not sure that npm is all that different from most other dependency managers. My feeling is that it’s more cultural than anything – why do JS developers like to create such small packages, and why do they use so many of them?

                        I thought this was fairly well understood; certainly it’s been discussed plenty: JS has no standard library, and so it has been filled in over many years by various people. Some of these libraries are really quite tiny, because someone was scratching their own itch and published the thing to npm to help others. Sometimes there are multiple packages doing essentially the same thing, because people had different opinions about how to do it, and no canonical std lib to refer to. Sometimes it’s just that the original maintainers gave up, or evolved their package in a way that people didn’t like, and other packages moved in to fill the void.

                        I’m also pretty sure most people developing applications rather than libraries aren’t directly using massive numbers of dependencies, and the ones they’re pulling in aren’t “small”. Looking around at some projects I’m involved with, the common themes are libraries like react, lodash, typescript, tailwind, material-ui, ORMs, testing libraries like Cypress or enzyme, client libraries e.g. for Elasticsearch or AWS, etc… The same stuff you find in any language.

                        1. 4

                          It’s more than just library maintainers wanting to “scratch their own itch.” Users must download the JS code over the wire every time they navigate to a website. Keeping bundle sizes small is a problem that only JS and embedded systems really have to worry about. Large utility libraries like lodash are not preferred without tree-shaking, which is easy to mess up and non-trivial.
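
                          For instance (a generic illustration, not tied to any particular bundler):

                          ```ts
                          // Whole-library import: drags most of lodash into the bundle unless the
                          // bundler manages to tree-shake the CommonJS build (it usually can't).
                          import _ from "lodash";
                          // Per-function import from the ESM build: only debounce ends up in the bundle.
                          import { debounce } from "lodash-es";

                          window.addEventListener("resize", _.debounce(() => console.log("resized"), 200));
                          window.addEventListener("resize", debounce(() => console.log("resized"), 200));
                          ```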

                          People writing Python code don’t have to worry about numpy being 30MB; they just install it and move on with their lives. Can you imagine if a website required 30MB for a single library? There would be riots.

                          I wrote more about it in a blog article:

                          https://erock.io/2021/03/27/my-love-letter-to-front-end-web-development.html

                          1. 1

                            Sure, but that’s just the way it is? There is no standard library available in the browser, so you have to download all the stuff. It’s not the fault of JS devs, and it’s not a cultural thing. At first people tried to solve it with common CDNs and caching. Now people use tree-shaking, minification, compression etc, and many try pretty hard to reduce their bundle size.

                        2. 3

                          I was thinking about Deno as well. The permission model is great. I’m less sure about URL-based dependencies. They’ve been intentionally avoiding package management altogether.
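
                          For reference, a tiny illustration of how the permission model surfaces in code (the file name and host below are just for the example): the script only works if the invoker grants it network access to that one host, e.g. deno run --allow-net=example.com fetch_example.ts, and everything else stays denied.

                          ```ts
                          // fetch_example.ts: run with `deno run --allow-net=example.com fetch_example.ts`.
                          // Without that flag the permission check reports "prompt" or "denied",
                          // and the fetch itself would be blocked by the runtime.
                          const status = await Deno.permissions.query({ name: "net", host: "example.com" });
                          if (status.state !== "granted") {
                            console.error("No network permission for example.com; exiting.");
                            Deno.exit(1);
                          }
                          const res = await fetch("https://example.com/");
                          console.log(`example.com answered with HTTP ${res.status}`);
                          ```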

                        3. 2

                          It’s at least interesting to consider that with Deno, a package might opt to require only limited access, and the installer/user might opt to invoke a hypothetical JS/Deno-powered dependency resolver/build system with limited permissions. It won’t fix everything, but it might at least make it easier for a package to avoid permissions it does not need?

                          1. 0

                            hideous, I assume

                            1. 1

                              What would an ideal JavaScript dependency management system look like?

                              apt

                              1. 4

                                apt also has install scripts

                                1. 1

                                  with restrictions to ensure they are from a trusted source

                                  1. 4

                                    You mean policy restrictions? Because that only applies if you don’t add any repos or install random downloaded Debs, both of which many routinely do

                                    1. 1

                                      yeah

                                  2. 1

                                    Yes, but when you use Debian you know packages go through some sort of review process.

                              2. 5

                                Small plug for LavaMoat, which includes tools to disable dependency lifecycle scripts (e.g. “postinstall”) via @lavamoat/allow-scripts.

                                npm’s lifecycle scripts have been quite a source of frustration, even before they became a popular avenue for malicious actors. Packages often include scripts that assume things about their environment, or only work on some particular platform, or pull down remote files that change and move. It’s quite a mess.

                                1. 1

                                  curl | bash I understand. npm and its complexities, nope