1. 8

  2. 7

    I’ve been watching the Python community build out PyPI of late. I sincerely don’t know how you make a community both welcoming to new contributors and first-time module authors and yet safe from this kind of attack.

    I’m not sure it’s a solvable problem.

    1. 4

      Idk, I feel like Linux distros have been doing a pretty good job for decades. It seems it’s far harder to compromise GPG keys than it is to compromise a GH/PyPI/etc. login. The real problem is there’s no low-barrier-to-entry way to get mass adoption of a package system using GPG, because the ergonomics are awful.

      1. 9

        They have! But they’ve done so with a tremendous trade-off in terms of time to release. If that works for your use case, fantastic! Rock on with your bad self! But there are other use cases where getting the very latest code really IS important.

        The distro model also relies on the rarefied fairy dust that is the spare time, blood, sweat, and tears of distro / package maintainers, and thus doesn’t scale well at all.

        1. 5

          I think a big part of that time trade-off comes from the fact that distro maintainers do a lot more than build and publish packages: they test that packages all build together, don’t break distro functions, etc. IMO the real issue with weakly secured package repositories is that it’s a big burden to get package developers to just sign their packages. The ideal package repository for me does the following:

          • packages must be cryptographically signed by one of the authors
          • signatures are validated by package managers at download/install time
          • new versions of an existing package must be signed by a key in the same signature chain(s) as the last published version, except in the following scenarios:
            • explicit handoff of ownership via a token, signed by the previous key, that contains the signature of the root of the new chain; subsequent packages can be signed by either key unless the token includes a revoke-signature-rights flag that prevents the previous key from being used
            • to support lost keys, the repository administrators can sign the same type of token mentioned above after a verification step (such as verifying ownership of the email attached to the GPG key, a signed tag on the relevant git repo, etc.)
          • packages are namespaced with repo username or group by default. This supports forks and forces an acknowledgement of a package’s owner(s) onto the user. Most git hosts work this way anyway
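
          The first two requirements are easy to sketch. Here’s a rough illustration of sign-at-publish / verify-at-install, using Ed25519 via the third-party cryptography package standing in for GPG; the publish/install functions and the record layout are invented for the example, not any real repository API:

          ```python
          from cryptography.exceptions import InvalidSignature
          from cryptography.hazmat.primitives.asymmetric import ed25519

          def publish(archive: bytes, author_key: ed25519.Ed25519PrivateKey) -> dict:
              """Author signs the package archive; the repo stores the signature alongside it."""
              return {
                  "archive": archive,
                  "signature": author_key.sign(archive),
                  "public_key": author_key.public_key(),
              }

          def install(record: dict) -> bytes:
              """Package manager refuses to install anything whose signature doesn't check out."""
              try:
                  record["public_key"].verify(record["signature"], record["archive"])
              except InvalidSignature:
                  raise RuntimeError("signature check failed; refusing to install")
              return record["archive"]
          ```

          The point is just that the verify step lives in the package manager, so a tampered archive fails closed instead of installing silently.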

          The only real barrier to doing something like this is adoption, due to the overhead of creating and maintaining signing keys on the publisher’s end. Part of the reason npm/pypi/etc. are so ubiquitous is there’s basically zero barrier to entry, which is not what I want my software to rely on.

          1. 8

            Now factor several other variables into your ideal:

            • Most packaging systems are built by volunteer help on volunteer time
            • They need to operate at crazy bananas pants scale. Pypi had 400K packages at last count I saw.
            • People have legitimate needs for development purposes of being able to get the VERY latest code when they want/need it.

            I think all of what you’re saying here is spot on, I just don’t know how you actually make it real given the above. You’re comparing to the Linux distro model, where the entire universe of packages is in the 30-60K range according to the web page I just saw.

            1. 2

              Most packaging systems are built by volunteer help on volunteer time

              True, but there are more complex, ambitious projects (like Matrix) that are also built by volunteers. Hell, you could probably build a sustainable business model by selling access to such a repository in a B2B fashion.

              They need to operate at crazy bananas pants scale. Pypi had 400K packages at last count I saw

              I mean, yeah? It’s still read-heavy, which is easier to scale out than write-heavy systems.

              People have legitimate needs for development purposes of being able to get the VERY latest code when they want/need it

              This requirement isn’t really mutually exclusive with my ideas above. If you need to operate on the latest unpublished code, you should just clone master of the code itself and go from there. I’m not saying you have a group of volunteers (or employees) comb through published packages and sign them themselves; I’m saying you force signatures on any package uploaded to the repo from the person who wrote the code and is publishing it. The obvious problem with that is adoption, because who wants to go through the bs process of setting up GPG/PGP keys? It’s a pain.

              1. 6

                I hardly think it’s fair to say Matrix is developed by volunteers…

                1. 1

                  Who is it developed by then?

                  1. 3

                    New Vector Limited

        2. 8

          …GPG because the ergonomics are awful.

          I’ve had probably a dozen keys over the years, many of which were created improperly (e.g. no expiration) because I was literally just doing it to satisfy some system that demanded a key.

          So, on top of the bad ergonomics around GPG in general, you also have the laziness / apathy / resentment of developers who didn’t actually want to create a key and view it as an annoyance to contend with. Like, how long do we think it would take before people started committing their private keys to avoid losing them or having to deal with weird signature chains to grant access to collaborators?

          1. 3

            PyPI already supports PGP-signing your packages, and has supported this for many years. Which should be a big hint as to its effectiveness.

            1. 1

              Not just supporting PGP/GPG signatures; enforcing them. And yeah, that ecosystem sucks.

              1. 5

                Tell me how you’d usefully enforce in an anyone-can-publish package repository like PyPI. Remember that distros only manage it because they have a very small team of trusted package publishers who act as the gatekeepers to the whole thing, and so there’s only a small number of keys and identities to worry about.

                In an anyone-can-publish package repository it’s simply not feasible to try to verify keys for every package publisher, especially since packages can have multiple people with publish permissions and the membership of that group can change over time. All you’d be able to say is “signed with a key that was listed as one of the approved keys for this package”, which then gets you back to square one because an account takeover would let you alter the list of approved keys (and requiring that changes to approved keys be signed by prior approved keys also doesn’t work because at the scale of PyPI the number of lost/expired/etc. keys that will need to do a recovery workflow would be enough to still allow the basic attack vector that worked here — take over an expired domain and do a recovery workflow).

                1. 1

                  packages can have multiple people with publish permissions and the membership of that group can change over time

                  Yes, I didn’t go into detail because it’s a lobsters comment, not a white paper, but the idea is that only a revoke/removal of a key from the approved keylist of a package can be done without a signed grant from a previously supplied key. What this means is the first person to upload a version of a package will sign it, then that key will have to be used to add any additionally allowed keys via a signed token grant. Allowed keys are explicitly not tied directly to group membership (except maybe an auto-revoke being triggered by a member being removed from a group), or really to accounts at all.

                  Handling the recovery workflow is the hardest part to get right. In the case of an expired key, supplying a payload from the email attached to the key and account (you should probably also enforce that key emails and account emails match) signed with the expired key is significantly better than simply sending a magic link with a temporary URL. For supporting lost keys, I can’t think of a way to do it safely without basically just making a new “package lineage” under a new namespaced account or something. Either way, the accounts would still only be as secure as the security practices of the users on the publishing end, so there’s only so much you can do.
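
                  A toy version of that keylist rule, just to make the lineage idea concrete (again Ed25519 via the third-party cryptography package standing in for GPG; the class and method names are invented for this sketch, and expiry/revocation/recovery are omitted):

                  ```python
                  from cryptography.exceptions import InvalidSignature
                  from cryptography.hazmat.primitives import serialization
                  from cryptography.hazmat.primitives.asymmetric import ed25519

                  def raw(public_key: ed25519.Ed25519PublicKey) -> bytes:
                      """Raw 32-byte encoding of a public key, used as the payload of a grant."""
                      return public_key.public_bytes(
                          serialization.Encoding.Raw, serialization.PublicFormat.Raw
                      )

                  class PackageLineage:
                      """Approved signing keys for one package.

                      The first uploaded version establishes the root key; every later
                      key must be added via a grant (the new key's raw bytes) signed by
                      a key that is already approved.
                      """

                      def __init__(self, root_key: ed25519.Ed25519PublicKey):
                          self.approved = [root_key]

                      def add_key(self, new_key, grant_signature: bytes) -> None:
                          # Accept the new key only if some approved key signed the grant.
                          for approver in self.approved:
                              try:
                                  approver.verify(grant_signature, raw(new_key))
                                  self.approved.append(new_key)
                                  return
                              except InvalidSignature:
                                  continue
                          raise PermissionError("grant not signed by any approved key")

                      def verify_release(self, archive: bytes, signature: bytes) -> bool:
                          # A release is valid if any approved key produced its signature.
                          for key in self.approved:
                              try:
                                  key.verify(signature, archive)
                                  return True
                              except InvalidSignature:
                                  continue
                          return False
                  ```

                  An account takeover alone buys the attacker nothing here: without a signed grant from the existing chain, their key never enters the approved list, and their uploads fail verification.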

          2. 2

            I don’t understand why we stick to flat namespaces, or rather, why that implies separate authentication. What’s wrong with the Go way of doing things? Why can’t we go directly to GitHub (and friends) for our dependencies, instead of having pypi / npm / cargo in between?

            1. 3

              I guess the only problem that solves is typosquatting? Because maintainer account compromise and repojacking will still get you malicious code.

              1. 1

                This topic brings out strong opinions on all fronts :) See @ngp’s eloquent statement of the exact and total opposite opinion that we should have MORE in between, not less.

              2. 2

                An open community is not defined by a single central register of packages where all dependencies are pulled from by default by just adding some sort of identifier in your project.

                It is a solvable problem and it has been solved. We just broke it relatively recently with this horrible idea of pulling a tree of dependencies with hundreds of nodes, whenever we want to left pad a string representation of an integer.

                The solution is: don’t import arbitrary dependencies dozens at a time just because there is a simple way to do it. It was never a good idea. Not that package managers are a bad idea per se; it’s the way they’re [ab]used. The means to do it can perfectly well be there, just use them reasonably.

                Pearl’s CPAN was probably the first instance of these central package repositories. But it always posed itself as a convenience with no authoritative instance. Multiple mirrors existed with different sets of packages available. It was just always an easier way to download code, not a hijacker of a programming language’s import routine.

                1. 1

                  Guessing you mean “perl” but point taken.

              3. 6

                “Popular” and “heavily downloaded” are word choices in this article.

                If you read the actual security advisory from PyPI, which this article conspicuously fails to link to, you will find information about

                • How many other packages (one) had this one as a dependency.
                • The package download count. The article claims “over 20,000 times a week”, the advisory points out the historical numbers were way lower and a recent spike was due to pushing new releases which then got picked up by caching PyPI mirrors all around the world.
                • How the compromise happened (the domain associated with the original developer’s publicly-viewable email address expired, someone else re-registered the domain and ran a password reset flow).
                1. 3
                  1. 2

                    I typically like a good proof of concept, but I found this to be fairly childish. The write-up itself doesn’t tell the reader much they don’t already know (if you control the account, you control the package) and this proof erred on the side of being malicious:

                    Also, all these processes finished in a few days using free AWS EC2. Also, creates.io is crashed when I’m scraping all the data sorry for that.

                    He didn’t throttle his scraper and crashed a public website while doing this. He should have slowed it down.

                    But my latest versions download less than “0.1.2” version so I thought that some packages use CTX package on “requirements.txt” file and I modify “0.1.2” version with mine code.

                    His choice of “proof” was to mine OS environment variables, which could have been extremely lucrative if he weren’t on the up-and-up. The only warm fuzzies we get that he didn’t do something with the credentials is the assertion “ALL THE DATA THAT I RECEIVED IS DELETED AND NOT USED.” He should have sent some dummy value.

                    There are many good resources about vulnerability.

                    This is just a list of news stories talking about his proof of concept, not the vulnerability. To me, all of this reads as an ego trip that has been couched after the fact as security research.