1. 49
    1. 15

      I think people over-index on the size and volume of PyPI being a serious issue. It is an issue, but in a situation 5 years from now where Fastly is unwilling to continue providing CDN and no one else is willing to step in, there are a number of short-order things the Python ecosystem could do. For example:

      1. A significant portion of PyPI’s traffic is entirely unnecessary: it comes from CI/CD systems that don’t bother to put a caching layer between themselves and PyPI because Fastly has done too good of a job already. If that were to cease to be the case, we’d likely see GitHub Actions, CircleCI, etc. bring up their own caching layers overnight. From recent napkin math, this would bring traffic down by at least 25%.
      2. Much of PyPI’s traffic is caused by package resolution, e.g. pip having to fetch multiple distributions and unpack their metadata to perform candidate selection. Wheels have made this better by (1) pre-baking metadata, and thus (2) making it possible for the index to begin serving sidecar metadata for resolution (see the sketch at the end of this comment).

      Going back to source builds would reduce the overall size of the index, but would undermine larger-picture efforts like (2). It would also bring us back to people sneaking binaries into their source distributions. Both of these could in theory be avoided by a new sdist-type format that’s half-wheel-and-half-sdist, which isn’t necessarily a bad idea in its own right! But such a format wouldn’t change the fact that Python succeeds in part because of a huge tail of important legacy packages, which won’t be switching away from obtuse, bespoke C/C++ builds any time soon.
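
      To make (2) concrete: here is a rough sketch of what resolving against sidecar metadata looks like, assuming PyPI’s JSON Simple API (PEP 691) and PEP 658/714-style metadata files. The field names should be double-checked against the specs; this is illustrative, not the resolver pip actually uses.

      ```python
      import json
      import urllib.request

      def peek_metadata(project: str) -> None:
          # Ask the JSON Simple API for the project's file list.
          req = urllib.request.Request(
              f"https://pypi.org/simple/{project}/",
              headers={"Accept": "application/vnd.pypi.simple.v1+json"},
          )
          with urllib.request.urlopen(req) as resp:
              index = json.load(resp)
          for file_info in index["files"]:
              # Files advertising core metadata can be inspected without
              # downloading the whole wheel: fetch "<file-url>.metadata" instead.
              if file_info.get("core-metadata"):
                  with urllib.request.urlopen(file_info["url"] + ".metadata") as meta:
                      print(meta.read().decode("utf-8", "replace")[:300])
                  break

      peek_metadata("numpy")
      ```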

      1. 5

        CI/CD systems that don’t bother to put a caching layer

        One of the issues we run into at work is that GitHub’s native action cache is pretty slow. This might be because we use self-hosted workers in Australia and the cache is in the US. So we built our own basic S3 tar upload option, but I think it’d be nice if package managers provided a built-in “use this S3 bucket as a cache” option.
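
        For what it’s worth, our setup is roughly this shape. A simplified sketch only: the bucket and key names are made up, and in practice you’d key on something like a lockfile hash.

        ```python
        import os
        import subprocess

        import boto3  # assumes the runner already has AWS credentials

        BUCKET = "ci-pip-cache"              # hypothetical bucket
        KEY = "pip-cache/main.tar.zst"       # hypothetical key
        CACHE_DIR = os.path.expanduser("~/.cache/pip")

        def restore_cache() -> None:
            boto3.client("s3").download_file(BUCKET, KEY, "/tmp/pip-cache.tar.zst")
            os.makedirs(CACHE_DIR, exist_ok=True)
            subprocess.run(["tar", "--zstd", "-xf", "/tmp/pip-cache.tar.zst",
                            "-C", CACHE_DIR], check=True)

        def save_cache() -> None:
            subprocess.run(["tar", "--zstd", "-cf", "/tmp/pip-cache.tar.zst",
                            "-C", CACHE_DIR, "."], check=True)
            boto3.client("s3").upload_file("/tmp/pip-cache.tar.zst", BUCKET, KEY)
        ```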

        1. 5

          Might be worth raising this idea with Astral, since they might have a good idea of what to do with some of this.

          1. 3

            How does sending data to and from S3 count as a “cache”?

            You know what’s really nice, using a self-hosted runner and the local filesystem.

            1. 5

              How does sending data to and from S3 count as a “cache”?

              It counts in the sense that it’s a cache of our downloads from PyPI. It’s also the simplest approach that a) matches GitHub’s native caching actions/service (making it an easy drop-in replacement); b) works reliably across multiple runners, especially ephemeral ones; and c) is faster than PyPI because S3 is close to our EC2 machines.

              The value of a content-addressable remote cache can’t be overstated, for all the reasons @rtpg mentioned. Being able to delegate persistence to a single place, freeing up your other components to be ephemeral, is valuable. Not just in the context of CI runners, but also, for example, in allowing developers to use the same cache for their local workloads. In practice, companies that want this today use Artifactory, Bazel’s remote caching, sccache, etc. I’d really like this to be a native consideration for package managers, though.

              Worth noting that I said S3 since that’s the easiest option for our use case, but having the cache backend be pluggable means you can easily have a local directory backend. The key requirement would just be that it has to be a properly atomic content-addressable store, so that it’s safe to use concurrently. Additionally, there’s no reason there can’t be tiered caching with a preference for the file system first; really, all I’d like is the ability to have a remote cache option in addition to a local cache and the upstream index. We can do this today with git by using mirrors effectively, so why not with package managers?
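
              A minimal sketch of the kind of atomic, content-addressable local backend I have in mind (the layout and hashing scheme are illustrative):

              ```python
              import hashlib
              import os
              import tempfile

              def cas_put(root: str, data: bytes) -> str:
                  # Content addressing: the key is the hash of the bytes themselves.
                  digest = hashlib.sha256(data).hexdigest()
                  path = os.path.join(root, digest[:2], digest)
                  if os.path.exists(path):
                      return path  # duplicate writes are harmless no-ops
                  os.makedirs(os.path.dirname(path), exist_ok=True)
                  fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
                  with os.fdopen(fd, "wb") as f:
                      f.write(data)
                  os.replace(tmp, path)  # atomic rename: readers never see partial files
                  return path

              def cas_get(root: str, digest: str) -> bytes:
                  with open(os.path.join(root, digest[:2], digest), "rb") as f:
                      return f.read()
              ```

              Swap the filesystem calls for object-store puts and gets and the same shape works as the remote tier.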

              You know what’s really nice, using a self-hosted runner and the local filesystem.

              I agree that this is the ideal and I am working on making this how our work CI runners are set up. In practice you do run into issues, mainly due to isolating or not isolating individual CI jobs. The clearest case is when the package manager’s cache directory isn’t safe to be accessed from multiple instances running in parallel, which leads to sporadic CI failures with “file could not be found”.

              I do believe that this is a suboptimal state of affairs and not a great reflection of software engineering practices, but these are the trade-offs we currently have.

              1. 3

                Often these caches are intermediate steps, so you’re looking at just a whole blob of what you want. For PyPI it wouldn’t be that helpful, though, honestly.

                Like the “canonical” way of doing this for pip is to cache its cache directory. If you pull that in and then run your install, you’re looking at zero calls to PyPI, and only a single bulk download.
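
                Roughly like this (PIP_CACHE_DIR is pip’s standard environment variable for --cache-dir; the restored path is made up):

                ```python
                import os
                import subprocess
                import sys

                # Assume /ci/restored/pip-cache was pulled in from wherever the cache lives.
                env = dict(os.environ, PIP_CACHE_DIR="/ci/restored/pip-cache")
                subprocess.run(
                    [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
                    env=env,
                    check=True,
                )
                ```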

                I do think that there should be more done with self-hosted runners and local filesystems, but dealing with CI build machine issues is a thing, and having things like content-addressable remote caches provide a degree of stability that you don’t get with things like machines getting poisoned caches for whatever reason, or otherwise having operational woes downstream (~every project has to deal with “some resource that we didn’t think about got exhausted/expired and now our system is down” about once a year. disk space is obvious, then inodes, some names, DNS entries….)

                You might think this is part of the job, but the end result of “well, we do CI in hermetic spaces” is that people have real CI that works even when the engineering skill level is pretty low. Hermetic-by-default has generated a great deal of software stability; in a previous era, people would just say “well, these two people can run the tests on their machine”, and that setup was present in plenty of real production systems too.

                But at least you can make something correct faster, right?

                (EDIT: to be clear, I like the idea of running everything locally. I just have a hard time getting that to a high enough level of bulletproof that I can feel comfortable with it being set up for everyone. Skill issue on my end)

              2. 1

                One of the issues we run into at work is that GitHub’s native action cache is pretty slow. This might be because we use self-hosted workers in Australia and the cache is in the US.

                FWIW, I’ve found builds to be faster in GH Actions after removing the caching step, and that was running everything on GH infra. Given that not having a cache also makes the action simpler, I just went around removing them, and it never made anything slower.

            2. 15

              I’ve thought about this in the past. Robust cross compilation would probably make the Python packaging ecosystem more reliable and much less frustrating. It could simplify and consolidate the tooling to make Python more ergonomic and intuitive.

              That said, I think it’s hard to sell a 30-35 year old programming language on moving some critical infrastructure to a pre-1.0 language. Zig seems like it’s got legs, but how many other languages rose and fell during Python’s tenure? Are the Java people happy to have a build system based around Groovy?

              That’s not to say that Zig is going to go the way of Groovy. I want to see Zig stick around and flourish. It’s just a hard sell at this stage in its lifecycle.

              1. 4

                I thought about this some more:

                If you want this to happen organically, you could release a tool in the Python ecosystem that builds wheels for packages. It would need to be faster, simpler, more intuitive, etc. If you drive enough adoption through the ecosystem, from small packages to large packages, you can demonstrate first the efficacy and then the stability of the approach. If big enough players are using this in practice, and your solution is scalable to lots of types of projects or provides a reasonable/stable interface, then you might be able to make a strong case to the PSF that this build tool should be part of PyPI to reduce costs.

                Of course, this would take a lot longer and be a lot more complicated, but it’s easier and safer to move from the leaves to the roots of a system than to just start at the roots.

              2. 11

                From my limited experience, Zig provides the easiest to use C/C++ toolchain, especially for cross-compiling.

                It feels odd that a different language is more convenient than any C or C++ toolchain, for C/C++ code. Am I wrong? I feel like I must be missing something.
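
                For example (file name and target are made up; as I understand it, python -m ziglang just forwards its arguments to the bundled zig binary), cross-compiling a C file from any host becomes a one-liner:

                ```python
                import subprocess
                import sys

                # The ziglang wheel ships a zig binary plus libc headers for many targets,
                # so this works without a separate cross-toolchain installed.
                subprocess.run(
                    [sys.executable, "-m", "ziglang", "cc",
                     "-target", "aarch64-linux-gnu",   # pick any supported target triple
                     "hello.c", "-o", "hello-aarch64"],
                    check=True,
                )
                ```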

                1. 10

                  It pursued that goal very explicitly!

                2. 8

                  prebuilding binaries causes an exponential growth of the amount of data PyPI has to store

                  How so? If you have n platforms, k releases, and t total packages, aren’t you looking at cubic growth proportional to n * k * t?

                  I presume the exponential trend the authors witnessed is the exponential growth in t, the total number of packages hosted by PyPI over time.

                  1. 1

                    Or (more likely in my experience) the growth is not actually exponential – it is just FAST. In some cases, people use the term “exponential growth” to mean “rapid growth”. In other cases, people look at the beginning of an x^3 curve and notice it curves sharply upward – you have to follow it a little ways before x^3 becomes clearly smaller than a^x.

                    I don’t think it matters, because x^3 growth is probably ALSO too expensive to pay for, even if it is a lot slower than exponential growth.
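
                    To make the crossover concrete (base 2 here; smaller bases take longer to pull ahead):

                    ```python
                    # x**3 stays ahead of 2**x until around x = 10, then loses badly.
                    for x in range(1, 16):
                        print(x, x**3, 2**x)
                    ```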

                  2. 5

                    PyPI does not do any compiling or building – this is all done by package authors & maintainers who choose what wheels to upload. Turning PyPI into an open resource for arbitrary binary builds would be a huge increase in operational and engineering effort, regardless of what technology is doing the building. Furthermore, given the long tail of legacy packages and their arcane build stories, the level of effort to achieve any degree of reliability across the ecosystem would be massive.

                    The “pypackaging-native” site has some info on the benefits that a build farm for PyPI would provide, but also acknowledges some significant challenges:

                    However, when wishing to follow the widely-used practice among distributions to rely predominantly on shared libraries, this brings a massive further increase in necessary automation and maintenance […] In particular, doing so needs careful tracking which packages are exposed to any given (shared) library’s ABI, the ABI stability of that library as expressed through its version scheme, rebuilding all dependent packages once an ABI-changing library version is released, and finally representing all that accurately enough in metadata so that the resolver will only choose compatible package combinations for a given environment. pypackaging-native.github.io

                    I highly recommend that anyone interested in understanding the depth of the challenges around packaging Python code with non-Python components or dependencies give that whole site a look.

                    1. 4

                      Wow, the tf-nightly package weighs 4.7 TiB; I’ve never seen such huge packages.

                      1. 3

                        Various security-conscious processes and people will pin particular versions of binary artifacts, so rebuilding a (hopefully) equivalent binary will set off lots of alarm bells. E.g. I recall OpenBSD ports running into this issue when, IIRC, GitHub changed the way it generated source code dumps from commits.

                        So this proposal likely does need reproducible builds, not merely repeatable builds (as claimed).

                        (And e.g. quality control is also a reasonable motivation for pinning binaries!)
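
                        The mechanism is simple enough to sketch; the function and names here are purely illustrative:

                        ```python
                        import hashlib

                        def verify_artifact(path: str, pinned_sha256: str) -> None:
                            with open(path, "rb") as f:
                                digest = hashlib.sha256(f.read()).hexdigest()
                            if digest != pinned_sha256:
                                # A byte-for-byte different rebuild lands here, even if it
                                # is functionally identical to the pinned artifact.
                                raise RuntimeError(f"hash mismatch: {digest} != {pinned_sha256}")
                        ```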

                        1. 3

                          The real problem is C doesn’t have a build system, it has some build-system-like tools upon which everyone creates their own boutique build system. So you never know how to actually correctly build a C program, let alone do it cross-platform. Zig is brave as hell for trying to make an actual open-source C build system, something that nobody in the last 50 years or whatever has actually tried very hard to do. (oh, and it also has a neat programming language attached.)

                          1. 2

                            The author briefly mentions this under “We saw the beginnings of this in the Zig project and immediately moved to a self-hosted solution”, but I want to highlight that hyperscalers’ egress prices are egregious. OVH offers “10Gbit/s unmetered and guaranteed” as a $550/mo add-on per server. With the port utilized at only 50%, you can transfer 1,643.625 TB through it per month, which would cost about $56,586/mo on CloudFront. Sure, comparing a full-blown CDN and a 10 Gbit port is a bit apples to oranges, but the difference is a staggering 100x!
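
                            The back-of-the-envelope math, for the curious (the CloudFront total is the figure quoted above; actual per-GB rates vary by volume tier):

                            ```python
                            port_gbps, utilization = 10, 0.5
                            seconds_per_month = 365.25 / 12 * 24 * 3600        # ~2,629,800 s
                            tb_per_month = port_gbps * utilization / 8 * seconds_per_month / 1000
                            print(round(tb_per_month, 3), "TB/month")          # ~1643.625 TB
                            print(round(56586 / 550), "x price difference")    # ~103x
                            ```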

                            Somehow, Debian manages to keep distributing pre-built binaries and is not planning to go back to forcing everyone to build from source. One important benefit of pre-built packages is speed and convenience. Just think how long it would take for everyone to rebuild numpy and tensorflow on every download (I cannot imagine doing it on my poor Raspberry Pi 3 B+).

                            1. 2

                              Note that you can combine the ability to build from source with prebuilt cache artifacts; just look at Nix, for example. It would certainly make sense to cache binary builds of expensive and commonly fetched packages. How much of that massive volume of stored data is infrequently fetched, though?

                            2. 2

                              I do wonder how companies could be incentivised to do the “right thing” interacting with PyPI, especially in CI. PyPI “just works” for me but would I pay for a better version of it, even for my own use? I think so!

                              Regarding build suites, though: is the proposal here to have Zig just be available in the Python toolchain? The prebuilt binaries aren’t just about LLVM, but about the third-party deps that are needed (for example, do you want to talk to Postgres? You’ll need some third-party lib headers for compilation). My impression is that the precompilation is also about alleviating those pains (because pip and friends can’t just install random .debs on your system).

                              1. 2

                                is the proposal here to have Zig just be available in the Python toolchain?

                                That’s already available: Zig is on PyPI, and you can get it just by running pip install ziglang.

                                Like the prebuilt binaries aren’t just about LLVM, but about third party deps that are needed (for example, do you want to talk to postgres? You’ll need some third party lib headers for compilation)

                                In the specific use case of a Postgres client, I think the Python package should just bundle those headers, but more generally the Zig compiler can download all the dependencies you need (assuming those have been packaged as well). As an example, look at this build.zig.zon file: it mentions 3 dependencies:

                                • Haivision/srt, the upstream project
                                • mbedtls
                                • googletest

                                When you run zig build, all three of these dependencies are downloaded and their build.zig is run if present (the first one doesn’t have a build.zig, since it’s just the vanilla upstream project that we are providing a build script for).

                                The work to package everything must still happen, but once it’s done correctly you get the ability to build from any host for any target.

                                1. 1

                                  Thanks for that example. It’s hard for me to know for certain to what extent this is workable across the board (Like do these kinds of headers end up getting patched by distros to deal with distro-specific issues?), but like with a lot of packaging things, if you’re able to get most of the way there for the “easy” cases, then the trickier setups will be all that remains.

                                  I do feel a bit odd about this strategy overall, though. Python setup scripts can compile stuff from source, but the current state of the art is that they do it with LLVM and friends. It feels like we should be able to just start pointing at Zig for these compilations without saying “now you need to learn these Zig-isms and use its build tooling”. You can write a C extension in Python without referring directly to the compiler being used, after all.

                                  There’s maybe a higher-level point here: Python doesn’t have a great way of pulling in other-language dependencies the way Zig does. It just feels like an opportunity to improve the Python tooling rather than to say “we should just use Zig’s build tooling”.

                                2. 1

                                  is the proposal here to have Zig just be available in the Python toolchain?

                                  Zig has been distributed via PyPI for quite some time now: https://pypi.org/project/ziglang/

                                  The proposal is essentially to lean more on storing and sending source rather than binaries.

                                  Related: cargo-zigbuild

                                3. 2

                                  I can’t be the first person to suggest this, but why not use BitTorrent for package distribution?

                                  1. 3

                                    I’ve played with this off and on a few times before, and I think it’s a really good idea but not a trivial one. The main problem with any DHT system is that there’s always a startup lag. It takes a few seconds for a system to figure out what other nodes are around, make sure they have the files it wants, and start asking for them, and that’s in the case where the network is working perfectly. So if you’re downloading one 10 GB file, adding a few seconds onto the beginning of it is fine, but if you’re downloading 10,000 1 MB files, then you add a few seconds onto the beginning of each one. Having multiple requests in flight at once helps, and having a centralized tracker (as BitTorrent does) helps, but neither eliminates it entirely.
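
                                    Rough numbers, assuming the per-file startup cost can’t be amortized at all:

                                    ```python
                                    startup_s, files = 3, 10_000   # a few seconds each, many small files
                                    print(startup_s * files / 3600, "hours of overhead if fetched serially")
                                    # Parallel requests divide this down, but don't make it disappear.
                                    ```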

                                    1. 1

                                      Hmm… Sounds like what is needed is a protocol that allows that startup time to be shared across the large number of small files that are being downloaded. Something similar to the way HTTP/2 allowed multiple requests to be multiplexed over the same connection.

                                      1. 3

                                        The protocol is probably compatible with that, but the bootstrap time for the BitTorrent DHT is on the order of a minute, i.e. longer than it takes to download an entire dependency tree, so parallelism isn’t enough.

                                        1. 3

                                          Yeah I was pondering the possibilities of something along these lines. You can have nodes opportunistically share info, so that if you ask for package X then the nodes that have it tend to have its dependencies, or at least know where to find them. Then it more or less turns into a caching problem.

                                          All of this is basically reinventing the CDN, fwiw. It’s just a CDN that anyone can help operate.

                                      2. 1

                                        You need a certain level of mass adoption to ensure that the packages are available. Otherwise, small packages can fall out of the torrent. Network effects become existential.

                                        1. 2

                                          In theory (and in practice, when I’ve tried it), one seeder is enough for trackerless BitTorrent. So you don’t need mass adoption, just a package repository that seeds everything.

                                        2. 1
                                          1. Implementation complexity.

                                          2. A DHT is much slower to find peers and start a download than traditional methods.

                                          3. For a tracker-backed setup, you need one more service to run with HA.

                                        3. 2

                                          I’m not convinced that this approach has any widespread adoption in the Rust ecosystem when lib.rs lists the linked crate as having 3 direct dependents.

                                          Also, will Zig remain a good cross-compilation tool when the backend is rewritten? Why replace the whole compiler when “all” that’s actually needed is a build system and dependency management, especially with a pre-1.0 project that many Python devs might not have heard of?

                                          This article has some good points, but it seems like the author shoehorned their/their org’s project in.

                                          1. 1

                                            I’m not convinced that this approach has any widespread adoption in the Rust ecosystem when lib.rs lists the linked crate as having 3 direct dependents.

                                            It’s a tool that you install and use, not something you depend on. The relevant number is the 72k downloads per month, which apparently makes it the 18th most popular Cargo plugin. Which is a decent amount, but doesn’t seem that popular.

                                            Also, will Zig remain a good cross-compilation tool when the backend is rewritten? Why replace the whole compiler when “all” that’s actually needed is a build system and dependency management, especially with a pre-1.0 project that many python devs might not have heard of.

                                            Would you even want the build system? I don’t think a Python package repo would have much success telling developers to re-write the build systems of their C dependencies…

                                          2. 1

                                            One nit:

                                            But the button is wrong, this future is all but inevitable …

                                            This implies the future is inevitable, not the opposite, which is what you want to assert.

                                            1. 2

                                              Ah thank you very much, pushed a fix.

                                              1. 3

                                                Just for fun, I can explain the linguistics on this idiom. At first glance it seems to mean not inevitable, because it is the set of things that does not include inevitability. However the implication is that the set of all possibilities includes many many states very close to inevitability, even if it does not include inevitability itself. So the idiom “all but inevitable” could be rephrased literally as, “as close to inevitable as possible, while technically not being 100% inevitable” or in other words, extremely likely to occur.