1. 41
  1. 8

    A lot of this just seems like going against the grain of the distro when using Docker and wondering why that’s not good.

    1. 20

      Isn’t it how everything goes? People are not satisfied with this distro, they create another. People can’t bother with their distro, they create npm, pip, …. People realize language specific ones are not enough, they create conda. People think conda is bloated, they create miniconda. People can’t bother with installing anything, they use Docker. People still need to install things inside Docker, they choose a distro inside the Docker. Ad infinitum.

      1. 5

        People realize language specific ones are not enough, they create conda.

        It is reasonable to ask to manage language-specific packages together with other libraries. Many language packages rely on various C libraries.

        I think this is mostly a failing of traditional distribution package management. If you still insist on writing a .spec or rules file for e.g. every Rust crate that a developer might use from crates.io [1], you are never going to keep up. Additionally, traditional package managers cannot deal well with installing different versions of a package in parallel. Sure, you can hack around both problems, e.g. by automatically generating .spec files and sticking versions into package names, so that ndarray 0.13.0 is not seen as an upgrade of ndarray 0.12.0. But it’s going to be an ugly hack. And you still cannot pin system libraries to particular versions.

        So, you either have to accept that your package manager cannot deal with language package ecosystems and project-specific version pinning. Or you have to change your package manager so that it is possible to programmatically generate packages and permit multiple parallel versions. While they may not be the final solution, Nix and Guix do not have this problem and can just generate packages from e.g. Cargo metadata and deal with multiple semver-incompatible crate versions without any issues.

        [1] Deliberately not using Python as an example here, because Python packaging has many issues of its own, such as not permitting the use of multiple versions of a package by a single Python interpreter.

        1. 2

          This is mostly a failing of the Python ecosystem or our software ecosystem as a whole. The very idea of wanting a distro that “packages the precise set of things you want” (borrowing from another commenter here) is absurd. Users should pin their blames to those developers releasing packages without any sort of backward compatibility. If 10 packages each release a backward-incompatible “update”, users end up with 2^10 = 1024 combinations to choose from. Somehow, people still think it’s a mere package management problem. No.
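The 1024 figure is just one old-or-new choice per package, multiplied out. A minimal sketch of that arithmetic, assuming exactly two candidate versions per package (the function name is made up for illustration):

```python
from itertools import product


def count_version_combinations(num_packages: int, choices_per_package: int = 2) -> int:
    """Count the ways to pick one version per package by enumerating them."""
    return sum(1 for _ in product(range(choices_per_package), repeat=num_packages))


# Ten packages, each either before or after its incompatible release.
print(count_version_combinations(10))
```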

          1. 9

            Users should pin their blames to those developers releasing packages without any sort of backward compatibility

            No, this is a solved problem. Cargo and npm do work with lots of messy dependencies. It’s the old inflexible package managers that aren’t keeping up. Traditional package managers blame the world for having software not fitting their inflexible model, instead of changing their model to fit how software is actually released.

            Cargo/npm have solved this by:

            1. Using Semver. It’s not a guarantee, but it works 99% of the time, and that is way better than letting deps break every time.
            2. Allowing multiple incompatible versions of the same package to coexist. In large dependency graphs, the probability of a conflict approaches 1. Package managers that can’t work around conflicts are not scalable, and end up hating people for using packages too much.
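The two points above can be sketched together. This is a simplified illustration, not Cargo's or npm's actual resolver: it assumes plain `MAJOR.MINOR.PATCH` strings, treats every request as a caret-style minimum, and uses hypothetical helper names. Versions that share a compatibility key are unified to the highest one requested, and versions with different keys are allowed to coexist.

```python
from collections import defaultdict


def parse(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers (no pre-release handling)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def compat_key(version: str) -> tuple[int, ...]:
    """Caret-style rule: the leftmost non-zero component defines compatibility."""
    major, minor, patch = parse(version)
    if major > 0:
        return (major,)
    if minor > 0:
        return (0, minor)
    return (0, 0, patch)


def resolve(requested: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep one version per compatibility group; distinct groups coexist."""
    resolved = {}
    for name, versions in requested.items():
        groups: dict[tuple[int, ...], list[str]] = defaultdict(list)
        for version in versions:
            groups[compat_key(version)].append(version)
        # The highest requested version in a group satisfies every request in it.
        winners = (max(group, key=parse) for group in groups.values())
        resolved[name] = sorted(winners, key=parse)
    return resolved


# "example-crate" is a made-up name; ndarray versions echo the thread.
print(resolve({
    "ndarray": ["0.12.0", "0.13.0", "0.13.1"],
    "example-crate": ["1.0.2", "1.4.0"],
}))
```

Under these assumptions ndarray ends up with two coexisting versions (one per incompatible 0.x line), while the 1.x requests collapse into a single version.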
            1. 1

              Right. Rust and Node solved everything. The rest of the world really can’t keep up. Why don’t we just rewrite everything in Rust and Node? You can have your package links with libA and libB, while libA links with libD.so.0 and libB links with libD.so.1. Wait, the system still has an ancient libc. Right. Those .so files are just relic from the past and it’s a mystery we are still using them. So inflexible.

              Cargo/npm have solved this

              It truly made my day. Thanks. I needed this for the weekend.

              1. 3

                .so files are just relic from the past

                Yes, they are. Sonumbers alone don’t help, because package managers also need a whole requires/provides abstraction layer, renamed packages with version numbers in the name, and evidently they rarely bother with this. Lack of namespacing in C and global header include paths complicate it further.

                To this day I’m suffering from distros that can’t upgrade libpng, because it made a slightly incompatible change 6 years ago. Meanwhile Cargo and npm go brrrrr.

        2. 2

          The problem is at least partially social, not technical. It’s far easier to start a new project than join an existing one, because you have to prove yourself in a project that is new to you. For some projects (say, Debian) there’s also significant bureaucracy involved.

        3. 16

          Docker is irrelevant to the message of the story. They just didn’t have a box with Ubuntu 18.04 around to demonstrate the problems and resorted to a Docker container. A VM or bare metal with 18.04 would have told the same story.

          1. 5

            Could you expand? The general issue is the mismatch between distro release cycles and software development cycles: you need to bootstrap a newer toolchain in most languages. Python has some specific problems others don’t (e.g. no transitive pinning by default, unlike say Cargo, so new package releases are more likely to break things), but every unhappy family etc.

            1. 2

              If you want specific versions of a tool, then perhaps it would be better to switch to a distro that offers that, or remove the distro from the equation.

              1. 8

                What if there is no distro in existence that packages the precise set of things you want? The whole point of language packaging ecosystems is to solve that problem; otherwise the options are “find a way to stick to only what your distro decided to package”, or “make your own distro that system-packages the things you want”, or “make a distro that system-packages every version of everything”.

                And that’s without getting into the fact that distros historically change popular languages and their packages, sometimes in ways that are not compatible with upstream. For example, Django finally gave in and switched the name of its main management command from django-admin.py to django-admin, in part because some distros had absolute hard-line policies on renaming it to django-admin when they packaged Django, leading to a mismatch between what Django’s documentation said and what the distro had installed.

                And it’s especially without getting into the fact that many Linux distros ship languages like Python because some of the distro’s own tooling is written in those languages, and you want clean isolation between that and your own application code. Which you can’t get if you go the system-package-only route.

                So yes, even in Docker, you should be using language packages and the language’s package manager. It’s not that hard to add a pip upgrade line to your Dockerfile and just have it run on image build.

                1. 2

                  What if there is no distro in existence that packages the precise set of things you want?

                  I feel like this is covered by the second half of the statement:

                  or remove the distro from the equation.

                  That is done - for instance - by using language packages.

              2. 1

                no transitive pinning by default

                This is why I decided to go straight to Poetry. The ride has been bumpy (Python is a chaotic world when it comes to package metadata), but at least now we have a lock file. Rust really hit a home run with Cargo.

              3. 3

                Why does it have to be one or the other?

                I want to be more up to date on Python and its packages than other things like the libc, kernel version, …

              4. 6

                If you check the PyPI files for Fil, you’ll see there are manylinux2010 wheels, and no source packages at all; because building from source is a little tricky, I only distribute compiled packages.

                A side-note but this bothers me a bit. What if the user is on FreeBSD? Or they are on Linux and prefer to compile from source. This is taking the docker route where people download opaque binary blobs and hope it does what it says on the label. Except that the binary is not even sandboxed.

                1. 4

                  I explicitly don’t support FreeBSD, or rather, explicitly only support Linux with glibc and macOS, for two reasons:

                  1. It’s… a tricky project, involving LD_PRELOAD (or the macOS equivalent). Bugs can be like “clean up of thread locals hits wrong code path, leading to segfault”, or “C thread locals on this platform allocate, so I have to use pthread thread locals instead”. Given limited time, I want to focus on fixing issues in platforms people use.
                  2. The audience, people doing data science, data engineering, and scientific computing, are most likely to use Windows, macOS, and Linux. Insofar as I was going to spend more time on a third platform, it would be Windows, and maybe musl for Alpine Linux if someone made a compelling case.

                  My other projects have source distribution as you’d expect, and should work anywhere.

                  1. 1

                    What if the user is on FreeBSD?

                    First, keep in mind this only applies to packages that include extensions written in non-Python languages. But if it’s not a platform that the wheel format can support, then the answer is they install from a source package (.tar.gz) instead of a pre-compiled .whl, and the installation will include a build step and require a compiler toolchain and any necessary libraries to link against. Pure Python packages don’t have this issue.

                    The base problem really is ABI compatibility and stability. The various iterations of the manylinux platform tags have been based around distros which all shipped glibc (which enables built artifacts that can dynamically link against multiple versions), and an extremely conservative set of other shared libraries.

                    Or they are on Linux and prefer to compile from source.

                    They can install from a source package instead of pre-compiled .whl. In fact this is what Alpine users have to do, since Alpine is not a glibc distro and thus not ABI-compatible with manylinux packages. You can force the use of source package with the --no-binary command-line flag to pip.
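A minimal sketch of forcing a source build, assuming pip is available for the running interpreter. The helper name and `example-package` are placeholders; only the `--no-binary` flag comes from the comment above. The function just builds the command so it can be inspected before anything is run.

```python
import sys


def pip_source_install_command(package: str) -> list[str]:
    """Build a pip command that refuses pre-built wheels for one package."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-binary", package,  # applies to this package only, not its dependencies
        package,
    ]


command = pip_source_install_command("example-package")
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` would perform the install, and it will only succeed if a compiler toolchain and the needed libraries are present.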

                    This is taking the docker route where people download opaque binary blobs and hope it does what it says on the label.

                    The alternative is to download source you know you’re never going to manually audit, compile it and hope for the best. And your first complaint was “this platform can’t get pre-built binaries”, now your complaint seems to be that pre-built binaries are bad anyway.

                    (you can do integrity checking of packages at download time, incidentally, and the Python Package Index also supports attaching cryptographic signatures to uploaded packages, but as someone who for many years produced most of Django’s packages and signed every one of them, I can also count on one hand the number of people who I know actually made use of those signatures)
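As a rough illustration of what integrity checking at download time amounts to, here is a sketch that compares a SHA-256 digest against an expected value. The helper name is hypothetical and this is not pip's implementation; pip offers the same idea through its hash-checking mode (`--require-hashes`).

```python
import hashlib


def sha256_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 digest of data equals the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex.strip().lower()


# Stand-in for a downloaded archive and the digest published alongside it.
payload = b"pretend this is a downloaded package archive"
published_digest = hashlib.sha256(payload).hexdigest()

print(sha256_matches(payload, published_digest))
print(sha256_matches(payload + b"tampered", published_digest))
```

A matching digest only shows the file is the one that was published; it says nothing about whether the published code is trustworthy.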

                  2. 4

                    This is also true for macOS. I had an issue with Big Sur and NumPy. An upgrade of pip fixed it.

                    1. 2

                      Just focusing on the Docker part of the story, but this feels like “Docker is a VM” type of mindset, instead of “Docker is an application container”. The whole distro should/must be irrelevant if I want to package some Python tool.

                      1. 1

                        Can you describe some tools or techniques that achieve that?

                        1. 2

                          Away from computer, but I would use Nix, Guix or Pex (never used the last one but my ex manager was claiming that it’s the only true path).

                      2. 2

                        I still find it absolutely crazy that pip installs things to /usr without any warning if you innocently run it as root. Does any other language package manager do this? I’m used to cpan installing everything in /usr/local, which is far more sensible, and reversible. Is there a good reason for installing to /usr that I’m missing?

                        1. 9

                          pip doesn’t install to /usr, it installs to “the place Python is installed”. Which if you’re using Ubuntu’s Python is /usr. If you’re using the Python in the official Docker images, it’ll be /usr/local. If you’re using the Python from a virtualenv isolated environment, it’ll be the directory the virtualenv is in.

                          People have had discussions about disallowing pip usage in distro-installed Python, but it’s more in the realm of vague thoughts than anyone writing code to prevent it.
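To see which of those cases applies to a given interpreter, a small sketch using only the standard library (the function name is made up). It reports the interpreter's prefix and default site-packages directory; pip's actual target can still differ, for example with `--user` or under distro-specific patches.

```python
import sys
import sysconfig


def describe_install_target() -> dict[str, object]:
    """Report where this interpreter keeps third-party packages by default."""
    return {
        "prefix": sys.prefix,
        "site_packages": sysconfig.get_paths()["purelib"],
        # Inside a virtual environment the prefix differs from the base install.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


for key, value in describe_install_target().items():
    print(f"{key}: {value}")
```

Running this before a `pip install` makes it obvious whether the virtualenv is active.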

                          1. 3

                            Interesting, thanks. I still think this is unusual behaviour though, given most Python installations are found in /usr. I almost always install into a virtualenv, but on a few occasions I have forgotten to enable it and accidentally filled up my distribution’s /usr with random Python packages.