There are some official Docker images that build everything from scratch and include SHA/MD5 sums to verify downloads. There are others that just pull in official packages. It’s all crazy random and arbitrary, and frustrating to no end.
Then there are principled, fully reproducible solutions such as dockerTools:
Which is now my preferred method for making Docker images :).
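For context, a minimal dockerTools sketch (the image name and contents are arbitrary; `copyToRoot` assumes a reasonably recent nixpkgs, older releases use `contents` instead):

```nix
# default.nix -- build with `nix-build`, then `docker load < result`
{ pkgs ? import <nixpkgs> { } }:
pkgs.dockerTools.buildImage {
  name = "hello";
  tag = "latest";
  # Everything in the image comes from the Nix store, pinned and reproducible.
  copyToRoot = pkgs.buildEnv {
    name = "image-root";
    paths = [ pkgs.hello ];
  };
  config.Cmd = [ "/bin/hello" ];
}
```

Because the image is assembled from Nix store paths rather than imperative RUN steps, two builds from the same pinned nixpkgs produce the same layers.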
TIL Docker won’t keep temporary files if they’re created and deleted within the same RUN command. I’d always wanted to know whether several RUNs were equivalent to a single one with &&-separated commands.
Yeah, layers are created at the end of the RUN. So if files don’t exist at the end of the RUN, they’re not ever stored.
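A minimal illustration of the difference (the URL and filenames are placeholders):

```dockerfile
FROM debian:bookworm-slim

# Separate RUNs: the tarball is baked into its own layer, and a later
# `rm` in a new RUN can't shrink the image -- the file lives on in the
# earlier layer.
RUN apt-get update && apt-get install -y --no-install-recommends curl
RUN curl -fsSLo /tmp/src.tar.gz https://example.com/src.tar.gz
RUN rm /tmp/src.tar.gz   # too late: previous layer still holds the file

# One &&-joined RUN: the file is deleted before the layer is
# snapshotted, so it never reaches the image at all.
# RUN curl -fsSLo /tmp/src.tar.gz https://example.com/src.tar.gz \
#  && tar -xzf /tmp/src.tar.gz -C /usr/src \
#  && rm /tmp/src.tar.gz
```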
Reading those RUN commands makes me want to weep - the one that builds Python itself is 75 lines long. Surely there has to be a better solution than that?
Well, I’d imagine using a multi-stage Dockerfile would help. One could install all the cruft and build in one stage, then copy only the built binaries and required files to a different - clean - stage. I’m wondering if they keep it this way because the Docker hub itself doesn’t provide a way to specify a target stage to create the image from? (no idea, but the last time I checked they didn’t).
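A sketch of what that two-stage split could look like (the Python version, configure flags, and paths are illustrative, not taken from the official image):

```dockerfile
# Stage 1: build with all the compilers, headers, and other cruft.
FROM debian:bookworm AS build
RUN apt-get update && apt-get install -y --no-install-recommends \
      gcc make libssl-dev zlib1g-dev
COPY Python-3.12.0.tar.xz /usr/src/
RUN cd /usr/src && tar -xJf Python-3.12.0.tar.xz \
 && cd Python-3.12.0 \
 && ./configure --prefix=/opt/python \
 && make -j"$(nproc)" && make install

# Stage 2: a clean image that only receives the built result.
# Nothing from the build stage's layers survives except what we COPY.
FROM debian:bookworm-slim
COPY --from=build /opt/python /opt/python
ENV PATH=/opt/python/bin:$PATH
```

With `docker build --target build .` you can also stop at the first stage for debugging, which is the target-stage selection mentioned above.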
The commands are combined into a single RUN command to avoid Docker caching the build artefacts in that layer. There looks to be a lot of effort here to limit the layer-caching behaviour and clean up build components.
I think this is an excellent candidate for a multi-stage build (https://docs.docker.com/develop/develop-images/multistage-build/), as @dguaraglia mentioned: all build artefacts, including compilers and build dependencies, get jettisoned as soon as the binary build is completed, with the resultant Python binaries and associated libraries moved into a fresh container. With this approach, the layer caching from the build/compile steps isn’t an issue because the whole build stage is discarded. There may be reasons why this approach wasn’t used, though.
I also believe there is value in utilising the OS package managers for this. The constant driver for a lot of the source builds in containers seems to be the desire to access bleeding-edge versions that aren’t available in the distribution packages of the base OS used for the image. In this example, the binary build process could be moved to a .deb package build of the latest source in an earlier CI step, with the result stored/published. These .deb packages could then be installed using apt, in the same apt-get step as the base OS components. The additional benefit is that the resultant binary .deb can be used consistently across multiple containers, or even on legacy VMs, without requiring a rebuild.
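As a sketch, a CI-built .deb (the package name and filename here are hypothetical) could then be consumed like any other apt package:

```dockerfile
FROM debian:bookworm-slim
# python3-custom_3.12.0-1_amd64.deb stands in for a package produced
# and published by an earlier CI step.
COPY python3-custom_3.12.0-1_amd64.deb /tmp/
# apt-get can install a local .deb by path (apt >= 1.1), resolving its
# dependencies from the configured repositories in the same step.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      /tmp/python3-custom_3.12.0-1_amd64.deb \
 && rm -rf /var/lib/apt/lists/* /tmp/*.deb
```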
Indeed, I can understand why the build has been done like that, and a lot of effort has certainly gone into cleaning up the built artefacts. Maintaining it must be a nightmare, though I’m guessing it probably doesn’t change hugely from Python release to Python release.
I agree with your sentiments about using OS package managers. Building, for example, a .deb and having that made available for containerised/non-container use would be much easier to maintain, IMHO. Building Debian packages with all of the associated tooling is a lot easier than stringing everything together in a single RUN command.