1.  

    I recently reread “Programming Pearls” and it’s definitely still worth reading, almost 40 years after many of its examples were written.

    1.  

      I think it’s interesting how many “oldie but goodie” books are on this list. And rightly so. I had Code by Petzold sitting in a guest bathroom, and a guest came out holding it and saying how good it is. And the book is almost 20 years old.

    1. 5

      I sympathize with this perspective very much, and I have a hard time working on seemingly-interesting problems when I don’t care about the outcomes or users.

      But—there are plenty of easy, important problems that involve building the same web app over and over again, and, I just can’t do that. I deal really badly with boredom, and I get bored really easily when I’m not learning.

      So at least part of my personal career explorations have been about “how can I both work towards worthwhile goals, and not end up bored out of my skull”.

      1. 1

        The most interesting thing I’m working on is my memory profiler for Python data processing applications (https://pythonspeed.com/products/filmemoryprofiler/), which is a combination of Rust and Python APIs. I’ve gotten it running at minimally acceptable performance (code runs at ~50% of normal speed), and just added threading support and a first pass of end-to-end tests.

        Probably next step is spending some time on marketing and outreach trying to find some interested testers.

          1. 1

            Yeah, I need to add a section pointing out the empirical (lots of OpenBLAS bugs) and theoretical reasons that suggest this is a general pattern.

          1. 8

            The build time and research arguments are valid and definitely good to keep in mind. However, you can make smaller Alpine images by following similar patterns. The Dockerfiles in this article don’t use --no-cache with apk, and they leave the development files around after the build is done.

            The following image should build something similar, though, as mentioned, it takes a while to build. I’ll update this post with the final image size when it’s done.

            FROM python:3.8-alpine
            
            ENV PYTHONUNBUFFERED=1 \
              PYTHONDONTWRITEBYTECODE=1 \
              PYTHONHASHSEED=random \
              PIP_NO_CACHE_DIR=off \
              PIP_DISABLE_PIP_VERSION_CHECK=on
            
            RUN apk add --no-cache freetype libpng openblas
            
            RUN apk add --no-cache --virtual .build-deps gcc build-base freetype-dev libpng-dev openblas-dev \
                && pip install matplotlib pandas \
                && apk del --no-cache .build-deps
            

            EDIT: strangely, the build also wasn’t slowed down by untarring matplotlib-3.1.2.tar.gz, which is a 40MB file with lots of small files. That’s not to say the build was fast, but it’s worth noting.

            In any case, the final image size as reported by docker image ls was 469MB.

            1. 3

              This does result in a smaller image, but it means every time you change your Python or APK dependencies you need to reinstall both, without relying on Docker layer caching.

              The image size honestly isn’t a big deal, but the build time is brutal.
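
              The usual middle ground, for what it’s worth, is to split the system packages and the Python packages into separate RUN steps, so the apk layer can come from cache when only the Python dependencies change. A rough sketch (the package list is just carried over from the example above):

              ```dockerfile
              FROM python:3.8-alpine

              # Layer 1: system packages -- cached until this line changes.
              RUN apk add --no-cache freetype libpng openblas \
                  gcc build-base freetype-dev libpng-dev openblas-dev

              # Layer 2: Python packages -- rebuilt only when requirements.txt changes.
              COPY requirements.txt .
              RUN pip install -r requirements.txt
              ```

              The trade-off is that the -dev packages stay in the final image (so it’s bigger), but you get layer caching back.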

              1. 3

                So, I finally got back the final sizes and it actually surprised me. 469MB for the alpine version I posted. Much better than your 851MB, but also larger than python:3.8-slim. Maybe it’s leaving around the source files somewhere? Have you made sure that your python:3.8-slim version can actually run code using matplotlib or pandas? I’d assume that they’re missing the actual libraries needed to run the code (essentially the non-dev versions of what you had to install with alpine).

                At this point, I’m not really willing to take more time to investigate this - you’ve sold me. I’m planning on moving all my images to the slim variant if I can.

                1. 3

                  I’ve wasted a lot of time on this as well, and I can only recommend you do that. So much less time spent in Dockerfiles, and more in dot-py-ones…

                  I ended up with: a Debian-based base image with the build dependencies installed and often-used dependencies in a pre-created virtualenv; an Ansible-based toolchain to quickly install any additional system and virtualenv Python-package dependencies (which will most likely be overkill for most); and a script that is called as the last step in every child Dockerfile and cleans up most of the junk that is only used at build time. You could probably also do a lot of that with a multi-stage Docker build.

                  Anyway, that makes for fairly quick build times, and quite small images. The base image might not be that small, but since it doesn’t change that often, and all your other images depend on it and can re-use the cached layer, I only have to pay that cost once (per deployed project), and when it updates…
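
                  For the multi-stage variant, the rough shape would be something like this (an untested sketch; the package list is just an example):

                  ```dockerfile
                  # Build stage: has the compilers and headers, produces a virtualenv.
                  FROM python:3.8 AS builder
                  RUN python -m venv /venv \
                      && /venv/bin/pip install pandas matplotlib

                  # Runtime stage: copy only the virtualenv; the build junk stays behind.
                  FROM python:3.8-slim
                  COPY --from=builder /venv /venv
                  ENV PATH="/venv/bin:$PATH"
                  ```

                  This works because both stages share the same Debian release and Python version, so the virtualenv is portable between them.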

            1. 25

              This issue is not specific to Alpine; wheels just suck with any less-common ABI/platform combination. Try it with Debian on ARM and you will run into the same issue.

              1. 12

                Yeah, it’s not a fair comparison, because the OP is using a package with native extensions and therefore has to pull in gcc and compile those CPython extensions. A better comparison would use something that had an actual Python package available in Alpine.

                Or, use a program you wrote whose dependencies are all pure-python without needing to link to outside C resources.

                1. 15

                  It might not be fair, but it is relevant. At least in my view. This article mirrors my experience, and I came to the same conclusion. It’s quite rare that my projects don’t include at least one library that uses C, and rare, too, that no wheels are available for it on a ‘standard’ platform. I never had use for any of the Python packages that were included in Alpine (nor the ones that come with Debian, for that matter; I usually try to pin my dependency versions to the exact ones I use in development).

                  I’ve experienced a lot of pain trying to use Alpine with Python, and I don’t think it’s worth it for anything other than small, simple projects. Even for building Python-based containers for functions in OpenFaaS (where you’d really like to have small images), I ended up with a (relatively heavy) base image with most of my common requirements included, and used that with all my function containers. Which, in the end, is an acceptable solution, since you only have to pull the base image every now and then, when it updates. The base image weighs around 350MB, but every image derived from it was only a few kB (or MB, in some cases).

                  Anyway, if you can fault the article for anything, it might be for not making it clearer when its conclusions are relevant. But I think most people will understand that, and won’t expect a ‘fair’ comparison under clean-room circumstances. As I said, it might not be fair, but it’s good practical advice nonetheless, in a lot of cases.

                  1. 2

                    Exactly my feeling. The original post and liwakura’s comment are both fascinating when I don’t have this problem, but I’d wager if someone is searching for “Alpine Docker Python Slow” it’s not going to be for the other edge cases.

                2. 12

                  The difference is that no one recommends ARM as a way to make smaller, faster builds. It’s understood you’ll have more work because it’s more obscure. But people do recommend Alpine on amd64 for faster, smaller builds, and that wastes lots and lots of time for people who don’t know any better.

                1. 4

                  If you’re using Alpine Linux you need to compile all the C code in every Python package that you use.

                  Is this misleading or am I missing something here? There is an Alpine Linux package for pandas, and I’d guess for most popular Python packages. Is there a reason one would prefer to use pip nonetheless?

                  1. 14

                    In my experience of production Python usage, in many organizations over the past 15+ years, very few teams use system packages. Upstream packages from PyPI (or Conda) are much more common, because they’re much more frequently updated and much more complete.

                    Where do you see an Alpine package for matplotlib or pandas, BTW? I can’t find them. I can find NumPy, and it’s 1.17.4 in Alpine and 1.18.1 on PyPI.

                    1. 4

                      Their package search isn’t super discoverable. I randomly got a pop-up suggesting the use of * wildcards and managed to find them this way:

                      https://pkgs.alpinelinux.org/packages?name=*pandas*&branch=edge

                      https://pkgs.alpinelinux.org/packages?name=*matplotlib*&branch=edge

                      It looks like they’re newer and only available in edge at the moment, though.

                    2. 5

                      Perhaps they need to peg their dependencies to a specific version, either for reproducible builds or for ease of maintenance.

                    1. 1

                      My experience with BuildKit, which is the backend for a bunch of these tools, is that there are a bunch of edge cases that aren’t handled well. It’s gotten somewhat better every time I’ve tried, but it still has different behavior than normal Docker builds.

                      1. 2

                        Why do people put up with Conda instead of using mature, community-supported tools like Nix? It seems like all of this trouble with using Conda here stems from poor ergonomics inside Dockerfiles. Compare and contrast:

                        FROM nixos/nix
                        RUN nix-env -iA python27Packages.flask
                        

                        For serious work, one might want to use a shell.nix file, but otherwise this is about it. As far as I can tell, Conda only makes things more complex and confusing.

                        1. 1

                          If you’re installing Flask, no reason to use Conda. If you’re doing scientific computing or data science, Conda Forge has a massive set of pre-compiled libraries, and is well worth using.

                          That is, the use case isn’t the Conda packaging tool, the use case is the Conda Forge package channel.

                        1. 3

                          Good explanation, but I feel it’s a bit beginner-level. Which is fine, this is good advice.

                          I’d be interested if you wrote up some more advanced tricks to save memory in Python.

                          1. 2

                            Thanks! There are other articles which go into more library-specific details, as part of an ongoing series of articles: https://pythonspeed.com/datascience/

                          1. 1

                            Good techniques in practice, but I was hoping for some kind of entropy-coding implementation for NumPy that would allocate the bits used for a symbol based on its frequency. This article does bring that up as a possible project.

                            1. 1

                              You might be interested in https://caterva.readthedocs.io/en/latest/ + https://github.com/Blosc/cat4py

                              It’s not NumPy-compatible, alas, but it’s an in-memory n-dimensional array that can use less memory and supports various operations you’d want from NumPy.

                              See https://www.youtube.com/watch?v=lP7A7lpzD18 for talk about it.

                            1. 1

                              I’ve always wondered why docker image layers are the result of a command. I think it would be nicer if you, for example, gave the builder an executable and it put each shared library on a new layer. Or you give it a Python script and it puts the Python interpreter on a layer, and each pip dependency on a new layer. That way all of your layers have a higher chance of being shared.

                              1. 2

                                I think Nix might let you do something like this.

                              1. 4

                                If the goal is to reduce bloat and install only what you need, then why not use alpine as a base instead of debian/ubuntu or centos?

                                1. 3

                                  For Python specifically:

                                  1. musl is subtly incompatible with glibc in a bunch of ways. I’ve encountered this in real world, others have as well. These bugs do get fixed, but using musl risks obscure bugs.

                                  2. Python wheels (pre-compiled binary packages) don’t work with musl. So whereas on glibc-based distros many packages can simply be downloaded and installed, on Alpine they need to be compiled.

                                  Long version: https://pythonspeed.com/articles/base-image-python-docker-images/

                                  For other languages these concerns may be less applicable, e.g. Go tends not to use libc much, opting to do syscalls directly.

                                  1. 1

                                    The base layer that is bigger with Debian compared to Alpine is shared anyway, and you avoid any compatibility issues between glibc and musl, for example.

                                  1. 3

                                    Why put the commands in a separate file, copy it to the container, and run it? In the Dockerfile you can write:

                                    RUN set -euo pipefail && \
                                        export DEBIAN_FRONTEND=noninteractive && \
                                        apt-get update && \
                                        apt-get -y upgrade && \
                                        apt-get -y install --no-install-recommends syslog-ng && \
                                        apt-get clean && \
                                        rm -rf /var/lib/apt/lists/*
                                    

                                    Which saves you the extra layer that the additional copy step would add.

                                    I’ve been under the impression that base images are respun often enough that there is really no point to running package upgrades on build. Is this not the case?

                                    Also, if you’re REALLY concerned about overhead from layers, you can squash your image into a single layer.

                                    1. 3

                                      If you have set -e, your && becomes redundant and can be replaced with ;.
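
                                      A quick way to see this (my own demonstration, not from the parent comment):

                                      ```shell
                                      # With `set -e`, a failing command aborts the script even when
                                      # commands are separated by `;` or newlines, so the `&&` chain
                                      # adds nothing for error handling.
                                      sh -c 'set -e; false; echo "not reached"' || echo "aborted as expected"
                                      ```

                                      The inner shell exits at false, so this prints “aborted as expected”.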

                                      1. 2

                                        I’d say it’s best to stick with the && idiom, lest someone come along and copy the ; without the set. Shell tends to propagate via clipboard, particularly in Dockerfiles.

                                        1. 1

                                          I strongly disagree. If the premise of your code is that people are going to copy/paste/break it anyway, you have a lot of other struggles too.

                                          1. 1

                                            Well yeah, we’re starting from a baseline of using shell. Of course there are problems! My perspective is based on supporting Docker in a large organization. I think that it is better to set folks up for (partial) success based on what they will likely do in practice, rather than pretending that domain specialists are going to learn shell properly.

                                      2. 2

                                        The centos:8 image hasn’t been updated for two months. To be fair, none of the updates appear to be security fixes, but I would not rely on base image being up-to-date.

                                        1. 2

                                          Wow, that’s very strange. Every centos-based image out there having a slightly different layer containing the same updates seems like a missed opportunity for space and bandwidth savings because of layer re-use.

                                      1. 3

                                        The biology field uses Java a lot, specifically ImageJ and Fiji.

                                        1. 2

                                          I teach a physics lab that uses ImageJ/Fiji :) The students use it to measure the positions of silica beads in different solutions to determine the diffusion constant of the beads in those solutions.

                                        1. 1

                                          There are studies you can find with a tiny bit of googling, but you’d have to dig deep to find out if the methodology is any good.

                                          My personal anecdata: I translated an algorithm I knew well from Python into math (not formal math, just enough for a non-programmer scientist to read it, with me helping with explanations). I asked someone who understood the code to review the math, and they found no problems.

                                          In fact, IIRC I had 2 major transcription errors, in just 10 lines of math.

                                          1. 1

                                            If you enjoyed Turn the Ship Around!, then you’ll “love” the Taylorist take The Goal and its modern agile retelling The Phoenix Project.

                                            1. 2

                                              Yeah, I loved The Goal. Didn’t think much of The Phoenix Project, having already read The Goal.

                                              1. 3

                                                Sorry for being obtuse, but I’m confused by the quotes around “love” and this reply. Are these worth reading (either for good reasons, or as anti-examples)?

                                                1. 3

                                                  I think those were not scare quotes, but quotes for emphasis. See also http://www.unnecessaryquotes.com/

                                              2. 1

                                                Those are fiction, though; this book is not, so it seems more credible.

                                              1. 5

                                                panic() is the equivalent of the exception mechanism many languages use to great effect. Idiomatically it’s a last resort, but it’s a superior mechanism in many ways (e.g. tracebacks for debugging, instead of Go’s idiomatic ‘here’s an error message, good luck finding where it came from’ default.)

                                                1. 5

                                                  Go’s idiomatic ‘here’s an error message, good luck finding where it came from’

                                                  I think the biggest problem here is that too often if err != nil { return err } is used mindlessly. You then run into things like open foo: no such file or directory, which is indeed pretty worthless. Even just return fmt.Errorf("pkg.funcName: %s", err) is a vast improvement (although there are better ways, such as github.com/pkg/errors or the new Go 1.13 error system).

                                                  I actually included return err in a draft of this article, but decided to remove it as it’s not really a “feature” and how to effectively deal with errors in Go is probably worth an article on its own (if one doesn’t exist yet).
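
                                                  For what it’s worth, a minimal sketch of the decorated version (my own example; the config path is invented):

                                                  ```go
                                                  package main

                                                  import (
                                                      "errors"
                                                      "fmt"
                                                      "io/fs"
                                                      "os"
                                                  )

                                                  func readConfig(path string) error {
                                                      _, err := os.ReadFile(path)
                                                      if err != nil {
                                                          // Decorate instead of a bare `return err`:
                                                          // the caller sees where the failure happened.
                                                          return fmt.Errorf("reading config %s: %w", path, err)
                                                      }
                                                      return nil
                                                  }

                                                  func main() {
                                                      err := readConfig("/no/such/config.toml")
                                                      fmt.Println(err)
                                                      // %w keeps the chain intact, so the root cause is still matchable:
                                                      fmt.Println("is fs.ErrNotExist:", errors.Is(err, fs.ErrNotExist))
                                                  }
                                                  ```

                                                  Because the wrap uses %w rather than %s, callers keep both the human-readable context and the programmatic check.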

                                                  1. 6

                                                    It’s pretty straightforward to decorate an error so you know where it’s coming from. The most idiomatic way to pass on an error in Go code is to decorate it, not pass it unmodified. You are supposed to handle errors you receive, after all.

                                                    if err != nil {
                                                        return fmt.Errorf("%s: when doing whatever", err)
                                                    }
                                                    

                                                    not the common misassumption

                                                    if err != nil {
                                                        return err
                                                    }
                                                    

                                                    In fact, the 1.13 release of Go formally adds error chains via a new Errorf directive, %w, which formalises wrapping error values in a manner similar to a few earlier library approaches, so you can interrogate the chain if you want to use it in logic (rather than string matching).

                                                    1. 5

                                                      It’s unfortunate IMO that interrogating errors using logic in Go amounts to performing a type assertion, which, while idiomatic and cheap, is something I think a lot of programmers coming from other languages will have to overcome their discomfort with. Errors as values is a great idea, but I personally find it to be a frustratingly incomplete mechanism without sum types and pattern matching, the absence of which I think is partly to blame for careless anti-patterns like return err.

                                                      1. 3

                                                        You can now use errors.Is to test for a specific error (and errors.As for an error type), and error wrapping was added to fmt.Errorf. Same mechanics underneath, but easier to use. (You could also just do a switch with a default case.)
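
                                                        A small sketch of the errors.As side of that, for a custom error type (the type here is invented for illustration):

                                                        ```go
                                                        package main

                                                        import (
                                                            "errors"
                                                            "fmt"
                                                        )

                                                        // NotFoundError is a made-up concrete error type.
                                                        type NotFoundError struct{ Key string }

                                                        func (e *NotFoundError) Error() string { return "not found: " + e.Key }

                                                        func lookup(key string) error {
                                                            // Wrap with %w so the decoration doesn't hide the concrete type.
                                                            return fmt.Errorf("lookup failed: %w", &NotFoundError{Key: key})
                                                        }

                                                        func main() {
                                                            err := lookup("user:42")
                                                            var nf *NotFoundError
                                                            // errors.As walks the wrap chain, replacing a bare type assertion.
                                                            if errors.As(err, &nf) {
                                                                fmt.Println("missing key:", nf.Key)
                                                            }
                                                        }
                                                        ```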

                                                      2. 4

                                                        I guess you mean

                                                        if err != nil {
                                                            return fmt.Errorf("error doing whatever: %w", err)
                                                        }
                                                        

                                                        but yes point taken :)

                                                        1. 3

                                                          Sure, but in other languages you don’t have to do all this extra work, you just get good tracebacks for free.

                                                          1. 1

                                                            I greatly prefer the pithy, domain-oriented error decoration that you get with this scheme to the verbose, obtuse set of files and line numbers that you get with stack traces.

                                                        2. 1

                                                          I built a basic Common-Lisp-style condition system atop Go’s panic/defer/recover. It is simple and lacks a lot of the syntactic advantages of Lisp, and it is definitely not ready for prime time, at all, but I think maybe there’s a useful core in there.

                                                          But seriously, it’s a hack.

                                                        1. 4

                                                          SQLite is fast, but for some reason at one past job we found the Python bindings oddly, terribly slow. I recall a colleague who looked into it, and I vaguely recall something about weird Python C-library wrapping code using strange default configuration parameters. Other languages (even ostensibly slower ones overall) we used at the time had much faster bindings.

                                                          Been a while though, maybe the situation is different now.

                                                          1. 4

                                                            There’s also apsw, which takes a different approach to wrapping SQLite, and claims to be faster: https://rogerbinns.github.io/apsw/

                                                            It suffers from the author’s decision to make it harder to install, though: https://rogerbinns.github.io/apsw/download.html#easy-install-pip-pypi

                                                          1. 1

                                                            Or offload to tools more ubiquitous than cloud-y things: RDBMSs that have excellent strategies for dealing with this and many other problems already baked in, and a language where you can abstract the ‘what’ of your problem without having to worry about these underlying ‘how’ issues. Sure, there’s some learning curve if you don’t already know one, but you probably should, and then you get all those other features too.

                                                            This post makes me a bit nervous because it reminds me a bit of Pounding A Nail: Old Shoe or Glass Bottle?. Sure, you could just start processing more data than you have RAM by directly implementing your own strategies in Python, but it feels like a really bad idea to re-implement those strategies in a simplistic way, and the answer feels to me like it should be ‘your question is wrong’. (I’m using ‘feels’ on purpose because I’m not putting any effort right now into being more precise about what is an intuitive sense of problem solving.)

                                                            1. 1

                                                              Yeah, one of my followup articles is about how you can use a SQL database (SQLite works very nicely) as an indexable storage for Pandas.
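
                                                              A minimal sketch of what that can look like (my own example; assumes pandas is installed, and the table and column names are made up):

                                                              ```python
                                                              import sqlite3

                                                              import pandas as pd

                                                              # Use a file path instead of ":memory:" for data that doesn't fit in RAM.
                                                              conn = sqlite3.connect(":memory:")

                                                              df = pd.DataFrame({"user_id": [1, 2, 3, 4], "score": [10, 40, 25, 60]})
                                                              df.to_sql("scores", conn, index=False)

                                                              # An index means lookups don't have to scan the whole table.
                                                              conn.execute("CREATE INDEX idx_user ON scores (user_id)")

                                                              # Pull back only the rows you need, instead of loading everything.
                                                              subset = pd.read_sql("SELECT * FROM scores WHERE user_id = ?", conn, params=(3,))
                                                              print(subset["score"].iloc[0])  # 25
                                                              ```

                                                              The database does the indexing and filtering on disk, and pandas only ever sees the slice you asked for.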