1. 50

  2. 35

    Unlike say, VMs, containers have a minimal performance hit and overhead

    Ugh. I really hate it when people say things like that because it’s both wrong and a domain error:

    A container is a packaging format, a VM is an isolation mechanism. Containers can be deployed on VMs, or they can be deployed with shared-kernel isolation mechanisms (such as FreeBSD Jails, Solaris Zones, or Linux cgroups, namespaces, seccomp-bpf and wishful thinking), or with hybrids such as gVisor.

    Whether a VM or a shared-kernel system has more performance overhead is debatable. For example, FreeBSD Jails now support having per-jail copies of the entire network stack because using RSS in the hardware to route packets to a completely independent instance of the network stack gives better performance and scalability than sharing state owned by different jails in the same kernel data structures. Modern container-focused VM systems do aggressive page sharing and so have very little memory overhead and even without that the kernel is pretty tiny in comparison to the rest of a typical container-deployed software stack.

    Running everything as root. We never let your code run as root before, why is it now suddenly a good idea?

    This depends entirely on your threat model. We don’t run things as root because we have multiple security contexts and we want to respect the principle of least privilege. With containerised deployments, each container is a separate security context and already runs with lower privileges than the rest of the system. If your isolation mechanism works properly then the only reason to run as a non-root user in a container is if you’re running different programs in different security contexts within the container. If everything in your container is allowed to modify all state owned by the container then there’s no reason to not run it all as root. If you do have multiple security contexts inside a container then you need to think about why they’re not separate containers because now you’re in a world where you’re managing two different mechanisms for isolating different security contexts.

    1. 21

      I think you mean an image is a packaging format, whereas a container is an instance of a jail made up of various shared-kernel isolation mechanisms (including the wishful thinking) as you mentioned.

      Yes, the terminology is unfortunate. My reimplementation of Docker calls it an “instance” rather than “container”.

      1. 3

        yeah, the “never run as root in your container” thing kills me

        1. 9

          IIUC that’s all because the way Linux isolates users (with the whole UID remapping into a flat range thing) is weird and there’s way too many security bugs related to that.

          1. 1

            I don’t know if this is still true, but part of where this advice comes from is that it used to be that running as root meant running as root on the host (i.e. the mechanism you’re talking about was not used by Docker). In theory this was “fine” because you could only get at stuff on the container environment, but it meant that if there was a container breakout exploit you were unconfined root on the host. So running as non-root in the container meant that you’d have to pair a container breakout with a privilege escalation bug to get that kind of access.

            In other words: the isolation mechanism did not work properly.
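
            To check which of the two situations described above applies, one rough sketch is to look at the UID map of the current user namespace. Each line of /proc/self/uid_map has the form "first ID inside, first ID outside, count". The function name and the optional path argument are made up for illustration, and this only inspects the mapping; it says nothing about how a particular runtime was configured.

```shell
#!/bin/sh
# Succeeds if UID 0 inside this user namespace maps to UID 0 outside it.
# The optional argument is a path to a uid_map-style file (used for testing).
root_is_host_root() {
    map_file="${1:-/proc/self/uid_map}"
    # Find the line whose inside range starts at 0 and print the outside ID.
    outside=$(awk '$1 == 0 { print $2; exit }' "$map_file")
    [ "$outside" = "0" ]
}

# Only meaningful on Linux, where /proc/self/uid_map exists.
if [ -r /proc/self/uid_map ]; then
    if root_is_host_root; then
        echo "UID 0 here is UID 0 outside (no remapping)"
    else
        echo "UID 0 here is remapped to another UID outside (or unmapped)"
    fi
fi
```

            If the first message shows up inside a container, a breakout would land as real root, which is the case the non-root advice was guarding against.
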

        2. 1

          That’s interesting. I haven’t actually bench tested the two in years. I’ll have to revisit it.

          1. 1

            You might want to have a look at NVIDIA’s enroot or Singularity for some lower-overhead alternatives. I’ve briefly looked at enroot after I saw the talk about Distributed HPC Applications with Unprivileged Containers at FOSDEM 2020, but sadly haven’t gotten a chance to use them at work yet.

            1. 2

              Have you tried https://github.com/weaveworks/ignite to just run a docker image in a VM instead of a container?

              1. 1

                No, haven’t stumbled across that before. Thanks, that looks very interesting!

                1. 1

                  That seems interesting. I wonder what benefit it provides compared to the shared-kernel isolation mechanism used by docker run <container>. Do I get stronger isolation, performance boost, or something else?

                  1. 2

                    I think there are always tradeoffs, but a VM may be easier to reason about than a container still. It’s a level of abstraction that you can apply thinking about a single computer to.

                    I do think that you get stronger isolation guarantees too. You can also more easily upgrade things, so if you have a kernel vulnerability that affects one of the containers, you can reload just that one. There are many issues that affect hypervisors only or guests only.

                    At launch we used per-customer EC2 instances to provide strong security and isolation between customers. As Lambda grew, we saw the need for technology to provide a highly secure, flexible, and efficient runtime environment for services like Lambda and Fargate. Using our experience building isolated EC2 instances with hardware virtualization technology, we started an effort to build a VMM that was tailored to run serverless functions and integrate with container ecosystems.

                    It also seems like a compromise between the user interface for a developer and an operations team’s deep expertise. If you have invested 15 years in virtualization expertise, maybe you stick with that for ops and present a container user interface to devs?

                    For me, one of the big things about containers was not requiring special hardware to virtualize at full speed, plus automatic memory allocation. You’re never stuck with an 8GB VM you have to shut down to prevent your web browser from being swapped out when you’re trying to open Stack Overflow. You know 8GB was suggested, but you also see that only 512MB is actually being used.

                    Most hardware these days has hardware acceleration for virtualization, and Firecracker supports the virtio memory ballooning driver as of Dec 2020, so many of the reasons I would have used containers in 2013 are moot.

                    As an ops person myself, I find containers to often have an impedance mismatch with software defaults. Why show a container that is limited to two cores that it has 64 cores? Haproxy will deadlock itself waiting for all 64 connection threads to get scheduled on those two cores. You look in there and you’re like ‘oh, how do I hardcode the number of threads in haproxy now to two…’. It’s trivial with haproxy, but it’s not default. How many other things do you know of that use nproc+1 and will get tripped up in a container? How many different ways do you have to configure this for different runtimes and languages?
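
                    As a rough illustration of working around the core-count mismatch, here is a sketch that derives a thread count from the cgroup v2 CPU quota instead of the visible CPU count. It assumes cgroup v2 mounted at /sys/fs/cgroup, where cpu.max holds "quota period" or "max period"; the function name and the optional path argument are made up for illustration, and cgroup v1 hosts would need a different file.

```shell
#!/bin/sh
# Print how many CPUs' worth of time this cgroup may use.
# The optional argument is a path to a cpu.max-style file (used for testing).
effective_cpus() {
    cpu_max="${1:-/sys/fs/cgroup/cpu.max}"
    if [ -r "$cpu_max" ]; then
        quota=""; period=""
        read -r quota period < "$cpu_max" || true
        if [ -n "$quota" ] && [ "$quota" != "max" ] && [ "${period:-0}" -gt 0 ]; then
            # Round up: a quota of 1.5 CPUs is treated as two.
            echo $(( (quota + period - 1) / period ))
            return 0
        fi
    fi
    # No quota (or no cgroup v2 file): fall back to the visible CPU count.
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

echo "threads to configure: $(effective_cpus)"
```

                    The result still has to be passed to each program through whatever thread-count setting it offers, which is exactly the per-runtime busywork complained about above.
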

            2. 1

              Containers can be deployed on VMs

              OT because I agree with everything you said, but I have yet to find a satisfying non-enterprise solution (i.e. one not requiring a million other network services and solutions).

              Once upon a time, I was sure VMware was going to add “deploy container as VM instance” to ESXi but then they instead released Photon and made it clear containers would never be first-class residents on ESXi but would rather require a (non-invisible) host VM in a one-to-many mapping.

                1. 1

                  We use this at my work (Sourcegraph) for running semi-arbitrary code (language indexers, etc.), it works really well.

            3. 23

              The ‘correct’ shell example given in this post isn’t.

              while IFS= read -r -d '' file; do
                some command "$file"
              done < <(find . -type f -name '*.csv' -print0)

              This loop is much more complex than it needs to be and will search all subdirs for files ending in ‘.csv’.

              You don’t need find(1) or to change the value of IFS to iterate over files in a dir.

              for f in *.csv ; do
                stat "${f}"
              done

              Gets you everything you need. If you really need to find files in subdirs, find(1) has the -exec flag.

              find . -type f -name '*.csv' -exec stat \{} \;
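
              One caveat with the plain glob loop: in POSIX shells, a pattern that matches nothing is left as the literal string, so the loop body runs once on a file that does not exist. A minimal sketch of a guard for that case (the function name list_csv is made up for illustration):

```shell
#!/bin/sh
# Print the name of each .csv file directly inside the given directory.
list_csv() {
    for f in "$1"/*.csv; do
        # With no matches the pattern stays literal; skip that non-file.
        [ -e "$f" ] || continue
        # Strip the directory prefix and print just the file name.
        printf '%s\n' "${f##*/}"
    done
}

list_csv .
```

              Quoting "$f" keeps names with spaces intact, so there is still no need to touch IFS.
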
              1. 12

                Docker has some suboptimal design choices (that clearly hasn’t stopped its popularity). Yes, the build process, with the line-continuation-ridden Dockerfile format that makes it impossible to comment what each thing is there for, with implicit transactions that bury temporary files in the layers, and the layers themselves, that behave nothing like an ideal dependency tree, is one thing, but that’s fixable. What makes me sad are the fundamental design choices that can’t be satisfyingly fixed by adding stuff on top, such as being a security hole by design and containers being stateful and writable, and therefore inefficient to share between processes and something you have to delete afterwards.

                What is a more ideal way to build an image? For a start, run a shellscript in a container and save it. The best part is that you don’t need to copy the resources into the container, because you can mount it as a readonly volume. You need to implement rebuilding logic yourself, though, but you can, and it will be better. Need layers? Just build one upon another. Even better, use a proper build system, that treats dependencies as a tree, and then make an image out of it.

                As for reimplementing Docker the right way from the ground up, there is fortunately no lack of alternatives these days. My attempt, selfdock is just one.

                1. 8

                  As for reimplementing Docker the right way from the ground up, there is fortunately no lack of alternatives these days.

                  What about nixery?

                  Especially their idea on “think graphs not layers” is quite an improvement over previous projects.

                  1. 4

                    I spent some time talking about the optimisations Nixery does for layers in my talk about it (bit about layers starts at around 13:30).

                    An interesting constraint we had to work with was the restriction on the maximum number of layers permitted in an OCI image (which, as I understand it, is an implementation artefact from before) and there’s a public version of the design doc we wrote for this on my blog.

                    In theory an optimal implementation is possible without that layer restriction.

                    1. 2

                      Hey! Thanks for sharing and also thank you for your work, true source of inspiration :)

                  2. 2

                    My attempt, selfdock is just one.

                    This looks neat. But your README left me craving for examples.

                    Say I want to run or distribute my python app on top of this. Could you provide an example of the equivalent to a docker file?

                    1. 4

                      Thanks for asking! The idea is that instead of building, distributing and using an image, you build, distribute and use a root filesystem, or part of it (it can of course run out of the host’s root filesystem), and you do this however you want (this isn’t the big point, however).

                      To start with something like a base image, you can undocker a docker image:

                      docker pull python:3.9.7-slim
                      sudo mkdir -p /opt/os/myroot
                      docker save python:3.9.7-slim | sudo undocker -i -o /opt/os/myroot

                      Now, you have a root filesystem. To run a shell in it:

                      selfdock --rootfs /opt/os/myroot run bash

                      Now, you are in a container. If you try to modify the root filesystem from a container, it’s readonly – that’s a feature!

                      I have no name!@host:/$ touch /touchme
                      touch: cannot touch '/touchme': Read-only file system

                      When you exit this process, the reason for this feature starts to show itself: The process was the container, so when you exit it, it’s gone – there is no cleanup. Zero bytes written to disk. Writing is what volumes are for.

                      To build something into this filesystem, replace run with build, which gives you write access. The idea is as outlined above, to mount your resources readonly and running whatever:

                      selfdock --rootfs /opt/os/myroot --map $PWD /mnt build pip install -r /mnt/requirements.txt

                      … except that if it modifies files owned by root, you need to be root. As the name implies, selfdock doesn’t just give you root.

                      Then, you can run your thing:

                      selfdock --rootfs /opt/os/myroot --vol $PWD/data /home python app.py

                      Note that we didn’t specify user- and group ID to run as – it just does (anything else would be a security hole). This is important for file permissions, especially when giving write access to a volume as above. But since the root filesystem is readonly, you can run thousands of instances out of it, and the overhead isn’t much more than spawning a process. The big point here is not in the features, but in doing things correctly.

                      1. 2

                        That sounds very similar to what systemd-nspawn offers. Once you deal with unpacked root filesystems it may be another solution to look at.

                        1. 1

                          So, it has even more resemblance to chroot but with more focus on isolation and control of resource usage, IIUC.

                          A bit of feedback, if I may. The whole requirement of carrying files around will put people off. Including myself. I refrain from using Docker because of the gigantic storage footprint any simple thing requires. But the reason it is so popular is that it abstracts away the binary blobs. People run docker commands and let it do its thing; they don’t need to fiddle with or even know about the images which are stored on their hard drive. It was distributed with Docker Hub connectivity by default. So people only worry about their Dockerfiles and refer to images as a URL or even just a slug if the image is in Docker Hub.

                          Similarly, back in the day, many chroot power users had a script to copy a basic file structure to a folder and run chroot. I think most people would want this. Even if unconsciously. A command that does the complicated parts with a simple porcelain.

                    2. 9

                      No. If one needs Docker-compatible container images, then nixpkgs’ dockerTools is a straightforward way to declaratively build images which skips all of the painful parts of Docker listed in the article. Even container size is automatically amortized, and container layouts are automatically optimized for caching.

                      1. 5

                        One big misconception that made it into the “It’s not perfect but better than nothing” section concerns layering, which is strictly cumulative: once a layer contains certain files, the final image will carry that data, even if you remove them in a later RUN command. Using dive will show you where your mistakes are. Yikes, even the mentioned hadolint would show that issue, I believe.
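
                        The cumulative behaviour can be imitated without Docker. The sketch below is a simulation, not a real image build: it packs two directories as layer tarballs, with the second one holding only a whiteout marker (OCI layers record a deletion as a `.wh.<name>` entry), and adds up what would be shipped. The function and file names are made up for illustration.

```shell
#!/bin/sh
# Simulate two image layers and print the total bytes that would be shipped.
layer_demo() {
    work=$(mktemp -d)
    mkdir "$work/layer1" "$work/layer2"
    # Layer 1: stands in for a RUN step that leaves 1 MiB of temporary data.
    head -c 1048576 /dev/zero > "$work/layer1/cache.bin"
    # Layer 2: a later "RUN rm cache.bin" only records a whiteout marker.
    : > "$work/layer2/.wh.cache.bin"
    tar -C "$work/layer1" -cf "$work/layer1.tar" .
    tar -C "$work/layer2" -cf "$work/layer2.tar" .
    # Both tarballs are part of the image, so the deleted bytes remain in it.
    total=$(( $(wc -c < "$work/layer1.tar") + $(wc -c < "$work/layer2.tar") ))
    rm -rf "$work"
    echo "$total"
}

echo "bytes shipped across both layers: $(layer_demo)"
```

                        The total stays above 1 MiB even though the merged filesystem no longer contains cache.bin, which is why creating and removing temporary files has to happen within a single RUN step.
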

                        I’ve seen someone using gitlab-ci/bazel/debootstrap/saltstack to build containers without writing Dockerfiles, but still using a layering & caching mechanism. Took several months to implement, but was definitely quite some gem once it worked - we even managed to use the same output to provision bare-metal servers afterwards.

                        The thing that powers distroless is bazelbuild/rules_docker, it’s quite neat but seeing how nixery and nix/os solve everything in a coherent fashion makes me wonder how others will ever succeed as an alternative - so I’m really excited about Ariadne’s mentioning of building “distroless for Alpine Linux”.

                        Overall, I think for writing regular Dockerfiles (for Python applications) - pythonspeed.com gives the best overview of common pitfalls.

                        1. 4

                          Output of hadolint. I don’t get why you would give such strongly opinionated advice and then break your own advice in the same article?

                          -:4 DL3008 warning: Pin versions in apt get install. Instead of `apt-get install <package>` use `apt-get install <package>=<version>`
                          -:4 DL3015 info: Avoid additional packages by specifying `--no-install-recommends`
                          -:4 DL3009 info: Delete the apt-get lists after installing something
                          -:5 DL3059 info: Multiple consecutive `RUN` instructions. Consider consolidation.
                          -:5 DL3013 warning: Pin versions in pip. Instead of `pip install <package>` use `pip install <package>==<version>` or `pip install --requirement <requirements file>`
                          -:5 DL3042 warning: Avoid use of cache directory with pip. Use `pip install --no-cache-dir <package>`
                          -:6 DL3059 info: Multiple consecutive `RUN` instructions. Consider consolidation.
                          -:6 DL3013 warning: Pin versions in pip. Instead of `pip install <package>` use `pip install <package>==<version>` or `pip install --requirement <requirements file>`
                          -:6 DL3042 warning: Avoid use of cache directory with pip. Use `pip install --no-cache-dir <package>`
                          -:15 DL3003 warning: Use WORKDIR to switch to a directory
                          -:19 DL3059 info: Multiple consecutive `RUN` instructions. Consider consolidation.
                          -:20 DL3059 info: Multiple consecutive `RUN` instructions. Consider consolidation.
                        2. 4

                          I think what is missing is a kind of lock file to reproducibly rebuild the image. Always saving the full image and deliberately updating each version of every base image or dependency is not ideal.

                          (I know it is impossible with the current approach)

                          1. 1

                            For anything Debian (https://snapshot.debian.org) or nix related, you can pin to a specific release of the repo.

                          2. 3

                            You shouldn’t use latest but you also shouldn’t be using other normal tags.

                            You probably shouldn’t use latest but you should totally use version tags and upgrade vigorously/continuously. The typical consequence of fixating on a specific sha version is that after 3 years of development, you’re still on that three-year-old sha version.

                            Fixing tiny problems all the time should be preferred to fixing large problems few times a year. Or huge problems every 5 years.

                            apt-get is a problem. First, don’t run apt-get upgrade otherwise we just upgraded all the packages and defeated the point. We want consistent, replicable builds.

                            No, it is not a problem. If apt-get update && apt-get upgrade breaks something in your workflow, then either your distro is totally broken or you’re doing something totally wrong or niche (in which case it would be ok, of course).

                            Always do apt-get update && apt-get upgrade as the first step of every Dockerfile, or at least in the beginning of your base container.

                            1. 3

                              “Less && and && \ please. This one isn’t your fault but sometimes looking at complicated Dockerfiles makes my eyes hurt.”

                              He’s never run into the fun of exceeding the max number of layers in a container, I see.

                              1. 6

                                What helps in terms of readability is setting proper shell options within the Dockerfile.

                                So putting SHELL ["/usr/bin/env", "bash", "-euo", "pipefail", "-c"] on top can help in these cases. Depending on who is consuming your Docker image, you might want to revert that to the previous SHELL afterwards though.
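
                                To see what the pipefail part of that SHELL line buys you, here is a small sketch that runs the same failing pipeline under bash with and without the option. The two function names are made up for illustration, and it assumes bash is on the PATH.

```shell
#!/bin/sh
# Print the exit status of "false | true" under bash's default options,
# where the status of the last command in the pipeline wins.
without_pipefail() {
    rc=0
    bash -c 'false | true' || rc=$?
    echo "$rc"
}

# Same pipeline with pipefail: a failure anywhere in the pipeline is reported.
with_pipefail() {
    rc=0
    bash -o pipefail -c 'false | true' || rc=$?
    echo "$rc"
}

echo "default:  $(without_pipefail)"
echo "pipefail: $(with_pipefail)"
```

                                In a Dockerfile this matters for RUN steps that pipe a download into another command: without pipefail a failed download can go unnoticed as long as the last command in the pipe succeeds.
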