1. 21
  1.  

  2. 19

    I really dislike the containers == performance meme. Containers on Linux have never been about performance. They were designed for packing as many jobs into each machine as possible. With enough machines, reducing overhead is worth the performance hit.

    VMs can’t share physical memory without invasive changes in the guest OS. And the guest OS itself has its own memory overhead. Containers solve these problems, at the expense of stressing the scalability of the host OS’s internal data structures and mechanisms.

    Unless job packing saves you enough resources to hire at least one engineer to manage whatever orchestrator you choose, containers just add operational overhead.

    If containers make sense for your developer productivity, that’s great. Modern tooling affords plenty of conveniences. But they offer most people nothing for performance or scalability.

    1. 5

      I really dislike the containers == performance meme. Containers on Linux have never been about performance. They were designed for packing as many jobs into each machine as possible. With enough machines, reducing overhead is worth the performance hit.

      FWIW, I’ve never heard of this meme, and this idea runs contrary to almost every performance test we’ve done at $WORK while containerizing our microservices. Yes, putting multiple Docker containers on a single box does provide a nice benefit (though there’s nothing stopping you from shipping multiple static binaries and managing them through a service manager), but this is just a bit of an extension to Amdahl’s Law, where you’re still limited by other factors on the box.

      At least with the folks I’ve talked to, containers have always been an ergonomic benefit, one often taken at the expense of some performance. At $WORK, we thoroughly performance test containers to see if the degradation is actually worth it for a given microservice.

      1. 4

        Containers are wonderful from an ops perspective where you can focus on putting software “somewhere” because it’s nicely contained.

        1. 20

          The big problem is that ‘containers’ overload a bunch of loosely-related things:

          • The abstract idea of a program, its configuration, and its dependencies all packaged together.
          • The concrete representation of that abstraction as a distribution format created from a bunch of composed layers.
          • A mechanism for building things in that format.
          • A set of isolation mechanisms for running things distributed in that format.

          The last bit is particularly bad because it’s completely different on different platforms. On Linux, it uses namespaces + cgroups + seccomp + string + duct tape + prayer. On macOS it uses xhyve on top of the Hypervisor framework. On Windows it uses Hyper-V. On FreeBSD it uses jails or bhyve. These have very different performance characteristics.

          The original motivation for jails was that sharing kernel services would have less overhead than starting an entire new copy of the kernel, but the most recent versions of jails support having a separate copy of the network stack per jail because contention in the network stack was making it slower to run large numbers of jails than to run separate VMs. With memory ballooning / hot-plug, it’s quite feasible to spin up a load of VMs and have them dynamically adapt the amount of memory that they have. With modern container-optimised VM systems, each VM is actually a copy-on-write snapshot of a single image and so shares a load of kernel memory that doesn’t change between runs.

          The building mechanism is a somewhat unfortunate conflation because people often mean ‘Docker’, which is an awful imperative build system that makes all of the non-trivial things in software packaging (reproducible builds, provenance auditing, and so on) harder than they should be.

          The distribution format isn’t great either, because it’s based on tarballs and so isn’t naturally amenable to any kind of live validation. You can check that the hash of the tarball itself matches, but for reusable read-only layers you really want something like dm-verity, which can check every block against tampering as you use it.
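          As a rough illustration of that limitation, here’s a minimal Python sketch of the whole-tarball check that the format does allow: hash the downloaded layer once and compare it against the digest in the image manifest. The file name and digest below are placeholders, and unlike dm-verity this gives no per-block protection once the layer has been unpacked and is in use.

          ```python
          # Hypothetical sketch: verify a layer tarball against its manifest digest
          # once, up front. This is the "check the hash matches for the tarball
          # itself" step; it does nothing to detect tampering with individual
          # blocks while the unpacked layer is actually being used.
          import hashlib

          def layer_digest(path: str, chunk_size: int = 1 << 20) -> str:
              h = hashlib.sha256()
              with open(path, "rb") as f:
                  for chunk in iter(lambda: f.read(chunk_size), b""):
                      h.update(chunk)
              return "sha256:" + h.hexdigest()

          expected = "sha256:..."  # placeholder: digest taken from the image manifest
          if layer_digest("layer.tar") != expected:
              raise ValueError("layer tarball does not match its manifest digest")
          ```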

          1. 2

            The building mechanism is a somewhat unfortunate conflation because people often mean ‘Docker’, which is an awful imperative build system that makes all of the non-trivial things in software packaging (reproducible builds, provenance auditing, and so on) harder than they should be.

            Practically speaking, how many folks don’t use Docker when deploying? Much like Linux is the de facto OS for cloud services, Docker containers are the understood context for “containers”, and so the bullet points you list already have choices made for them.

            1. 2

              To the best of my knowledge, most large-scale deployments don’t use Docker. They use something like Kubernetes + containerd + runc, rather than anything using the Docker daemon or the Docker containerd shim. Docker is commonly used as the build tool, but other things are gaining popularity. In terms of deployment, VM-based isolation is increasingly common relative to the pile of namespace + cgroup + seccomp hacks, though Google is still pushing gVisor.

            2. 1

              Is there some reason these container based systems aren’t using shared memory for data transfer between machine-local components?

              Are they relying on the network stack for security & process isolation?
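
              (For concreteness, “shared memory for data transfer between machine-local components” means something like the sketch below, using Python’s multiprocessing.shared_memory; the segment name is made up, and in a containerised setup both sides would also need a shared IPC namespace or a common /dev/shm mount to see the same segment.)

              ```python
              # Toy illustration of machine-local shared-memory transfer (an assumed
              # setup, not how any particular orchestrator does it): a producer
              # writes into a named segment and a consumer attaches to it by name.
              from multiprocessing import shared_memory

              # Producer: create a named segment and write into it.
              producer = shared_memory.SharedMemory(name="local-transfer", create=True, size=1024)
              producer.buf[:5] = b"hello"

              # Consumer (another process on the same host): attach by name and read.
              consumer = shared_memory.SharedMemory(name="local-transfer")
              print(bytes(consumer.buf[:5]))  # b'hello'

              consumer.close()
              producer.close()
              producer.unlink()
              ```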

              1. 1

                Using the network stack isn’t actually that bad for jail-to-jail communication: messages over the loopback interface can bypass a lot of the stack and have a bunch of fast paths. The contention issues are more to do with network-facing services. Whether you use fine-grained locking, RCU, or anything else, there are bits of the network stack that become contention points. With the VNET work, the only shared state between two jails’ network stacks is at the ethernet layer, and that’s just a simple set of ring buffers, so it scales much better than anything above it.

                Now that SSDs are common, containerised environments are starting to see the same problems with the storage stack: anything that gives ordering guarantees between bits of a global view of a filesystem can introduce contention, and you’ll often see better performance from giving each container a separate block device and private filesystem.

            3. 8

              Just like an uberjar or a static binary. Oh wait, those don’t require an additional million-line runtime.

          2. 6

            The fact that PyStone showed a significant difference is surprising to me because PyStone doesn’t do any I/O and makes a small fixed number of syscalls, with the number of syscalls not changing at all when I change the number of iterations I ask it to do. For me it does exactly 525 syscalls regardless of whether I ask for 50k iterations or 100k iterations.

            Are you sure the python3.9 inside and outside the container are identical?

            The speed reported by PyStone doesn’t change when I run it under strace, which I believe slows syscalls down by a lot more than seccomp does. The time to start the benchmark changes from about 20ms to about 140ms, but the rate at which benchmark iterations run doesn’t noticeably change for me under strace. I see about 3% variation in the reported speed between repeated runs.

            (Disclaimer: I tested PyStone 1.1 with a tiny patch to call time.clock_gettime(time.CLOCK_MONOTONIC) instead of time.clock(), so my pystone is slightly different from yours. Still, I don’t think the inside of the benchmark loop was changed.)
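
            (For the curious, the patch just swaps the timer function. Below is a minimal sketch, assuming a local copy of pystone.py from PyStone 1.1 is importable; it monkey-patches the clock before import rather than editing the file, but the effect should be the same.)

            ```python
            # Rough sketch (assumptions noted above): PyStone 1.1 uses the long-
            # deprecated time.clock(), which was removed in Python 3.8, so point
            # it at a monotonic clock before the benchmark module binds it.
            # Your copy of the benchmark may differ in its internals.
            import time

            time.clock = lambda: time.clock_gettime(time.CLOCK_MONOTONIC)

            import pystone  # assumed: a local copy of the PyStone 1.1 benchmark

            benchtime, stones = pystone.pystones(loops=100_000)
            print(f"{stones:.0f} pystones/second over {benchtime:.3f}s")
            ```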

            1. 3
              1. Yes, it’s weird to me too.
              2. Notice that when I run with --privileged, the performance difference goes away. So it’s not the Python version.
              1. 2

                At the end of the article, the author compared the same container with and without --privileged; the GitHub issue linked in the article seems to narrow it down to seccomp as the likely culprit (and/or the possibility that enabling seccomp triggers additional Meltdown/Spectre mitigations).

                1. 3

                  The Meltdown/Spectre mitigations explanation seems plausible, see https://github.com/docker-library/python/issues/575#issuecomment-840737977

                  1. 1

                    I also read the article. The thing is, you’d expect seccomp to only affect syscalls, so it’s still surprising.

                  2. 1

                    The benchmark result looks weird to me as well, since it’s essentially a pure computation workload with a constant number of syscalls.

                  3. 4

                    I’d be interested in the output of perf record/report here, since it would presumably show what’s causing the slowdown.

                    1. 5

                      Yeah, I’m quite interested in why this is happening and would love to avoid the speculation in the article.

                    2. 1

                      I wonder which Docker image this was tested on. I seem to remember that the official Python images are slower than the Ubuntu ones because of some detail of how Python is compiled. No link handy, though.

                        1. 1

                          You got it, thanks.

                        2. 1

                          fedora:33, in order to match the host operating system and remove that as a factor.