1. 78
  1. 27

    Author here. Containers always seemed a little magical to me. So I dug into how they work and then built a “container runtime” that only uses the change root syscall (which has been in UNIX since the 70s).

    This was not only fun, but took away a bit of the magic, so I could better understand what’s going on when I use real containers.

    Let me know what you think!


    Fun fact: In my first draft of the article, I had this idea that it would be about time travel. If I went back in time, how far back could I still implement something like containers, and so I was focused on just using chroot because it’s so old.

    If people like this type of stuff, I’ll do namespaces next but pick a less inflammatory name like “How X helped me understand containers.”

    1. 6

      Edit (again): Ignore me, I’m old and when I was administering a big freebsd shared hosting platform in 2005 a chrooted jail was a very specific thing (it was a chrooted jail!). It’s clear reading around that the term has taken on broader meaning to just refer to chroot. My apologies to author, this appears to be the accepted use.

      It feels like you might be misunderstanding the term “chrooted jail”, which refers to the combination of chroot with FreeBSD’s jail. The reason I mention that is because in the footnote you then mention that cgroups offer more protections than chroot, but that’s also the point of jail. Anyway, this part confused me 😳


      On the whole I liked the post as a kind of dive into a linux syscall and what you can do with it. However I’m a little bit concerned about simplifying the idea of a container to a chroot. When I think of containers I explicitly think of them as a binding between a kind of chroot and some process controls that enforce behavior. Whether that’s BSD’s chrooted jails, illumos zones, or linux’s chroot and cgroups.

      1. 4

        I had the same initial reaction, and I thought that must be wrong, linux does not have jails. Then, I realized, of course, the reason is simply some linux people are not aware of jails at all.

        1. 8

          There are dozens of us!

        2. 2

          Oh, interesting.

          I didn’t see that it’s used to refer to BSD jails, but to using chroot on Linux. Here is an example:

          A chroot jail is a way to isolate a process and its children from the rest of the system. It should only be used for processes that don’t run as root, as root users can break out of the jail very easily.

          The idea is that you create a directory tree where you copy or link in all the system files needed for a process to run. You then use the chroot() system call to change the root directory to be at the base of this new tree and start the process running in that chroot’d environment.


          Here is another

          And you can find many explanations that seem to call the changed root a ‘chroot jail’.
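          The steps in the quoted description (build a directory tree, call chroot(), start the process) can be sketched roughly as follows. This is only an illustration: the function names `build_tree` and `run_in_chroot` are made up for this example, and it assumes a statically linked binary, since a dynamically linked one would also need its shared libraries copied into the tree.

```python
import os
import shutil


def build_tree(new_root, files):
    """Copy each host file into new_root, keeping its absolute path layout."""
    copied = []
    for src in files:
        dest = os.path.join(new_root, src.lstrip("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied


def run_in_chroot(new_root, argv):
    """Change root to new_root, then replace this process with argv.

    Requires root privileges. argv[0] is resolved inside the new root,
    so it must have been copied there first.
    """
    os.chroot(new_root)
    os.chdir("/")  # otherwise the working directory still points outside
    os.execv(argv[0], argv)


# Hypothetical usage, run as root:
#   build_tree("/tmp/jail", ["/bin/busybox"])
#   run_in_chroot("/tmp/jail", ["/bin/busybox", "sh"])
```

          Note that this sketch never drops root, so on its own it has exactly the weakness the quote warns about.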

          1. 4

            I updated my first comment, a quick search shows that you are very much using this term how folks expect. Unfortunately I worry this might be a case of getting old.

            1. 2

              Oh that makes me very sad. That seems like it has got to be a kind of false appropriation of one community’s terminology into another, right? 😰

              1. 3

                I’m not sure what the exact chronology of FreeBSD jails is, but I was definitely calling chroot environments “chroot jails” in the late 90s/early 00s. So maybe you’re both old and young 😁

                Edit: neck-and-neck! Looks like they were released in March 2000

                1. 2

                  Jails landed in FreeBSD in ‘99. The place I was using them I think opened in 2000-2001. Now I don’t believe anything about my own memory. Except that, at least in the circles I was running in, a chrooted jail meant something different than being chrooted haha.

                  1. 6

                    https://www.cheswick.com/ches/papers/berferd.pdf This paper was written ~ 1992

                    On 7 January 1991 a cracker, believing he had discovered the famous sendmail DEBUG hole in our Internet gateway machine, attempted to obtain a copy of our password file. I sent him one. For several months we led this cracker on a merry chase in order to trace his location and learn his techniques.

                    This paper is a chronicle of the cracker’s “successes” and disappointments, the bait and traps used to lure and detect him, and the chroot “Jail” we built to watch his activities

                    I would guess (based on no evidence at all) that the name “jail” for the *BSD feature was probably inspired by existing usage. My recollection of the 1990s was that it was fairly common to refer to setting up chroot jails for things like anonymous ftp servers

                    1. 2

                      Nice! So the BSD jail would have been like “now you can REALLY jail your jail” :-D

          2. 3

            sailor[0] and captain[1] implement pseudo-containers for NetBSD using chroot.

            [0] https://gitlab.com/iMil/sailor [1] https://gitlab.com/jusavard/captain/

            1. 2

              This is a really cool article, thanks for writing it. Do you have thoughts on newer system calls and how they can be used to improve isolation? I’m thinking of pledge specifically from here but I know there are a ton of others as well.

              1. 1

                Thanks! I really don’t know much about pledge, but I think Andreas Kling has some videos on adopting it for Serenity that have been on my to-watch list.


              2. 2

                The difference between chroot and containers was the marketing splash, but that marketing spend could’ve been a dud. The tooling and arguably saner defaults around docker was what sold it. chroot (jails) were the province of the FreeBSD folks (and maybe Solaris) and linux only had chroot with less protections. chroot ended up being known more as a security flaw (due to misuse) than as a feature: https://cwe.mitre.org/data/definitions/243.html but docker, et al have kept enough sane defaults to make container escapes conference talk fodder but not large-scale malicious actor toolkit. Small scale state actors, who knows.

              3. 13

                Although, as many other commenters have already noted, a container is more than a chroot, this article nonetheless helps in dispelling some magic behind containers…

                And dispelling the containers magic is a good thing, because many assume a container is actually more akin to a VM than to a glorified chroot with lots of isolation and mappings on-top…

                Thus in the end, although the article has a lot of technical shortcomings, especially security-related ones such as forgetting to state that a process (especially one running as root) can easily escape a chroot, it is a good start in the right direction. :)

                1. 15

                  Yeah… that title is misleading.

                  Sure, container images are tarballs (as they should be) and chroot is part of the process. But there’s also a lot more going on, which you allude to but kind of dismiss.

                  If you want a deeper version of this blog concept written a long time ago by an expert, https://ericchiang.github.io/post/containers-from-scratch/

                  1. 5

                    * if you ignore networking (or parts of it), and using another distro, or shipping the product to another box/host OS.

                    Yes, many things were doable with chroot and in practice no one did it because it was too tedious except for some ad-hoc local use cases.

                    I’m not a huge fan of containers but I kinda disagree here. (FWIW, skimming the article didn’t change my mind after forming an opinion after reading just the headline)

                    Not saying it’s a bad article at all, but the premise and headline aren’t good.

                    1. 8

                      and using another distro, or shipping the product to another box/host OS.

                      I can agree about networking. But trying another distro was pretty much the thing most people used chroot for back in the day. Some distros even recommended it and had instructions for doing that to try things out before installing on metal.

                      1. 1

                        I know, but I’ve often not had a great experience - but I guess it was a different time where kernels weren’t quite so complete and uniform across “modern” distros. Think 2.x era.

                        1. 1

                          It was indeed a practice that required an order of magnitude or two more knowledge than docker, which encourages treating everything as black boxes.

                          What’s a tragedy is how if you deploy a bunch of HTTP servers on a fat machine, each in its own chroot, everyone will call you crazy. But slap the word Docker onto it (which does essentially the same) and all of a sudden you are a cloud wizard.

                    2. 5

                      “It’s just processes!” That’s what I used to say a few years ago when people were comparing containers to VMs (all the reproducibility rage was on vagrant back then (remember? ;))

                      So this article is already getting a bit old, although it’s a great article that I would have loved to read 3 -or more- years ago!

                      But the real reason I’m commenting is that, as already pointed out by others, containers are not “just chrooted processes” anymore. It’s all the rest that comes along (plus the need to increase server density). Actually it’s so much “all the rest”, that the same cloud density can be achieved using VMs now (see https://github.com/firecracker-microvm/firecracker-containerd for example) and yet we still keep the container idea.

                      1. 8

                        Hey, thanks for reading!

                        I’m the author and plead guilty to writing a title that is a bit of a stretch. I think if I had to rename the article it would be something like “How chroot helped me understand containers” because I get that namespaces and cgroups and unionfs exist, but it really was a breakthrough in understanding for me to think of containers as a chrooted process. My goal is to share that understanding and perhaps, in a future article, layer on other things.

                        1. 5

                          Nothing to be guilty of, the article was great, whatever the title. My point mainly is that containers are evolving, and it might be that “containers are just linux processes” is soon to be not true anymore (at least not for everything). Firecracker-containerd are VMs, wasm-shim are wasm module instances running in a wasm VM but the vast majority is still good old processes behind execve+flags/chroot syscalls. For how long is the question.

                          And I think you achieved your goal of sharing good stuff :) thanks for writing!

                          1. 3

                            Firecracker is on my to-look-into list!

                            But also, truly thanks for the compliment on the article. I was starting to dread the feedback on this one. Receiving critical feedback is not my strength.

                        2. 4

                          Container is a very overloaded term. It is used to describe both a packaging and deployment system, and an isolation mechanism. Most containers are run by containerd these days and it has various pluggable back ends. You can use runc to run them isolated on Linux using namespaces, cgroups, and seccomp-bpf, or runj to run them isolated on FreeBSD using jails, but you can also deploy them in separate VMs on Hyper-V or various different KVM-based systems (including the Firecracker one that you link to). There’s even a back end for OpenBSD’s hypervisor.

                          It’s a shame that containers are often used to mean shared-kernel isolation because the packaging, distribution, and deployment models that the term encompasses are far more interesting and influential parts of the ecosystem. FreeBSD lost a big chunk of the server market by missing this and thinking it was just about jails.

                        3. 5

                          I’m super impressed with the author doing a long post on containers and chroot and not once mentioning FreeBSD jails, nor solaris zones. Although it does sum up the general NIH syndrome and blinders of most of the linux community.

                          1. 4

                            Earthly is multi-stage dockerfiles with a marketing budget then?

                            1. 3

                              As others have pointed out, this is not yet a container. This is a chroot’d environment. You also need to isolate the process tree into its own kernel namespaces (see https://en.wikipedia.org/wiki/Linux_namespaces).
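                              One way to see what namespace isolation adds: on Linux, every process exposes its namespace memberships as symlinks under /proc/<pid>/ns, and two processes in the same namespace share the same link target. The sketch below (helper names are illustrative, not from the article) lists them and reports where two processes differ. A process that has only been chroot’d still shares every namespace with its parent.

```python
import os


def namespaces(pid="self"):
    """Map namespace kind (pid, net, mnt, ...) to its identifier string.

    Reads the symlinks in /proc/<pid>/ns. Linux only; returns an empty
    dict when that directory is missing or unreadable.
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result
    for name in entries:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # reading another user's process may be denied
    return result


def differing_namespaces(pid_a, pid_b):
    """Return the namespace kinds in which the two processes differ."""
    a, b = namespaces(pid_a), namespaces(pid_b)
    return sorted(kind for kind in a.keys() & b.keys() if a[kind] != b[kind])
```

                              Comparing your shell against a process started by a container runtime would typically list kinds such as pid, mnt, and net, while comparing it against a plain chroot’d child would return an empty list.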

                              1. 3

                                For the title I’d write something like “little more than chroot”, because it’s a little more than just standard chroot, but depending on the container or what you actually use it for really not that much.

                                In my experience the main use case, what people really want from containers really is well covered with what chroot does. That main use case being “I want exactly those libraries”. For most applications having a static or fat binary (as Java’s fat binaries, Go with embedded files, Deno has something too, etc.) would do the trick as well, where the developer is happy that the thing “just runs” and the sysadmin/SRE/hosting person is happy that they don’t need worry about application specifics. If then this binary would be stripped of some permissions for a lot of situations this would work, when a bit more is needed you add the chroot and when even more is needed you virtualize the network stack, do some accounting, limit resources, etc.

                                I think the big thing with containers is not so much about chroot, but that Docker became so widespread that there are a lot of standard ways of doing stuff, which means you only need to hack a bit if you leave that standard path, and even those hacks are somewhat well understood, even though they often “break” a lot of security when thinking about containers as jails.

                                In other words, while yes it’s not just chroot, I think that this simplification tends to be what people mostly think about when they think about containers. The other part is borderline not containers and that’s a form of registry that works pretty much like a package manager making building, uploading and deploying relatively simple.

                                It’s also why I hope that this space will get more contested, by simpler implementations, and using them differently. It feels like while Docker helped with “standardizing”, just doing copies of it with some variations in technical details might be a bit limited. With everything optimized around it switching becomes hard, even when someone comes around with innovation in that area they’ll be forced to keep compatibility to find serious adoption. Not saying that keeping compatibility or sticking to standards is bad, but merely that it can make it harder to innovate at times.

                                On top of that, a lot of the OCI members, for example, have an interest in not too much changing, and not just in terms of technical interfaces. The majority are in a market position where their goal is to keep the status quo in the industry.

                                1. 2

                                  Namespaces mean when you start a container, you can’t see it in your process list

                                  That’s not really right. You can usually see the whole process tree of your containers from outside - with all the benefits, like attaching a profiler from the host system.

                                  Also the minimal implementation reminded me of bocker https://github.com/p8952/bocker

                                  1. 1

                                    i can somewhat relate to the article, but well… Sure, the concept of providing a per-process environment is nothing new, and chroot has been there forever. Heck, in the early 2000s i once maintained an apache plugin in Debian that set up virtual chroots (libapache2-mod-chroot) within the Apache process for easier separation of virtual hosts, but that’s not nearly what docker or other container solutions do for you nowadays.

                                    Yes, the concept MAY be the same, but now it’s kinda chroot on steroids, if you like. The whole ecosystem around it (cgroups, especially networking namespaces) makes it much more usable.

                                    Edit: if i explain the differences between containers and virtual machines, i usually also start explaining about chroot and separating process environments, so yes, the article has some point here.

                                    1. 1

                                      When I run Docker, it’s running a Linux process on my Mac. That’s not just a matter of chroot.

                                      1. 4

                                        No. But what if I told you it was “just” a chroot in a Linux VM?

                                        1. 1

                                          Exactly my thinking. And no one has commented on my footnote on mac native ‘containers’, but osx supports chroot, so it’s totally possible to have native mac containers. They will simulate linux prod less well, but would be super low overhead.

                                          1. 3

                                            chroot is not sufficient for shared-kernel virtualization. It’s worth reading the Jails and Zones papers (or watching Bryan Cantrill’s excellent Papers We Love talk about them).

                                            In your examples, root does the chroot, but your chroot does not contain /etc/passwd and su and so you cannot drop privileges. A root user inside a chroot can use the mknod or mount system calls to mount device nodes and then get complete read-write access to the raw device underlying the disk, and so on. Shared-kernel virtualisation needs to constrain root. There are also some non-filesystem namespaces, such as SysV IPC, network ports, and so on that also need to be constrained (Bryan points out that Jails left this to future work, Zones then did it. Jails then gained VNET support, allowing a completely separate instance of the network stack for the jail, which improved performance by removing contention on some network stack structures).

                                            On XNU, the sandbox framework probably could be used to constrain some of these things.

                                            Note: I am not using the term ‘containers’ here, because containers are a packaging and distribution model and do not necessarily imply shared-kernel virtualisation, containers can also run in separate VMs or be deployed with no isolation at all.
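                                            The privilege-dropping point above can be illustrated with a common pattern: resolve the target user on the host first, then chroot, then give up root without needing /etc/passwd or su inside the new root. This is a rough sketch with made-up function names, and it only addresses root inside the chroot, not the non-filesystem namespaces mentioned above.

```python
import os
import pwd


def lookup_ids(username):
    """Resolve uid and gid on the host, before the chroot hides /etc/passwd."""
    entry = pwd.getpwnam(username)
    return entry.pw_uid, entry.pw_gid


def chroot_and_drop(new_root, uid, gid):
    """Enter new_root, then give up root for good. Must be called as root."""
    os.chroot(new_root)
    os.chdir("/")      # make sure the working directory is inside the new root
    os.setgroups([])   # drop supplementary groups while still privileged
    os.setgid(gid)     # group before user: setgid fails once the uid is non-root
    os.setuid(uid)     # an unprivileged uid cannot chroot, mount, or create device nodes


# Hypothetical usage, run as root ("nobody" is an assumed account name):
#   uid, gid = lookup_ids("nobody")
#   chroot_and_drop("/tmp/jail", uid, gid)
#   os.execv("/bin/busybox", ["/bin/busybox", "sh"])
```

                                            The order matters: the lookup has to happen before `os.chroot`, and `os.setuid` has to come last.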

                                            1. 2

                                              Maybe we are talking about different things. I was just suggesting that many people use a docker-compose on a mac to start up a bunch of deps. This ends up involving a linux VM that is hidden from sight. But you could do this with an image format that you start up with chroot, which contains the native mac deps. It wouldn’t be a container exactly, but a distribution framework and way to start up mac native versions of things. There are downsides to it, but also pluses and it is possible.

                                              (There is also sandbox-exec which might be useful for you know actual sandboxing )

                                              1. 1

                                                Please re-read the last paragraph of my post. The problem with a lot of your article is that you are conflating shared-kernel virtualisation (a family of techniques for building isolated namespaces for processes on a single kernel) with containers (a packaging and distribution model that depends on some isolation model that provides an isolated namespace).

                                                Because the main use case for Docker on macOS is to provide a development environment for people deploying things on Linux servers, it uses a port of FreeBSD’s bhyve to run a Linux VM as the isolation mechanism. Docker and containerd on Windows can use Hyper-V in a similar way to run both Windows and Linux containers and can also isolate Windows containers with shared-kernel virtualisation.

                                                You could provide isolation on macOS with the sandbox framework, but this does not allow namespace isolation. Chroot provides a subset of this. Please read the two papers that I mentioned or watch Bryan’s talk about them to understand what is missing.

                                                Kernels like Mach and Zircon completely elide a global namespace from their core abstractions and so are trivial to build shared-kernel virtualisation on top of, because ‘global’ namespaces are all things that are introduced by a rendezvous server that is provided to new processes on process creation. On traditional UNIX kernels it requires extra indirection.

                                                1. 2

                                                  You could provide isolation on macOS with the sandbox framework, but this does not allow namespace isolation.

                                                  I understand this, and I’ve seen the talk. My article didn’t use namespaces and cgroups on purpose, both because that has been done before and because I was trying to give an intuition to people using the simplest starting block. It says it’s a simplification but one I find useful in the first sentence and whole intro.

                                                  Note: I am not using the term ‘containers’ here, because containers are a packaging and distribution model

                                                  Images and the registry standard are the distribution model, in my thinking, containers are an instance of them running, but anyhow I don’t think the semantics of the terms is important here.

                                                  I’m certain that you know more about shared-kernel virtualization than me. I’m not questioning that. My article was just about “Hey, a process running on the same machine is a useful way for people unfamiliar with how containers work to think about them”. In my thinking, however runc is implemented matters not for that. The intended audience and level of rigor might just be very different than you were expecting and I am sorry if it contains inaccuracies or is loose with terminology.

                                          2. 1

                                            The fact that you can seamlessly move from “basically chroot” to a full-blown VM without noticing is a qualitative difference, don’t you think?

                                            1. 1

                                              Was kind of tongue in cheek. Also: alias chrooot=docker run. Problem solved.

                                          3. 1

                                            When you run Docker on macOS to my knowledge you run a VM (xhyve, a port of FreeBSD’s bhyve to macOS) in which “the actual stuff” happens or did that change?

                                            1. 1

                                              I haven’t dug into this too closely because of the risk of barfing all over my keyboard when I find out how it actually works, but it would appear that when you run Docker on macOS M1 not only do you have xhyve, you also have qemu to emulate x86-64 so that[*] all the random images around the net work on your machine. There’s probably bubble gum and duct tape in there too.

                                              [*] not all. Some of them die with segmentation faults, apparently because the CPU that QEMU is emulating is not the same variety of CPU that J Random Docker Image Publisher used to compile his image, and some instructions aren’t emulated.

                                          4. 1

                                            I remember another in-depth article talking about this stuff which showed off ways of running host-side binaries inside the container pretty easily. Google is failing me but it’s definitely interesting to peel back the abstraction (at least on Linux, since Mac users end up with the VM layer that kinda breaks everything)