Great article… In other words, Nix is a leaky abstraction and it bottoms out at shell scripts, e.g. a wrapper to munge the linker command line to set RPATH to /nix/store.
This doesn’t appear in your code / package definitions and is hard to debug when it goes wrong.
Nix was designed and developed before Linux containers, so it had to use the /nix/store mechanism to achieve its goals (paths should identify immutable content, not mutable “places”).
So I wish we had a Nix-like system based on containers … (and some kind of OverlayFS and /nix/store hybrid). Related:
https://lobste.rs/s/psfsfo/curse_nixos#c_ezjimo
https://lobste.rs/s/ui7wc4/nix_idea_whose_time_has_come#c_5j3zmc
I don’t, I like running nix on macos. How would your “solution” of running linux binaries and filesystems on macos work exactly? This whole blog post amounted to using the wrong lld; I don’t see how “let’s use containers” is a good fix for the problem at hand.
This is the proper sentiment.
See this reply … I have a specific problem of running lightweight containers locally and remotely, and transferring them / storing them, and am not trying to do anything with OS X:
https://lobste.rs/s/vadunt/rpath_why_lld_doesn_t_work_on_nixos#c_pb8cpo
The background is that Oil contributors already tried to put Oil’s dev env in shell.nix, and run it on CI.
However my own shell scripts that invoke Docker/podman end up working better across Github Actions, sourcehut, local dev, etc. Docker definitely has design bugs, but it does solve the problem … I like that it is being “refactored away” and orthogonalized
However my own shell scripts that invoke Docker/podman end up working better across Github Actions, sourcehut, local dev, etc. Docker definitely has design bugs, but it does solve the problem … I like that it is being “refactored away” and orthogonalized
It’s not though, you’re not solving the problem for “not linux” if you put everything in docker/containers. nix the package manager is just files and symlinks at the end of the day. It runs on more than just linux; any linux-only “solution” is just that, not a solution that actually fits the same problem space as nix, which includes freebsd, macos, and linux.
The background is that Oil contributors already tried to put Oil’s dev env in shell.nix, and run it on CI.
And you can create containers from nix package derivations, and your linked comment doesn’t really explain your problem. I’ve used nix to create “lightweight” containers; the way nix stores files makes it rather nice as it avoids all that layering rubbish. But without a clear understanding of what exactly you’re talking about, it really seems to be unrelated to this post entirely. How do you test Oil on macos/freebsd via CI? Even with those operating systems and nix, it’s its own world, so you’d still have to test in and out of docker. Or am I misunderstanding? I’m still unclear what problem you are trying to solve and how it relates to rpath in a linker on nix, and how containers solve it.
Yeah it’s true, if you have an OS X or BSD requirement then what I’m thinking of isn’t going to help, or at least it has all the same problems that Docker does on OS X (I think it runs in a VM).
The history of the problem is long, it’s all on https://oilshell.zulipchat.com/ and there is a shell.nix here
https://github.com/oilshell/oil/blob/master/shell.nix
which is not widely used. Instead the CI now uses 5 parallel jobs with 5 Dockerfiles, which I want to refactor into something more fine-grained.
https://github.com/oilshell/oil/tree/master/soil
I would have liked to have used Nix but it didn’t solve the problem well … It apparently doesn’t solve the “it works on my machine” problem. Apparently Nix Flakes solves that better? That is, whether it’s isolated/hermetic apparently depends on the build configuration. In contrast, Bazel (with all its limitations) does pretty much solve that problem, independent of what your build configuration is.
I think the setup also depended on some kind of experimental Nix support for Travis CI (and cachix?), and then Travis CI went down, etc.
I would be happy if some contributor told me this is all wrong and just fixed everything. But I don’t think that will happen because there are real problems it doesn’t address. Maybe Nix flakes will do it but in the meantime I think containers solved the problem more effectively.
It’s better to get into the details on Zulip, but shells are extremely portable so about 5% of the build needs to run on OS X, and that can be done with tarballs and whatnot, because it has few dependencies. The other 95% is a huge build and test matrix that involves shells that don’t exist on OS X like busybox ash, and many many tools for quality and metaprogramming.
Shells are also very low level, so we ran into the issue where Nix can’t sandbox libc. The libc on OS X pokes through, and that’s kind of fundamental as far as I can tell.
This problem is sufficiently like most of the other problems I’ve had in the past that I’m willing to spend some time on it … e.g. I’m interested in distributed systems and 99% of those run Linux kernels everywhere. If you’re writing OS X software or simply like using OS X a lot then it won’t be interesting. (I sometimes use OS X, but use a Linux VM for development.)
paths should identify immutable content, not mutable “places”
Dynamic linking to immutable content isn’t actually dynamic linking, it’s more like “deferred linking”. Optimize the deferred part away and you are back at static linking. The next optimization is to dedup libraries by static linking multiple programs into multicall binaries.
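Roughly, a multicall binary dispatches on the name it was invoked as – the same trick busybox pulls with argv[0] – so many “programs” share one image. A minimal sketch of the dispatch in shell (names are placeholders; install one file and symlink the applet names to it):

#!/bin/sh
# toolbox: one image, many names, e.g. ln -s toolbox hello; ln -s toolbox bye
case "$(basename "$0")" in
  hello) echo "hello applet" ;;
  bye)   echo "bye applet" ;;
  *)     echo "unknown applet: $0" >&2; exit 1 ;;
esac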
Do containers allow any kind of fs merging at the moment? I.e. similar results to overlayfs or merged symlinks dir from nix itself? I thought namespaces only allow basic mapping, so I’m curious where they could help.
Yeah I think you might want a hybrid of OverlayFS and bind mounts. There was an experiment mentioned here, and it was pointed out that it’s probably not a good idea to have as many layers as packages. Because you could have more than 128 packages that a binary depends on, and the kernel doesn’t like that many layers:
https://lobste.rs/s/psfsfo/curse_nixos#c_muaunf
Here is what I’m thinking with bind mounts. I haven’t actually done this, but I’ve done similar things, and maybe some system already works like this – I wouldn’t be surprised. Let me know if it has already been done :)
Say you want to install Python 3.9 and Python 3.10 together, which Nix lets you do. (As an aside, the distri experiment also lets you do that, so it also has these RPATH issues to solve, mentioned here: https://michael.stapelberg.ch/posts/2020-05-09-distri-hermetic-packages/ )
So the point is to avoid all the RPATH stuff, and have a more “stock” build. IME the RPATH hacks are not necessarily terrible for C, but it gets worse when you have dynamic modules in Python and R, which are shared libraries, which depend on other shared libraries. The build systems for most languages are annoying in this respect.
So you build inside a container, bind mounting both the tarball and the output /mydistro, which is analogous to /nix/store. And then do something like:
configure --prefix /mydistro/python && make && make install
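A rough sketch of what that build step could look like with bubblewrap (/repo and /mydistro are the hypothetical paths from this scheme, the unpacked source tree is a placeholder, and it assumes an empty /mydistro mount point exists on the host):

# run from inside the unpacked Python source tree
# one-time host setup (placeholder paths): mkdir -p /mydistro /repo/python-3.10
bwrap \
  --ro-bind / / \
  --dev /dev --proc /proc \
  --tmpfs /mydistro \
  --bind /repo/python-3.10 /mydistro/python \
  --bind "$PWD" "$PWD" \
  --chdir "$PWD" \
  /bin/sh -c './configure --prefix=/mydistro/python && make && make install'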
But on the host side, /mydistro/python is actually /repo/python-3.9 or /repo/python-3.10.
So then at runtime, you do the same thing – bind mount /repo/python-3.9 as /mydistro/python
So then on the host you can have python 3.9 and 3.10 simultaneously. This is where the package manager downloads data to.
But apps themselves run inside a container, with their custom namespace. So with this scheme I think you should be able to mix and match Python versions dynamically because the built artifacts don’t have version numbers or /nix/store/HASH in their paths.
I would try bubblewrap first – I just tried it and it seemed nice.
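So the runtime would end up as something like this sketch (same hypothetical paths as above, assuming the same empty /mydistro mount point on the host):

# run against Python 3.9 ...
bwrap --ro-bind / / --tmpfs /mydistro \
  --ro-bind /repo/python-3.9 /mydistro/python \
  /mydistro/python/bin/python3 "$@"

# ... or against 3.10, with no rebuild and no RPATH in the binaries
bwrap --ro-bind / / --tmpfs /mydistro \
  --ro-bind /repo/python-3.10 /mydistro/python \
  /mydistro/python/bin/python3 "$@"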
There are obviously a bunch of other issues. The distri experiment has some notes on related issues, but I think this removes the RPATH hacks while still letting you have multiple versions installed.
If anyone tries it let me know :) It should be doable with a 20 line shell script and bubblewrap.
I actually have this problem because I want to use Python 3.10 pattern matching syntax to write a type checker! That was released in October and my distro doesn’t have it.
Right now I just build it outside the container, which is fine. But I think having apps explicitly limited to a file system with their dependencies mounted has a lot of benefits. It is a middleground between the big blob of Docker and more precise dependencies of Nix.
Yeah I think you might want a hybrid of OverlayFS and bind mounts. There was an experiment mentioned here, and it was pointed out that it’s probably not a good idea to have as many layers as packages. Because you could have more than 128 packages that a binary depends on, and the kernel doesn’t like that many layers:
The problem with overlay / union filesystems is that the problem that they’re trying to solve is intrinsically very hard. If a file doesn’t exist in the top layer then you need to traverse all lower layers to try to find it. If you can guarantee that the lower layers are immutable then you can cache this traversal and build a combined view of a directory once but if they might be mutated then you need to provide some cache invalidation scheme. You have to do the traversal in order because a file can be created in one layer, deleted in a layer above (which requires you to support some notion of whiteout: the intermediate FS layer needs to track the fact that the file was deleted), and then re-added at a layer above. You also get exciting behaviours if only part of a file is modified: if I have a 1GiB file and I modify the header, my overlay needs to either copy the whole thing to the top layer, or it needs to manage the diff. In the latter case, this gets very exciting if something in the lower layer modifies the file. There are a lot of corner cases like this that mean you have to either implement things in a very inefficient way that scales poorly, or you have surprising semantics.
This is why containerd uses snapshots as the abstraction, rather than overlays. If you have an overlay / union FS, then you can implement snapshots by creating a new immutable layer and, because the layer is immutable, you won’t hit any of the painful corner cases in the union FS. If you have a CoW filesystem, then snapshots are basically free. With something like ZFS, you can write a bunch of files, snapshot the filesystem, create a mutable clone of the snapshot, write / delete more files, and snapshot the result, and so on. Each of the snapshot layers is guaranteed immutable and any file that is unmodified from the previous snapshot shares storage. This means that the top ‘layers’ just have reference-counted pointers to the data and so accesses are O(1) in terms of the number of layers.
The one thing that you lose with the snapshot model is the ability to arbitrarily change the composition order. For example, if I have one layer that installs package A, one that installs packages B on top, and one that installs package C on top, and I want a layer that installs packages A and C, then I can’t just combine the top and bottom layers, I need to start with the first one and install package C. Something like Nix can probably make the guarantees that would make this safe (that modified in the middle layer is modified by the application of the top layer), but that’s not possible in general.
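For concreteness, the snapshot/clone pattern described above looks roughly like this with ZFS (dataset names invented):

zfs create tank/img-a                  # writable dataset; install package A into it
zfs snapshot tank/img-a@1              # immutable layer: A
zfs clone tank/img-a@1 tank/img-ab     # mutable clone that shares the blocks of A
# ... install package B into tank/img-ab ...
zfs snapshot tank/img-ab@1             # immutable layer: A+B
zfs clone tank/img-ab@1 tank/img-abc   # and so on for package C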
Hm yeah I have seen those weird issues with OverlayFS but not really experienced them … The reason I’m interested in it is that I believe Docker uses it by default on most Linux distros. I think it used to use block-based solutions but I’m not entirely clear why they switched.
The other reason I like it is because the layers are “first class” and more amenable to shell scripting than block devices.
And yes the idea behind the “vertical slices” is that they compose and don’t have ordering, like /nix/store.
The idea behind the “horizontal layer” is that I don’t want to bootstrap the base image and the compiler myself :-/ I just want to do apt-get install build-essential.
This is mainly for “rationalizing” the 5 containers I have in the Oil build, but I think it could be used to solve many problems I’ve had in the past.
And also I think it is simple enough to do from shell; I’m not buying into a huge distro, although this could evolve into one.
Basically I want to make the containers more fine-grained and composable. Each main() program should have its own lightweight container, a /bin/sh exec wrapper, and then you can compose those with shell scripts! (The continuous build is already a bunch of portable shell scripts.)
Also I want more sharing, which gives you faster transfers over the network and smaller overall size.
I am pretty sure this can solve my immediate problem – whether it generalizes I’m not sure, but I don’t see why not. For this project, desktop apps and OS X are out of scope.
(Also I point out in another comment that I’d like to learn about the overlap between this scheme and what Flatpak already does, i.e. the build tools and runtime, and any notion of repository and network transfer. I’ve already used bubblewrap.)
Hm yeah I have seen those weird issues with OverlayFS but not really experienced them … The reason I’m interested in it is that I believe Docker uses it by default on most Linux distros. I think it used to use block-based solutions but I’m not entirely clear why they switched.
Docker now is a wrapper around containerd and so uses the snapshot abstraction. OCI containers are defined in terms of layers that define deltas on existing layers (starting with an empty one). containerd provides caching for these layers by delegating to a snapshotting service, which can apply the deltas as an overlay layer (which it then never modifies, so avoiding all of the corner cases) or to a filesystem with CoW snapshots.
The other reason I like it is because the layers are “first class” and more amenable to shell scripting than block devices.
I’m not sure what this means. ZFS snapshots, for example, can be mounted in .zfs/{snapshot name} as read-only trees.
Basically I want to make the containers more fine-grained and composable. Each main() program should have its own lightweight container, a /bin/sh exec wrapper, and then you can compose those with shell scripts! (The continuous build is already a bunch of portable shell scripts.)
To do this really nicely, I want some of the functionality from capsh, so I can use Capsicum, not jails, and have the shell open file descriptors easily for the processes that it spawns, rather than relying on trying to shim all of this into a private view of the global namespace.
I think you’re only speaking about BSD. containerd has the notion of “storage drivers”, and “overlay2” is the default storage driver on Linux. I think it changed 3-4 years ago:
https://docs.docker.com/storage/storagedriver/select-storage-driver/
When I look at /var/lib/docker on my Ubuntu machine, it seems to confirm this – On Linux, Docker uses file-level “differential” layers, not block-level snapshots. (And all this /var/lib/docker nonsense is what I’m criticizing on the blog. Docker is “anti-Unix”. Monolithic and code-centric, not data-centric.)
So basically I want to continue what Red Hat and others are doing and continue “refactoring away” Docker, and just use OverlayFS. From my point of view they did a good job of getting that into the kernel, so now it is reasonable to rely on it. (I think there were 2 iterations of OverlayFS – the second version fixes or mitigates the problems you noted – I agree it is hard, but I also think it is solved.)
I think I wrote about it on the other thread, but I’m getting at a “remote/mobile process abstraction” with explicit data dependencies, mostly for batch processes. You need the data dependencies to be mobile. And I don’t want to introduce more concepts than necessary (according to the Perlis-Thompson principle and narrow waists), so just tarballs of files as layers, rather than block devices, seem ideal.
The blocks are dependent on a specific file system, i.e. ext3 or ext4. And also I don’t think you can do anything with an upper layer without the lower layers. With the file-level abstraction you can do that.
So it seems nicer not to introduce the constraint that all nodes have to be running the same file system – they merely all have to have OverlayFS, which is increasingly true.
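A sketch of that file-level model on a receiving node (paths invented; needs root or a user namespace): each layer travels as a plain tarball and OverlayFS composes them.

# unpack two immutable layers shipped as ordinary tarballs
mkdir -p /layers/base /layers/python /overlay/upper /overlay/work /merged
tar -C /layers/base   -xzf base.tar.gz
tar -C /layers/python -xzf python-3.10.tar.gz

# compose them; writes go to upperdir, the lower layers stay untouched
mount -t overlay overlay \
  -o lowerdir=/layers/python:/layers/base,upperdir=/overlay/upper,workdir=/overlay/work \
  /merged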
None of this is going to be built directly into Oil – it’s a layer on top. So presumably BSDs could use Docker or whatever, or maybe the remote process abstraction can be ported.
Right now I’m just solving my own problem, which is very concrete, but as mentioned this is very similar to lots of problems I’ve had.
Of course Kubernetes and dozens of other systems going back years have remote/mobile process abstractions, but none of them “won”, and they are all coupled to a whole lot of other stuff. I want something that is minimal and composable from the shell, and that basically leads into “distributed shell scripting”.
I think all these systems were not properly FACTORED in the Unix sense. They were not narrow waists and didn’t compose with shell. They have only the most basic integration with shell.
For example our CI is just 5 parallel jobs with 5 Dockerfiles now:
https://github.com/oilshell/oil/tree/master/soil
I believe most CIs are like this – dumb, racy, without data dependencies, and with hard-coded schedules. So I would like to turn it into something more fine-grained, parallel, and thus faster (but also more coarse-grained than Nix). Basically by framing it in terms of shell, you get LANGUAGE-oriented composition.
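The shell-level composition I have in mind is nothing fancier than this sketch (the job names and the per-job script are placeholders):

# each job is a script that assembles its own namespace (e.g. with bwrap) and runs its task
for job in build unit-tests spec-tests lint benchmarks; do
  ./run-job.sh "$job" &
done
wait   # block until all jobs have finished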
(And of course, as previous blog posts say, a container-based build system should be the same thing as a CI system; there shouldn’t be anything you can only run remotely.)
I looked at Capsicum many years ago but haven’t seen capsh… For better or worse Oil is stuck on the lowest common denominator of POSIX, but the remote processes can be built on top, and right now that part feels Linux-only. I wasn’t really aware that people used Docker on BSD and I don’t know anything about it … (I did use NearlyFreeSpeech and their “epochs” based on BSD jails – it’s OK but not as flexible as what I want. It’s more on the admin side than the user side.)
I think you’re only speaking about BSD. containerd has the notion of “storage drivers”, and “overlay2” is the default storage driver on Linux. I think it changed 3-4 years ago
No, I’m talking about the abstractions that containerd uses. It can use overlay filesystems to implement a snapshot abstraction. Docker tried to do this the other way around and use snapshots to implement an overlay abstraction but this doesn’t work well and so containerd inverted it. This is in the docs.
When I look at /var/lib/docker on my Ubuntu machine, it seems to confirm this – On Linux, Docker uses file level “differential” layers, not block-level snapshots
Snapshots don’t have to be at the block level, they can be at the file level. There are various snapshotters in containerd that implement the same abstraction in different ways. The key point is that each layer is a delta that is applied to one specific immutable thing below.
I’m not really sure what the rest of your post is talking about. You seem to be conflating abstractions and implementation.
OK I think you were misunderstanding what I was talking about in the original message. What I’m proposing uses OverlayFS with immutable layers. Any mutable state is outside the container and mounted in at runtime. It’s more like an executable than a container.
Adding to my own comment, if anyone has experience with Flatpak I’d be interested (since it uses bubblewrap):
https://dev.to/bearlike/flatpak-vs-snaps-vs-appimage-vs-packages-linux-packaging-formats-compared-3nhl
Apparently it is mostly for desktop apps? I don’t see why that would be since CLI apps and server apps should be strictly easier.
I think the main difference again would be the mix of layers and slices, so you have less build configuration. And also naming them as first class on the file system and dynamically mixing and matching. What I don’t like is all the “boiling the ocean” required for packaging, e.g. RPATH but also a lot of other stuff …
I have Snap on my Ubuntu desktop but I am trying to avoid it… Maybe Flatpak is better, not sure.
That sounds like there’s a generic “distro python” though, which… is not necessarily true. You could definitely want environments with both python3.10 and 3.9 installed and not conflicting at the same time.
The model I’m going for is that you’re not really “inside” a container … But each main() program uses a lightweight container for its dependencies. So I can’t really imagine any case where a single main() uses both Python 3.9 and 3.10.
If you have p39 and p310 and want to pipe them together, you pipe together two DIFFERENT containers. You don’t pipe them together inside the container. It’s more like the model of iOS or Android, and apps are identified by a single hash that is a hash of the dependencies, which are layers/slices.
BUT importantly they can share layers / “slices” underneath, so it’s not as wasteful as snap/flatpak and such.
I’ve only looked a little at snap / flatpak, but I think they are more heavyweight, it’s like you’re inside a “machine” and not just assembling namespaces. I imagine an exec wrapper that makes each script isolated:
# this script behaves exactly like p310.py and can be piped together with other scripts
exec bwrap --ro-bind foo foo --ro-bind p310.py /bin/p310.py -- /bin/p310.py "$@"
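Composing two such wrappers is then ordinary shell – each side of the pipe runs in its own lightweight namespace, but they can share layers underneath (script names hypothetical):

./p39-tool.sh < input.txt | ./p310-report.sh > report.txt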
Your idea kinda sounds like GoboLinux Runner, but I can’t tell if it’s exactly the same, since it’s been a long time since I played with GoboLinux. It’s a very interesting take on Linux, flipping the FHS on its head just like Nix or Guix, but still keeping the actual program store fully user accessible, and mostly manageable without special commands.
Ah interesting, I heard about GoboLinux >10 years ago but it looks like they made some interesting developments.
They say they are using a “custom mount table” and that is essentially what bubblewrap lets you do. You just specify a bunch of --bind flags and it makes mount() syscalls before exec-ing the program.
https://github.com/containers/bubblewrap/blob/main/demos/bubblewrap-shell.sh
I will look into it, thanks!
Thanks, I hate it.
If I recall correctly, we had to add RPATH in the CHICKEN build system for NetBSD years ago because it doesn’t have /usr/pkg/lib (the default path for packages installed through pkgsrc) in its dynamic search path. For NetBSD, this also makes sense, as these packages are optional and may include newer versions of libraries also shipped in the base system. I think it also makes building to a DESTDIR (staging area for the built binaries) possible/easier, so that the libraries are somewhere else while building than where they will be installed to.
We decided to bake in RPATH to all binaries on all platforms that support it, to ensure a consistent build with binaries that behave sanely regardless of the dynamic search path. This makes a lot of sense, and it also makes it trivial to install the entire build into your home directory, for example, without having to mess about with your dynamic linker settings or LD_LIBRARY_PATH or what have you.
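For reference, baking an rpath into a binary at link time and inspecting it afterwards looks something like this (file names and paths illustrative):

# embed /usr/pkg/lib as a run-time library search path in the binary
cc -o myprog main.o -L/usr/pkg/lib -lchicken -Wl,-rpath,/usr/pkg/lib

# confirm what got baked in
readelf -d myprog | grep -E 'RPATH|RUNPATH'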
Funny thing, fast-forward to 2022 and we get an e-mail from an Alpine package maintainer requesting we don’t add RPATH, because the rpath is in the default search path already, leading to some sort of redundancy, which the Alpine package build warns about. Unfortunately, he never got back to me on my question of why this redundancy would be a problem.