1. 58

  2. 17

    People seem to be missing the forest for the trees in this thread. The whole point of multi-user OSes was to compartmentalize processes into unique namespaces - a problem we’ve now solved again thanks to containers. The issue is that containers are a wrecking-ball solution to the problem, when maybe a sledgehammer (which removes some of our assumptions about resource allocation) would have sufficed.

    For example, running a web server: if you’re in a multi-tenant environment and you want to run multiple web servers on port 80, why not… compartmentalize that, instead of building this whole container framework?

    Honestly, I think this article raises a point that it didn’t mean to: the current shitshow that is the modern micro-service/cloud architecture landscape resulted from an overly conservative OS community. I understand the motivations for conservatism in OS communities, but we can see a clear result: process developers solving problems OSes should solve, badly. Because the developers working in userspace aren’t working from the same perspective as developers working in the OS space, they come up with the “elegant” solution of bundling subsets of the OS into their processes. The parts they need. The parts they care about. The real problem was that the OS should have been providing them the services they needed, and then the whole thing would have been solved with, like, 10% of the total RAM consumption.

    1. 3

      This is reasonable… except when you mentioned RAM consumption:

      Although containers themselves have almost no overhead, Docker is not without performance gotchas. Docker volumes have noticeably better performance than files stored in AUFS. Docker’s NAT also introduces overhead for workloads with high packet rates. These features represent a tradeoff between ease of management and performance and should be considered on a case-by-case basis.

      Run containers with host networking on the base filesystem and there is no difference. Our wrecking balls weigh the same as our sledgehammers.

      http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
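
      For what it’s worth, a minimal sketch of the kind of invocation that sidesteps both gotchas (image and path names are hypothetical):

      # host networking avoids the NAT overhead; a bind mount keeps data off the AUFS/overlay layer
      docker run --net=host -v /srv/www:/srv/www my-web-image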

      1. 6

        The problem isn’t really RAM or CPU weight, though the article uses that aspect to get its catchy title. The problem is unnecessary complexity.

        Complexity is brittle, bug-prone, security-hole-prone, and imposes a high cognitive load.

        The container/VM shitshow is massively complex compared to just doing multi-tenancy right.

        Simpler solutions are nearly always better. Next best is to find a way to hide a lot of complexity and make it irrelevant (sweep it under the rug, then staple the rug to the floor).

        1. 1

          Complexity is brittle, bug-prone, security-hole-prone, and imposes a high cognitive load.

          The container/VM shitshow is massively complex compared to just doing multi-tenancy right.

          Citation needed. This article is proof that doing “multi-tenancy right” requires additional complexity too: designing a system where unprivileged users can open ports <1024. Doing “multi-tenancy right” also requires disk, memory, and CPU quotas and accounting (cgroups), process ID namespaces (otherwise you can spy on your colleagues), user ID namespaces so you can mount an arbitrary filesystem, etc, etc, etc.

          BSD jails have had all of these “complexities” for 20 years, and yet no one gripes about them. I suspect it’s just because linux containers are new and people don’t like learning new things.

    2. 12

      If you’re on FreeBSD, you can use mac_portacl(4) to permit specific users or groups to bind to privileged ports. e.g. for a webserver you might have this in your /etc/sysctl.conf:

      net.inet.ip.portrange.reservedhigh=0
      security.mac.portacl.rules="uid:80:tcp:80,uid:80:tcp:443"
      

      pf’s also capable of enforcing rules based on a socket’s owning user and group.
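
      For example, something along these lines in pf.conf (an untested sketch, assuming the web server runs as the www user):

      # only sockets owned by www may accept connections on 80/443
      block return in proto tcp from any to any port { 80, 443 }
      pass in proto tcp from any to any port { 80, 443 } user www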

      1. 1

        Bingo, this is the sane way to solve it. You really don’t want to kill privileged ports entirely unless you’re OK with someone taking over port 22.

      2. 7

        … You’ll have to make it use folders under /home/yourname/… but eventually you’ll have your own local MySQL server running in your own local user-space.

        Containers are just namespaces. Yes, namespacing packages costs some disk space (you can make this very small with Alpine, advanced filesystem tricks, etc), but so does installing packages in /home/$USER! You don’t have to install a massive CentOS image in your container (but you can, and that’s sort of neat).

        Multi-tenancy on Linux is better now than anyone from the 1970s ever dreamed possible (systemd even has nice support for multiple independent physical display/keyboard/mouse sets!). There may be a valid gripe to be had about aesthetics, but not about resource overhead, or really even fundamental complexity. Even privileged ports can be shared with inetd, sudo, iptables, or linux capabilities.

        1. 3

          You don’t have to install a massive CentOS image in your container

          I want to highlight this because it bears repeating.

          It’s absolutely possible to create a statically linked binary and place that - and nothing more - in a Docker container.

          Traefik, a golang load balancer, does this - the container has the binary and a trusted list of CAs. As a result, the whole container clocks in at 12 MB.
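
          The pattern looks roughly like this (hypothetical names, not Traefik’s actual Dockerfile):

          FROM scratch
          # statically linked binary, e.g. built with CGO_ENABLED=0 go build
          COPY myserver /myserver
          # CA bundle so outbound TLS verification still works
          COPY ca-certificates.crt /etc/ssl/certs/ca-certificates.crt
          ENTRYPOINT ["/myserver"]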

        2. 6

          I’m an admin on a public shell server where we basically do this. As part of creating a new user, we give them a home space (shell-server.example/~user and user.shell-server.example), and create MySQL credentials for them. Then you’re not stuck with ~6k instances of MySQL or what have you. With some extra work, we could support proxying that site to a Unix socket in their directory (e.g. to support a web app). It’s worked well enough for a couple years now, and is essentially performing the same functions as the SparcStation in the article.

          1. 5

            I’m an admin on a public shell server where we basically do this

            That’s cool! What do people use it for? Do you rely on goodwill, or did you do fancy things with the security setup? Is it written up anywhere?

            1. 1

              A combination of an application process, a variety of shell/Perl scripts, and some locking down of the operating system. Some of the admins gave a talk at vBSDCon in 2016 that might be interesting, but it’s not really written up anywhere.

          2. 6

            Erm, so I disable priv ports. I start a web server on port 80. Little Timmy comes along and starts a web server on port 80. What happens now?

            1. 3

              Timmy’s call to bind() fails, because the port is already in use by you.

              1. 4

                Then how is this actually useful for running multiple web servers on the same box? Wouldn’t it end up in a free-for-all, with the first user who starts up their WordPress install getting port 80, while the rest have to contend with another, non-standard port?

                1. 12

                  What *nix really needs is the ability to assign users ownership of IP addresses. With IPv6 you could assign every machine a /96 and then map all UIDs onto IP space.

                  This is probably a better idea than even getting rid of privileged ports. You can bind to a privileged port if you have rw access to the IP.

                  The real issue here is that Unix has no permissions scheme for IPs the way it does for files, etc.
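
                  As a rough sketch of the mapping (addresses hypothetical): give the machine 2001:db8::/96, let the low 32 bits carry the UID, so uid 1000 (0x3e8) gets 2001:db8::3e8. You can add that address by hand today, but nothing ties it to the user:

                  ip -6 addr add 2001:db8::3e8/128 dev eth0
                  # no rwx-style ownership exists; any root process can still bind it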

                  1. 5

                    It’s not very much code to write a simple daemon that watches a directory of UNIX sockets, binds to the port of the same name as each socket, and forwards all traffic. Like UNIX-programming-101-homework easy. One can certainly argue it’s a hack, but it’s possible, and it’s been possible for 20 years if that’s what people wanted. No kernel changes required.

                    I think there’s a corollary to “necessity is the mother of invention”: if it hasn’t been invented, it’s not necessary. To oversimplify a bit.
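
                    The forwarding half really is that small; e.g. with socat standing in for the hand-written daemon (one listener per socket, with the directory-watching loop and socket names left as an exercise):

                    socat TCP-LISTEN:80,fork,reuseaddr UNIX-CONNECT:/var/run/user-sockets/80.sock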

                2. 2

                  Sounds like Timmy needs a VM, so now I’m unclear on exactly how we’ve solved the energy crisis.

                  1. [Comment removed by author]

                    1. 2

                      Well, what happens when I grab 10.0.0.2 too? And .3 and .4?

                      There needs to be an address broker at some level, and I’m not convinced it’s impossible for that broker to be nginx.conf proxying a dozen different IPs to a dozen different unix sockets. There’s a fairly obvious solution to the problem that doesn’t involve redesigning everything.
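
                      A sketch of what that broker could look like, one server block per tenant (names and paths hypothetical):

                      server {
                          listen 10.0.0.2:80;
                          server_name timmy.example.com;
                          location / {
                              # tenant-owned backend listening on a socket in their home directory
                              proxy_pass http://unix:/home/timmy/run/web.sock:/;
                          }
                      }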

                      So why then does AWS offer VMs instead of jamming a hundred users onto a single Linux image? Well, what if I want to run FreeBSD? VM offers a nice abstraction to allow me run a different operating system entirely. Now maybe this is an argument for exokernels and rump kernels and so forth, but I didn’t really see that being proposed.

                      1. [Comment removed by author]

                        1. 6

                          OK, sorry, didn’t mean to be argumentative. But it’s a really long article, so I could only keep some of it in my head, and it got a lot of upvotes, so I’m trying to mine out what the insights are. But don’t feel personally obligated to explain. :)

                          There seemed to be a metapoint that things are inefficient because we’re using some old design from another era and it’s obsolete. But I didn’t see much discussion of why we can’t keep the design we have and use the tools we have in a slightly better way. Like nginx.conf to multiplex. Shared web hosting used to be a thing, right?

                          1. 4

                            I feel the metapoint was the opposite. The author wanted to go back to the old way things were done, but simply allow users to have their own IP address in the same way they have their own home directory.

                            You can already add many IP addresses to a single machine in BSD and Linux. In Linux (I don’t know about BSD), you can even create virtual sub-interfaces that have their own info but reside on the same physical interface. The author wanted unix permissions on interfaces too, rwx = read write bind. So your hypothetical user Timmy would have /home/timmy and eth0:timmy, with rwx on /home/timmy and r-x on eth0:timmy. They would be able to read their IP, MAC, etc, and bind to it, but not change it.
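
                            The alias half already exists; it’s only the permission bits that are missing (addresses hypothetical):

                            ip addr add 192.0.2.42/24 dev eth0 label eth0:timmy
                            # there is no chown/chmod equivalent for eth0:timmy, which is exactly the gap being described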

                            1. 2

                              Shared web hosting used to be a thing. I think people have realised that hosting a website means running code, one way or another, and traditional unix was never really suited to the idea that there would be multiple people organizing code on the same machine: multiple users yes, but unix is very much single-administrator.

                              More concretely, library packaging/versioning sucks: it’s astonishingly difficult to simply have multiple versions of a shared library installed and have different executables use the versions they specify. Very few (OS-native) packaging systems support installing a package per-user at all. Even something like running your website under a specific version of python is hard on shared hosting. And OS-level packaging really hasn’t caught up with the Cambrian explosion of ways to do data storage: people have realised that traditional square-tables-and-SQL has a lot of deficiencies, but right now that translates into everyone and their dog writing their own storage engine. No doubt it will shake out and consolidate eventually, but for now an account on the system MySQL doesn’t cut it, and the system has no mechanism in place for offering the user the persistence service of the week.

                              Personal view: traditional unix shared too much - when resources were very tight and security not very important it made sense to optimize for efficiency over isolation, but now the opposite is true. I see unikernels on a hypervisor as, in many ways, processes-on-a-shared-OS done right, and something like Qubes - isolation by default, sharing and communication only when explicitly asked for, and legacy compatibility via VMs - as the way forward.

                              1. 1

                                Isn’t this exactly the problem solved by virtualenv and such? I’ve never found it especially difficult to install my own software. There was a big nullprogram post about doing exactly this recently.

                                There are some challenges for sure, but I get the sense that people just threw their hands in the air, decided to docker everything, and allowed the situation to decay.

                                1. 1

                                  virtualenv has never worked great: a lot of Python libraries are bindings to system C libraries and depend on those being installed at the correct version. And there’s a bunch of minor package-specific fiddling because running in virtualenv is slightly different from running on native python.

                                  People reached for the sledgehammer of docker because it solved their problem, because fundamentally its UX is a lot nicer than virtualenv’s. Inefficient but reliable beats hand-tuning.

                              2. [Comment removed by author]

                                1. 1

                                  You can’t quite use namespaces that way. Net namespaces are attached to a process group, not a user. But doing something like I described would truly assign one IP address to a user. That user would have that IP address always. They would ssh to it, everything they started would bind to it by default, and so on. It would be their home IP in the same way their home directory is theirs.

                                  1. 1

                                    Docker is also mentioned as bloated because there is an image for each container.

                                    Container and layer sprawl can be real. I can’t deny that :)

                                    But you have two options to mitigate that:

                                    1. Build your Dockerfile FROM scratch and copy in static binaries. If you’re doing C or Go, this works very well.

                                    2. Pick a common root - Alpine Linux (FROM alpine) is popular since it is fairly small. Once that is fetched, any container that references it will reuse it - so your twenty containers will not all go download the same Linux system.

                          2. 1

                            They have different IP addresses. There must be some way to use multiple addresses on the same Linux install, and if there isn’t, it would be easy to add.

                        2. 2

                          From the article: network service multi-tenancy. What does that even mean? Good question. I think that in his ideal world we’d be using network namespaces and we’d assign more IPs per machine.

                          Honestly it sounds like despite his concerns about container overhead, his proposal is basically to use containers/namespaces. Not sure why he thinks they are “bloated”.

                          1. 3

                            A few numbers would certainly make the overhead argument more concrete. Every VM has its own kernel and init and libc. So that’s a few dozen megabytes? But a drop in the bucket compared to the hundreds of megabytes used by my clojure web app. So if I’m provisioning my super server with lots of user accounts, I can get away with giving each user 960MB instead of 1024MB like I would for a VM? Is that roughly the kind of savings we’re talking about?

                        3. 5

                          Linux has network namespaces and capabilities, both of which could move the author a bit closer to what he’s after.

                          However, in reality the Linux kernel is probably not something you’d want to rely upon for security between users who can execute code. That’s why folks move toward hypervisors – not that they’re a panacea (hypervisor attacks aren’t uncommon), but they at least have a somewhat smaller TCB.

                          1. 5

                            I really like that they break down some of the reasons why someone would want to use containers rather than virtual machines. Namely that virtual machines require a minimum amount of resources to run simply because you have an entire OS to run along with an application, vs. a container, which requires far fewer resources on the same hardware. When you are able to run the same services with fewer system resources, you also use less electricity, save money for your company, and ultimately help slow down climate change. It’s an interesting point that I hadn’t considered before.

                            1. 3

                              Among other issues with this article, I don’t think it has the history of virtualization and containerization correct. The proliferation of websites on hosting providers happened well before virtualization and containerization started to be a thing, and providers tended to have reasonably well baked solutions to manage the whole complex result (and generally still do today).

                              I believe that it is also the case that general use of isolation is driven in part by customer demand, not hosting provider convenience. People like having an isolated system with exactly the things that they want in it, running their preferred version of this, that, and the other. They are not as enthused about sharing a host with a fixed OS, a fixed distribution, often fixed package versions, and so on.

                              1. 3

                                If you really want to do vastly multi-tenant web sites on one box, then having unprivileged users be able to bind low ports like 80 and 443 isn’t enough by itself; they all need to either:

                                • have their own network address, on which they’re the only person who can bind port 80. Doable with IPv6, not a great story in IPv4 land.
                                • or they need to share access to ports 80 and 443 somehow, perhaps with a multiplexing HTTP(S) reverse proxy.

                                In the latter case, if the HTTP protocol came with a guarantee that clients would never send requests belonging to two different origins via the same TCP connection then you could hypothetically do something clever like have the reverse proxy actually hand over the file descriptor for the TCP connection so it doesn’t have to keep round-tripping bytes to another userland process belonging to the site’s owner. Alas HTTP doesn’t make that guarantee, so you can’t.

                                Maybe it’s also a shame that the convention for DNS is to look up an address via an A or AAAA record and then connect to a fixed port number like 22, 80 or 443 on the target; if we had standardised on something more like SRV records with both an address and a port number to connect to being sent back then it’d have been easier to multiplex up to thousands of HTTP serving daemons on one host.
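
                                As a sketch of that alternate universe (names and port hypothetical), a zone would advertise the port alongside the target and clients would stop assuming 443:

                                _https._tcp.alice.example.com. 3600 IN SRV 10 5 8443 shared-host.example.com.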

                                1. 1

                                  Resolving symbolic names as (host, port) pairs would recognize that services, not just hosts, have locations and can migrate.

                                2. 2

                                  FYI, on Linux you can use setcap to grant specific binaries the ability to bind to low-numbered ports. Gone are the days of needing to run daemons as root. I once saw a team use iptables port forwarding to get around this, and it actually used a huge amount of CPU.
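
                                  For reference, the incantation is a one-off per binary (path hypothetical):

                                  sudo setcap 'cap_net_bind_service=+ep' /usr/local/bin/mydaemon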

                                  1. 1

                                    This solution sucks though when the binary you need to permit is java or python or something anyone could write code with. The FreeBSD MAC solution is better because you’re permitting a user.

                                  2. 1

                                    One option on OpenBSD is using a relayd config for forwarding.

                                    1. 1

                                      I mean, isn’t the actual problem UID 0?

                                      1. 1

                                        I would also love an ecosystem where the resources you use and pay for closely track the resources your app really inevitably uses, rather than the minimum commitment for a lot of stuff being a certain size VM 24 hours a day. Seems like it could make a big difference for super tiny apps especially. At the time Google App Engine came out, I hoped it’d make that the norm; instead, it seems to help, but for relatively few apps. Here are a few things going on that I think affect that:

                                        1. The implementation of multitenancy is the hard part, more than the design. You need security such that you’re perfectly happy to have evil hackers making syscalls to the same kernel as you, and current popular OSes aren’t ready. You need good resource limiting so everyone on the machine can run their app w/acceptable performance, and resource accounting so you can bill folks accurately. VM hypervisors are close enough to achieving all that for AWS to get by, but popular OSes’ user isolation is not.

                                        2. VM packing has already been made to work surprisingly well. Even just medium-sized apps can already measure a lot of their needs in VM-hours (or smaller units of time on GCE), and can use auto-scaling as a packing mechanism. AWS spot instances and GCE preemptible instances help reduce the amount of capacity for VMs that goes unused. CPU overcommit like in Amazon’s t2 instance family helps pack in VMs for small apps. And a lot of servers are used by huge services that are very good at packing work onto machines. (Those services include e.g. AWS/Google Cloud services your cloud app might depend on.)

                                          A common limiter of apps’ CPU utilization seems to be that some apps are bound by RAM needs, or by storage (capacity or IOPS). But seems like RAM and storage costs have fallen faster than cost of CPU cycles lately, so a common solution seems to be just to have a lot of RAM/disk per core. Also, cloud providers love elastic storage services that make it easy to vary the ratio of disk to other resources.

                                        3. Sub-VM-level packing is happening, to an extent. AWS Lambda, GCE’s Cloud Functions, and App Engine all represent this. A big part of the secret sauce seems to be an app runtime that’s sealed off well enough that packing in different tenants’ workloads on the same box is a real possibility. I see a lot of flack directed at these services from folks who see them as just a way the youngsters can avoid thinking they have to manage servers, but seems like it works well for some folks.

                                        Finally, on the social cost alluded to in the title, keep in mind we have collectively gotten great at shuffling around a lot of bits with a kilowatt-hour of power. If we have RAM and CPUs sitting around idle because we’re using VMs, well, boxes with idle CPUs also aren’t using nearly their max power. It takes a lot of idleness to add up to the social cost of a typical airplane flight(!) or gallon of gasoline used. So I don’t find the assertion in the title especially persuasive, even as an exaggeration to get a point across. (Reminds me of a post a while back suggesting Facebook’s use of PHP was socially irresponsible because of CPU cycles used. Increasing the team size to switch their entire Web app to C++ might have some costs too!) But it is still kind of interesting to explore why that ideal of packing in a bunch of tenants under one kernel and billing everything in an extremely granular way hasn’t come about.

                                        To the side, there are limiters to how much cost savings we customers get even if we fully solve the problem of packing apps onto hardware. Small VMs are pretty cheap already, so whatever we figure out to make them cheaper is probably not going to be life-changing. Providers still have to provision for peak loads (e.g. an availability zone/region getting a spike in load when another fails, or daytime rather than night), and they’ll charge somewhere to pay for those resources; I wouldn’t call that an inefficiency (those resources are needed eventually) but it can feel like one. And since the cloud services industry is so capital-intensive (in physical infrastructure and tech development), we still potentially pay the kind of markups that can happen when you have a few oligopolists competing for your biz.

                                        1. 2

                                          I actually think AWS Lambda is really cool. It’s a great way to build something with variable load. Mostly I reserve my snickering for the people who think a $10,000/month cluster is the only alternative way to handle a few million requests.

                                          The free tier in particular is cool. Because of the efficiency of hosting lots of idle services, they can afford to let you pay nothing. They’re obviously overcommitted, but hopefully with enough services it averages out OK. If I didn’t already have a paid-for server, which makes the incidental cost of anything new zero, I’d seriously consider Lambda.

                                          (For their part, Facebook has gone to great lengths to make their php code run much faster than your average php interpreter.)

                                          1. 2

                                            Yep. I started an App Engine app before I got a server, and it’s still running for free. Love that free tiers exist, and wouldn’t surprise me if the companies recouped the cost in additional paid apps created.

                                            Also agree about Facebook; HHVM and stuff look super smart. Just meant to reference the PHP-shaming post for how this claim was reminiscent of it, not to endorse its premise.

                                        2. 1

                                          One person on another forum just set up iptables port forwarding from 80 to a non-privileged port. Sounds straightforward for that part of the problem. What do the rest of you do to efficiently address the issues he writes about?
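
                                          Presumably something like this, with 8080 standing in for whatever unprivileged port the server actually listens on:

                                          iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080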

                                            1. 2

                                              I was introduced to authbind a few years ago to solve this, and to avoid having to write the usual boilerplate privilege-dance code yet again.
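
                                              The setup is roughly as follows (user and binary names hypothetical): authbind grants the bind based on an executable per-port file, so the daemon never runs as root.

                                              # as root: allow the www user to bind port 80
                                              touch /etc/authbind/byport/80
                                              chown www /etc/authbind/byport/80
                                              chmod 500 /etc/authbind/byport/80
                                              # then run the daemon unprivileged under authbind
                                              sudo -u www authbind --deep ./mydaemon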