1. 27

  2. 10

    A lot of these problems are already solved by Plan9 and other related operating systems that allow transparent network access and data processing.

    1. 5

      Not really. Plan 9 is distributed, but not in the sense of high availability or scalability that the author of this article is talking about.

      The tools are there, and there’s some neat research around it, like Eric Van Hensbergen’s push shell (https://citeseerx.ist.psu.edu/viewdoc/download?doi=), but the system, out of the box, doesn’t solve these problems.

      1. 3

        I’d say network access is never transparent. You can’t abstract away performance. In a practical situation (not just “look, it’s possible!”) you need to manage the network explicitly to have control over how you deal with latency, congestion, and partitioning.
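        As a concrete illustration of the “manage the network explicitly” point (a minimal Python sketch, not from the thread): a transparent file-like read gives you no handle on latency, whereas explicit socket control lets the caller bound it and decide what to do on failure:

```python
import socket

# A connected pair of sockets stands in for a network link.
a, b = socket.socketpair()

# Explicit network management: bound how long we are willing to wait.
b.settimeout(0.1)  # 100 ms latency budget
try:
    data = b.recv(1024)  # nothing was sent, so this times out
except socket.timeout:
    data = None  # caller decides: retry, degrade, or fail over

print(data)  # -> None
a.close()
b.close()
```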

        1. 2

          The plan9 system of treating the network like a filesystem does not exclude granular control over the network.

          Nor does a consistent way of accessing foreign network resources (open/close, read/write) exclude performance

          1. 1

            If you’re thinking about performant implementations of open/close and read/write, you’ve already lost performance. Filesystem-like semantics lock you into reliable in-order delivery.

            There are also different priorities for file format design and protocol design. The everything-is-a-file approach conflates the two.

          2. 1

            Well, you can make it “transparent” by treating all access like network access. This would make some local calls more troublesome, but it would make it transparent whether a call is local or remote.
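            That idea can be sketched in a few lines of Python (everything here is illustrative, not an existing API): give local and remote resources the same interface, and the call site no longer cares which one it has:

```python
class LocalFile:
    def __init__(self, data: bytes):
        self._data = data
    def read(self) -> bytes:
        return self._data

class RemoteFile:
    """Same interface, but each read would be a network round trip."""
    def __init__(self, host: str, path: str):
        self.host, self.path = host, path
    def read(self) -> bytes:
        # A real implementation would speak 9P/NFS/HTTP here; we fake it.
        return f"<contents of {self.path} on {self.host}>".encode()

def checksum(f) -> int:
    # The call site is transparent: it works on anything with .read().
    return sum(f.read())

local = LocalFile(b"abc")
remote = RemoteFile("fs.example.com", "/etc/motd")
print(checksum(local))   # -> 294; same code path for both
print(checksum(remote))
```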

        2. 5

          I’ve been thinking about something similar for a while now. Working on it, slowly, very slowly; maybe two decades will pass and it’ll still be vapourware:

          • There is a single ‘blessed’ application runtime for userspace. It is managed and safe. (In the tradition of java, javascript, lua, c#, etc.) This is necessary for some of the later points.

            • As gary bernhardt points out, this can be as fast as or even faster than running native code directly.

            • Since everything is running in ring 0, not only are ‘syscalls’ free, but so is task switching.

              • There is thus no reason to use coroutines over ‘real’ threads.
          • All objects are opaque. That is:

            • All objects are transparently synced to disc.

              • Thus, the ‘database’ doesn’t need to be anything special, and you can form queries directly in code. (This doesn’t necessarily scale as well, but it can be an option, and you can fall back to a DSL for complex queries if necessary.)
            • All objects may transparently be shared between threads (modulo permissioning; see below)

            • All objects may transparently originate in a remote host (in which case changes are not necessarily synced to disc, but are synced to the remote; a la nfs)

              • Network-shared objects can also be transparently upgraded to be shared via a distributed consensus system, like raft.
          • Instead of a file system, there is a ‘root’ object, shared between all processes. Its form is arbitrary.

          • Every ‘thread’ runs in a security domain which is, more or less, a set of permissions (read/write/execute) for every object.

            • A thread can shed permissions at will, and it can spawn a new thread which has fewer permissions than itself, but never gain permissions. There is no setuid escape hatch.

            • However, a thread with few permissions can still send messages to a thread with more permissions.

          • All threads are freely introspectable. They are just objects, associated with a procdir-like object in which all their unshared data reside.
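          A toy Python model of the monotone permission rules above (all names hypothetical): a domain can shed rights or spawn a strictly-weaker child, but there is deliberately no way to gain a right back:

```python
class Domain:
    """A security domain: a set of (object, right) pairs, e.g. ('root', 'r')."""
    def __init__(self, perms: frozenset):
        self._perms = frozenset(perms)

    def allowed(self, obj: str, right: str) -> bool:
        return (obj, right) in self._perms

    def shed(self, obj: str, right: str) -> "Domain":
        # Dropping a permission is always allowed.
        return Domain(self._perms - {(obj, right)})

    def spawn(self, requested: frozenset) -> "Domain":
        # A child may hold at most what the parent holds.
        if not requested <= self._perms:
            raise PermissionError("child cannot exceed parent")
        return Domain(requested)
    # Note: no grant()/setuid() method exists, by design.

parent = Domain(frozenset({("root", "r"), ("root", "w")}))
child = parent.spawn(frozenset({("root", "r")}))
print(child.allowed("root", "w"))                    # -> False
print(parent.shed("root", "w").allowed("root", "w"))  # -> False
```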

          1. 3

            psst (since I’m the guy who has to point these out to everyone each time):

            • IBM i has a lot of these (object persistence, capabilities, high-level runtime only; older systems didn’t even have unprivileged mode on the CPU), but not all.

            • Domain has network paging (since that’s how it does storage architecture), but not most of the others.

            • Phantom does persistence for basically pausable/restartable computation. Weird, but interestingly adjacent.

            I need to write a blog post about this!

            1. 2

              Interesting! Never encountered that.

              Wiki says it’s proprietary and only works with ppc. Is there any way to play with it without shelling out impressive amounts of $$$ to IBM?

              1. 3

                If you want your own hardware, you can buy a used IBM Power server for on the order of a few hundred dollars, and installation media for a trial is available direct from IBM. While that’ll only work for 70 days before you need to reinstall, backup and restore procedures are fairly straightforward.

                If you don’t care about owning the hardware, there’s a public server with free access at https://pub400.com/.

                Whichever route you take, you’ll probably want to join ##ibmi on Freenode because you’ll have a lot of questions as you’re getting started.

                1. 2

                  Is there a particular Power model you recommend? The Talos stuff is way too pricey.

                  1. 2

                    If you want it to run IBM i, you’re going to need to read a lot of documentation to figure out what to buy, because it’s all proprietary and licensed, and IBM has exactly 0 interest in officially licensing stuff for hobbyists. It also requires special firmware support, and will therefore not run on a Raptor system.

                    I think the current advice is to aim for a Power 5, 6, or 7 server, because they have a good balance of cost, not needing a ton of specialized stuff to configure, and having licenses fixed to the server. (With older machines, you really want to have a 5250 terminal, which would need to be connected using IBM-proprietary twinax cabling. Newer machines have moved to a model where you rent capacity from IBM on your own hardware.)

                    I’d browse eBay for “IBM power server” and look up the specs and license entitlements for each server you see. Given a serial number, you can look up the license entitlements on IBM’s capacity on demand website. For example, my server is an 8233-E8B with serial number 062F6AP. Plugging that into IBM’s website, you see that I have a POD code and a VET code. You can cross-reference those codes with this website to see that I have entitlements for 24 cores and PowerVM Enterprise. (There are only 18 cores in my server, so in theory I could add another processor card for another 6. I’m given to understand that this is risky and may involve contacting IBM sales to get your system working again.)

                    You really want something with a PowerVM entitlement, because otherwise you need special IBM disks that are formatted with 520-byte sectors and support the SCSI skip read and skip write commands. You will also need to cross reference your system with the IBM i system map to see what OS versions you can run.

                    Plan to be watching eBay for a while; while you can find decent machines for €300-500, it’s going to take some time for one to show up.

                    Also, I’m still relatively new to this whole field; it’s a very good idea to join ##ibmi on freenode to sanity check any hardware you’re considering buying.

                2. 1

                  There’s no emulator, and I’m not holding my breath for one any time soon.

                  Domain is emulated by MAME, and Phantom runs in most virtualization software though.

                3. 2

                  Hey Calvin, please write a blog post about this.

                  1. 1

                    Please do

                  2. 3

                    I’ve been working on this but with WebAssembly.

                    1. 1

                      I am curious. Is there source code available?

                      1. 1

                        It’s still in the planning phase, sadly. I only have so much time given it’s one of my many side projects.

                    2. 2

                      You might be interested in this research OS, KeyKOS: http://cap-lore.com/CapTheory/upenn/

                      It has some of what you’re describing: the transparent persistence, and the fine-grained permissions. I think they tried to make IPC cheap. But it still used virtual memory to isolate processes.

                      I think it also had sort of… permissions for CPU time. One type of resource/capability that a process holds is a ticket that entitles it to run for some amount of time (or maybe some number of CPU cycles?). I didn’t really understand that part.
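                      If I understand the KeyKOS idea correctly, the CPU-time capability works roughly like a prepaid budget the scheduler draws down. A toy sketch in Python (all names hypothetical, not KeyKOS’s actual terminology or API):

```python
class Meter:
    """A capability entitling its holder to a budget of CPU cycles."""
    def __init__(self, cycles: int):
        self.cycles = cycles

    def charge(self, used: int) -> bool:
        """Scheduler calls this after each quantum; False means 'preempt'."""
        self.cycles -= used
        return self.cycles > 0

def run(task_steps: int, meter: Meter, quantum: int = 10) -> int:
    """Run until the task finishes or its meter is exhausted."""
    done = 0
    while done < task_steps:
        step = min(quantum, task_steps - done)
        done += step
        if not meter.charge(step):
            break  # out of budget: the process is suspended
    return done

print(run(100, Meter(35)))  # -> 40: the budget runs out partway through
```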

                      1. 3

                        Looks interesting. (And, one of its descendants was still alive in 2013.) But, I think anything depending on virtual memory to do permissioning is bound to fail in this regard.

                        The problem is that IPC can’t just be cheap; it needs to be free.

                        Writing text to a file should be the same kind of expensive as assigning a value to a variable. Calling a function should be the same kind of expensive as creating a process. (Cache miss, maybe. Mispredict, maybe. Interrupt, full TLB flush, and context switch? No way.)

                        Otherwise, you end up in an in-between state where you’re discouraged from taking full advantage of (possibly networked) IPC, because even if it’s cheap, it’ll never be as cheap as a direct function call. By hiding the distinction (and relying on the OS to smooth it over), you get a more unified interface.

                        One thing I will allow about VM-based security is that it’s much easier to get right. Just look at the exploit lists for the recent chrome/firefox js engines. Languages like k can be fast when interpreted without a JIT, but such languages aren’t widely popular. Still working on an answer to that. (Perhaps formal verification, a la compcert.)

                        CPU time permissions are an interesting idea, and one to which I haven’t given very much thought. Nominally, you don’t need time permissions as long as you have preemptive multitasking and can renice naughty processes. But there are other concerns like power usage and device lifetime.

                        1. 1

                          I’ve been imagining a system that’s beautiful. It’s a Smalltalk with files, not images, and a very simple model: everything is IPC. If you’re on a Linux with network sockets, a socket call is like every other method call, every addition, every syscall.

                          Let’s talk. I like your ideas, and think you might like this system in my mind.

                          1. 3

                            These sound great until you try to implement any of it, at which point you realise that now every single call might fail, simply never return, return twice, return to somebody else, or destroy your process entirely.

                            Not saying it can’t be done, just saying it almost certainly won’t resemble procedural, OO, or functional programming as we know it.

                            Edit: Dave Ackley is looking into this future, and his vision is about as far from what we do now as I’d expect: https://www.youtube.com/user/DaveAckley

                            1. 1

                              You might want to read up on distributed objects from NeXT in the early 90s.

                        2. 2

                          Sounds an awful lot like Microsoft Midori. It doesn’t mention transparent object persistence, but much of what you mentioned is there.

                          1. 1

                            This doesn’t solve all of the problems brought up in TFA. The main one is scheduling/autoscale. It is certainly easier—for instance, you can send a function object directly as a message to a thread running on a remote host—but you still have to create some sort of deployment system.

                            1. 1

                              (sorry, replied to wrong comment)

                          2. 4

                            Sounds like most of this is provided as part of https://www.resinstack.io/

                            1. 1

                              Sounds interesting, as it’s still based on Linux and OCI containers.

                              But if I’m understanding it correctly, Resin wouldn’t support features such as resource quotas for containers without Nomad Enterprise, because quotas aren’t supported in the free version?

                              1. 2

                                Sounds interesting, as it’s still based on Linux and OCI containers.

                                Nomad has a really nice plugin interface. So while it might be out of scope for this project, I wonder if a similar approach could be used to build something like this on other platforms; pot for FreeBSD already exists. I could imagine extending the exec driver to also work with a capsicumizer-style approach. I’ve recently been playing around with sandboxing using OpenBSD’s pledge and unveil from Go (it’s really straightforward), and I can imagine something where Nomad (the plugin, that is) decides what a forked process is allowed to do. On top of that, bhyve could be used (Docker uses xhyve on macOS anyway), or vmm, or other hypervisors.

                                Also, I think Nomad’s plugin architecture would be really interesting for running all sorts of microkernels, Solaris zones, maybe CloudABI, or more experimental approaches.

                                So many ideas, so little time.

                                1. 2

                                  resource quotas for individual containers are available in the free version, just not at the namespace level

                              2. 2

                                While I find the larger perspective very compelling (the idea that a cluster is one computer), I think the actual suggestions of “roles” that need to be filled (e.g. a cluster-wide file system) are too specific: low-level ideas from the “old world”. These are ideas from single computers that we’ve tried to scale up, but they aren’t actually that great. Instead we want to look at more general ideas and high-level needs, e.g. “persistence and messaging with varying levels of consistency”. We should think of as few systems as possible that can provide this; otherwise we’re still stuck with the eternal sync and consistency problems. The same goes for “processes”: at what level would we like to describe them? Why break them up into single-machine-sized chunks upfront? Is it possible to define end-to-end processes and separately decompose and map them onto the machines?

                                1. 2

                                  Hm interesting, I have some thoughts on the three questions at the end too:

                                  How do we get the power of bash scripting, but for a cluster of 20k machines?

                                  This is what Oil is supposed to be. It hasn’t gotten there yet, unfortunately. I felt the single-machine shell was in such poor shape that it couldn’t be built upon for a distributed OS. But I don’t think the two are that far apart, since Unix processes are naturally a shared-nothing system, like a distributed system. The main difference is that in a distributed system the links are unreliable, so you need idempotence, etc.
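                                  The idempotence point can be made concrete with a tiny Python sketch (hypothetical API, just for illustration): if a request may be retried after a lost reply, the server has to deduplicate by request ID so that replays are harmless:

```python
class Server:
    """Deduplicates by request ID, making retried requests idempotent."""
    def __init__(self):
        self.balance = 0
        self.seen = {}  # request_id -> cached reply

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self.seen:       # a replay: return the cached reply,
            return self.seen[request_id]  # do NOT apply the effect twice
        self.balance += amount
        self.seen[request_id] = self.balance
        return self.balance

s = Server()
s.deposit("req-1", 50)
# The reply was lost on the unreliable link, so the client retries:
print(s.deposit("req-1", 50))  # -> 50, not 100
```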

                                  From over 3.5 years ago: Project Goals and Related Projects

                                  The real motivation for oil is that, from 2010 to 2015, I tried to write a distributed operating system / cluster manager based on my experience with Google’s Borg. The project had a few users, but ultimately didn’t make it very far. Kubernetes has since done better.

                                  It kinda sucks Oil hasn’t made it that far since then, but it is inherently a lot of work, and I think building on flawed foundations would have also failed.

                                  Security models for a DatacenterOS

                                  Although I said Kubernetes has done better than my project, this is the one place where (last I checked) Kubernetes kinda sucks. I’d claim it isn’t a real distributed OS because its auth model often piggybacks on the lower layers. I’d be interested in any arguments otherwise as I’m not completely up to date.

                                  All the YAML in Kubernetes is annoying too, which is why Oil also has / will have config language features. I saw this yesterday and it’s a real problem:


                                  Thinking of writing a talk called “I’m sorry about the YAML” where I talk about how we got here and where it is all going.

                                  Basically all the YAML stuff is a lot worse than what we used at Google. A distributed OS needs a real syntactic foundation (and Oil is supposed to be that). It’s not a small thing.
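                                  One concrete way the text-templated-config problem shows up (illustrative Python; json is used as a stand-in serializer here, since JSON is a subset of YAML): splicing a value into config text breaks as soon as the value needs quoting, while building the structure and serializing it does not:

```python
import json

replicas = "3 # TODO bump"  # an innocent-looking user-supplied value

# Text templating: the value is spliced in blind, so in YAML the
# trailing "# TODO bump" would silently become a comment and the
# string value would be corrupted into the integer 3.
templated = f"replicas: {replicas}\n"
print(templated)

# Structured config: build data, then serialize. Quoting is handled
# for us and the value survives a round trip intact.
structured = json.dumps({"replicas": replicas})
assert json.loads(structured)["replicas"] == replicas
print(structured)
```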

                                  1. 1

                                    I think the article overstates things slightly. For example, HPC architectures have had most of these components for decades – they’re just less accessible to the public than open infrastructure.

                                    That said, one could have a long career taking technologies that are out there at Google, inside HPC systems, etc., and polishing and selling them to general IT organizations.

                                    1. 1

                                      They missed the global level (not cluster level) data storage system.

                                      Many Google properties store their bytes not in Colossus directly, but through an internal version of Google Cloud Storage.