Kubernetes and security don’t belong in the same sentence. That entire cottage industry is a marketing department larping as a software engineer. While I totally agree your average company is consuming too much software to scale to the traditional unix security model of users/groups/etc., k8s comes in and murders the VM isolation that has existed for the past 20 years. Own a single node in k8s and you now probably have access to all infrastructure - not just the instance you popped. Lateral movement becomes orders of magnitude easier. Linux has always had major security issues, but k8s really rubs me the wrong way and makes the 9x malware scene look like a toy shop.
IMO container isolation is generally “good enough” for most purposes, especially with recent efforts to avoid running as root.
Despite that, though… k8s can use multiple compatible container runtimes, including Firecracker, which uses VMs, or gVisor, which uses a userland kernel implementation with a heavy focus on sandboxing.
The whole problem with containers is that they were never designed from a security standpoint. They were designed so you could compile an application with a correct build environment. Then the marketing people started ramping up the false “security” claims, and so a lot of people have the very false notion that containers provide some sort of security on their own. They don’t. Kubernetes takes this idea and expands it across an infrastructure fleet, compounding the problem considerably.
Containers are an inherently broken construct and cannot be “fixed” with the addition of so-called devsecops software. All the static and dynamic analysis in the world doesn’t fix it.
Firecracker provides some value although it isn’t appropriate for all workloads. As for gVisor - it’s basically a non-starter unless you have your own hardware or are using nested virtualization or things like nitro.
Great post, very cool project.
It’s crazy how ideas are just floating around.
For the past few months I’ve been thinking about what I call “single-server web apps” and “single-binary web apps”.
What I’m interested in is simplifying the deployment of web apps on services like Linode or Digital Ocean by just uploading and running a single binary that contains everything needed, including the DB and scheduled tasks. I was thinking about using Go with embedded HTML, CSS and images and SQLite as the DB. One could also use something like Litestream to make sure the DB is safe in the event of a major server failure, but that would require a “second server/binary”.
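To make that concrete, here’s a rough sketch of the shape such a binary could take in Go, using the standard embed package for assets and database/sql with a SQLite driver (modernc.org/sqlite is one pure-Go option; the directory name, schema and routes here are made up for illustration):

```go
// A rough sketch of the "single-binary web app" idea: assets compiled in
// via embed, SQLite as the DB, one process to deploy. The directory name,
// schema and routes are made up; modernc.org/sqlite is one pure-Go driver.
package main

import (
	"database/sql"
	"embed"
	"io/fs"
	"log"
	"net/http"

	_ "modernc.org/sqlite"
)

//go:embed static
var assets embed.FS // HTML, CSS and images baked into the binary

func main() {
	db, err := sql.Open("sqlite", "app.db")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS hits (n INTEGER)`); err != nil {
		log.Fatal(err)
	}

	// Serve the embedded "static" directory at the site root.
	sub, err := fs.Sub(assets, "static")
	if err != nil {
		log.Fatal(err)
	}
	http.Handle("/", http.FileServer(http.FS(sub)))

	// A trivial endpoint that writes to the SQLite file sitting next to the binary.
	http.HandleFunc("/hit", func(w http.ResponseWriter, r *http.Request) {
		if _, err := db.Exec(`INSERT INTO hits (n) VALUES (1)`); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write([]byte("ok"))
	})

	// App, assets and DB all live behind this one binary.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With a pure-Go driver the whole thing cross-compiles into one static file you can just copy to the VPS and run.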
I don’t really know what this is, but the concept feels very appealing to me. Kinda reminds me of the PHP days where you just uploaded a script and opened the browser. I know PHP still exists, but it requires a web server and configuration to run. The idea of a single binary feels even more portable than PHP.
Also https://redbean.dev/ is a very inspiring and interesting project.
CGI would assume that you are in a multi-process environment. Most unikernels are single-process (many are multi-threaded though). CGI would also assume all the usual unix trappings such as users, interpreters, shells, etc.
The most obvious benefit is ease of use. There are no shells or interactivity with unikernels, so when you deploy it - it’s either running or it’s not. While you can configure the server, there isn’t ssh running where you can pop in and do extra configuration post-deploy.
Then there is security. CGI quite literally is the canonical “shelling out” mechanism. CGI and the languages that heavily used it in the 90s and mid aughts were fraught with numerous security issues because of this practice. You have to deal with all the input/output sanitization and lots of edge cases.
Then there is performance. CGI is woefully underperformant since you have to spawn a new process for each request, and on more modern “big data” systems dealing with heaps in the tens of gigabytes that becomes ridiculously bad.
Anyways, very few people actually use ‘cgi’ as such today. For languages that need to rely on a front-end proxy like nginx (because they are single process, single thread - like ruby, python, node, etc.) they siphon incoming requests off the front-end (nginx) and push them to your app back-end (your actual application).
Unikernels work really well for languages that deal with threads, like Go, Rust, Java, etc. They work well for scripting languages too, but in the setup I just described the back-end becomes individual VMs instead of worker processes. They basically force the end-user to fit their workloads to the appropriate instance type.
The most obvious benefit is ease of use. There are no shells or interactivity with unikernels, so when you deploy it - it’s either running or it’s not. While you can configure the server, there isn’t ssh running where you can pop in and do extra configuration post-deploy.
Isn’t that anti ease-of-use? I like to be able to go in and dig around when something goes wrong.
IMO this makes debugging significantly easier than deploying to a linux system. If I throw on my devops hat and start thinking about all the times pagerduty woke me up at 2am, half the time is spent figuring out which process is causing the issue. Something is opening up too many sockets too fast? Open up lsof to figure out which process it is. I can’t ssh in because logfiles are overflowing the disk? Now I have to hunt down the cronjob I didn’t know existed that didn’t have proper log rotation. In unikernel land there really is only one process, so you know which program is causing the issue. Instrument it, ship your logs out and you are going to solve the vast majority of issues quickly.
Also there are other cases where debugging is significantly easier. Since it’s a single process system you can literally clone the vm in production (say a live database), download the image and attach gdb to it and now you not only are going to find the root of the problem but you are going to do so in a way that is non-destructive and doesn’t touch prod.
As an aside, the ease-of-use I was referring to was not pertaining to debugging (although that is insanely easy) but to deployment/administration as compared to the dumpster fire of k8s/“cloud native”. Unikernels shift all the administration responsibilities to the cloud of choice. So while you can configure networking/storage to your heart’s content, you don’t have to actually manage it. Most people don’t understand this point about unikernels - they think a k8s-like tool is necessary to use them, which is totally not true. This won’t make sense to most people until they actually go deploy them for the first time, and then it clicks.
I’d argue that’s a combination of A) familiarity, and B) linux OS having built out robust introspective/investigative tooling over decades.
A unikernel has the advantage that many of those investigative tools are for problems that no longer exist, and the disadvantage that it no longer has those tools baked-in for the problems that it does still have.
EG you don’t have du, but you also don’t have a local filesystem that can bring down the server when you accidentally fill it with logs/tempfiles.
Most unikernels actually do work with filesystems, as most applications that do anything want to talk to one. Databases, webservers, etc. Logging is dealt with in mostly two ways: 1) You either don’t care and you just rely on what goes out over the serial - which is stupid slow, so not really a prod method - or 2) You do care and you ship it out via syslog to something like elastic, papertrail, splunk, etc.
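For option 2, the application-side wiring can be tiny - a minimal Go sketch, assuming a syslog-speaking collector (the address and tag below are placeholders):

```go
// A minimal sketch of option 2: point the standard logger at a remote
// syslog collector so nothing accumulates on a local disk. The address
// and tag are placeholders for whatever collector you actually use.
package main

import (
	"log"
	"log/syslog"
)

func main() {
	w, err := syslog.Dial("udp", "logs.example.com:514", syslog.LOG_INFO|syslog.LOG_DAEMON, "myapp")
	if err != nil {
		log.Fatalf("cannot reach log collector: %v", err)
	}
	defer w.Close()

	// Everything written via the standard logger now leaves the instance
	// immediately instead of filling up a local filesystem.
	log.SetOutput(w)
	log.Println("service started")
}
```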
EG you don’t have du, but you also don’t have a local filesystem that can bring down the server when you accidentally fill it with logs/tempfiles.
So how do you find out that your app e.g. crashes because some database key was NULL?
You have logging and/or crash reporting and you do something useful with them. It’s your problem to do though, but that’s not really any different than deploying to a traditional stack.
OK, but where do the logs go? If they go to another server, which then stores them on a filesystem you’ve just kicked the can elsewhere.
If you wanted to use rust (or see about statically linking in a sqlite VFS via cgo) you could look at https://GitHub.com/backtrace-labs/verneuil for sqlite replication. Completely in-process, and it exposes both a rust and C interface. It works quite well for all my home use cases, like replicating my blog’s sqlite db to an s3 blob store for easier rollbacks.
This has gotten brought up quite a lot in the past few years, but every time I check there still seem to be a whole host of issues.
For whoever is actually following the scene: can you comment on the following (lack of) support? To me these are absolutely huge blockers to becoming a new generic app runtime a la k8s:
true threads?
64bit - https://github.com/WebAssembly/memory64
dynamic linking - https://github.com/WebAssembly/tool-conventions/blob/main/DynamicLinking.md
real sockets? - https://github.com/WebAssembly/WASI/pull/312
aslr/stack canaries?
TLS? - https://github.com/bytecodealliance/wasmtime/issues/71
A summary from my knowledge:
re: threads: No, shared array buffers and web workers are not what I would consider true threads.
re: 64bit: this effectively rules out all databases, most ‘enterprise’ software and many, many other applications such as machine learning applications and other assorted ‘big data’ projects because of memory addressing requirements; also I haven’t deployed any 32bit applications in prod for, what, 10 years now?
re: aslr/stack canaries: I might’ve just labeled this ‘generic security’, as wasm doesn’t even have the notion of read-only memory - just because it’s in a ‘sandbox’ doesn’t mean anything, as is pointed out in this paper: https://www.usenix.org/system/files/sec20-lehmann.pdf
As you pointed out - I guess that’s my point. There are definitely interesting use-cases with WASM and WASI, but to answer the parent article’s claim that it is the next k8s or next application run-time - without addressing most of this (and I agree some of these will never be addressed) - it is a resounding ‘no’. WASI was supposed to be the host environment but it also fails to address most of this. I suppose one could code up WASI++ and make it a server-side only implementation but then it’s not WASM (or even w/WASI) anymore.
Re address space: it’s not quite as dire as it sounds. WASM does not quite operate in a 32bit address space. Effectively it has a core heap which is indeed bound to 32bits, but typed arrays exist outside of that and are indexed by size, so for example a wasm binary can access the 2^32-1th element of an array of doubles without a problem, and it can have many such typed arrays. That’s arguably useful in some cases like ML, as they could be arranged to exist primarily in GPU memory, for example.
The read/write memory issues the usenix paper spends time discussing aren’t super significant, as there are very few cases where I have seen any program, security sensitive or not, use read-only memory to protect the heap.
Most of the other issues in the paper boil down to the host environment blindly trusting the VM, which is a no-go for any containerisation mechanism.
I think a core issue with the usenix paper is that it is trying to pretend that wasm is meant to provide a memory safe environment. But that is by design not what wasm is. Wasm must be able to run code from a non-memory or type safe language, which means that the VM cannot, in the general case, place limits on how people interact with memory.
I do agree that it could have been achieved in a way that was harder to exploit, but the folk who started it were more interested in compiling arbitrary C into a form that could be executed at near native speed, while remaining safe in a browser context, so this is the model we got :-/
Are you saying you can address more than 4 gig of memory? It is my understanding that you can’t, and if you can’t, that excludes a ton of applications - most importantly databases. I’m not saying that you can’t run other programs, but the initial claim in the parent article makes it sound like wasm is going to be the new underlying runtime, which is what I don’t agree with because of limitations like this.
which means that the VM cannot, in the general case, place limits on how people interact with memory.
This is a fundamental problem though.
Let me paint a picture for you. You have a function called “is_correct_password” or “is_admin” or “can_rm_rf_this_entire_partition”. The fact that you can’t mark certain pages as read only means I can overwrite that with whatever I want (eg: return true). It is a major security issue and the paper clearly shows real life examples of how dangerous it is. The whole “it is in a sandbox” claim that wasm proponents continuously make makes no sense as everything inside the sandbox - eg: the program logic at large is at risk.
To put it another way if you take an ordinary linux application running as an elf - say a go program, or a “memory-safe” rust program you considerably downgrade the security of it by compiling and running it as wasm.
Are you saying you can address more than 4 gig of memory?
From WebAssembly? Yes. From C code? No. WebAssembly has memory objects. Each memory object is bounds checked. Every memory access is a memory object identifier + some offset (which must be within the size of the memory object). In 32-bit WAsm (the only version fully specified so far) each memory object is limited to a 32-bit size.
C code compiled to WebAssembly uses memory object 0 and lowers pointers to integer values that represent an offset within the allocation. You could probably create a different lowering that used 64-bit integers to encode a memory ID and an offset (though most implementations heavily optimise for accesses in object 0 and so this would probably be slow).
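To make the lowering concrete, here’s a toy sketch in Go (written purely for illustration, not taken from any real runtime; the type and function names are mine) of what a bounds-checked load against memory object 0 amounts to on the host side:

```go
// A toy illustration (not any real runtime's code) of the description
// above: a Wasm linear memory is a bounds-checked byte array, and a C
// pointer compiled to Wasm is lowered to a plain 32-bit offset into it.
package main

import (
	"encoding/binary"
	"fmt"
)

// memory models one Wasm linear memory object.
type memory struct {
	data []byte // grows in 64 KiB pages, capped at 4 GiB in 32-bit Wasm
}

// loadU32 is roughly what an i32.load becomes on the host side:
// check that offset+4 fits inside this memory object, then read.
func (m *memory) loadU32(offset uint32) (uint32, error) {
	if uint64(offset)+4 > uint64(len(m.data)) {
		return 0, fmt.Errorf("out-of-bounds access at offset %d", offset)
	}
	return binary.LittleEndian.Uint32(m.data[offset:]), nil
}

func main() {
	mem0 := &memory{data: make([]byte, 64*1024)} // a single 64 KiB page
	binary.LittleEndian.PutUint32(mem0.data[128:], 42)

	fmt.Println(mem0.loadU32(128))     // a "pointer" is just the offset 128
	fmt.Println(mem0.loadU32(1 << 20)) // past the end: trapped, not silent corruption
}
```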
I personally hate this part of WebAssembly. Providing a PDP-11 memory model in 2017 is inexcusable. Especially since this was two years after I’d published the paper showing that most C could be lowered to CHERI with object-granularity spatial memory safety with zero or few changes (we’re currently seeing around 0.02% line-of-code changes in large C/C++ codebases for CHERI - smaller changes than porting to an environment without exceptions, for example).
Let me paint a picture for you. You have a function called “is_correct_password” or “is_admin” or “can_rm_rf_this_entire_partition”. The fact that you can’t mark certain pages as read only means I can overwrite that with whatever I want (eg: return true).
WebAssembly doesn’t allow you to mark pages read only, but it also doesn’t allow you to mark pages as executable. Code, data, and stacks, are completely distinct memory regions in WebAssembly. You cannot modify code, you cannot take the address of the stack (address-taken stack allocations are lowered to allocations within data memory).
There are a lot of legitimate reasons to criticise WebAssembly but this is not one of them.
You can overwrite constants in what would traditionally be read-only memory. You can overwrite portions of the stack. You can overwrite the heap. You can overwrite function pointers.
In particular, your suggestion that you can’t redirect calls is incorrect. The paper clearly shows examples of them redirecting indirect calls.
A core difference is that you don’t get the blind overwrites you have on a real machine.
A “function pointer” in the wasm runtime is either an index into the global function table, or a funcref. When a function is called through that indirection, the call site must be calling with the correct signature. So not only must your substitute function be an existing entry point, it must also have the correct type, further limiting the ability to attack said code.
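A toy model of that check, again in Go purely for illustration (not spec or runtime code, and the names are made up):

```go
// A toy model (not spec or runtime code) of the call_indirect check
// described above: corrupting the index stored in linear memory only
// lets an attacker pick another existing table entry of the same type.
package main

import "fmt"

type funcType string // e.g. "(i32)->i32"

type tableEntry struct {
	typ funcType
	fn  func(int32) int32
}

// The function table lives outside linear memory; Wasm code cannot write to it.
var table = []tableEntry{
	{"(i32)->i32", func(x int32) int32 { return x + 1 }},
	{"(i32)->i32", func(x int32) int32 { return x * 2 }},
}

// callIndirect mimics a call_indirect expecting type "(i32)->i32":
// bounds-check the index, check the signature, then call.
func callIndirect(idx uint32, expect funcType, arg int32) (int32, error) {
	if int(idx) >= len(table) {
		return 0, fmt.Errorf("table index %d out of range", idx)
	}
	e := table[idx]
	if e.typ != expect {
		return 0, fmt.Errorf("signature mismatch at index %d", idx)
	}
	return e.fn(arg), nil
}

func main() {
	fmt.Println(callIndirect(0, "(i32)->i32", 20)) // 21 <nil>
	fmt.Println(callIndirect(1, "(i32)->i32", 20)) // 40 <nil>: a swapped index still hits a typed entry
	fmt.Println(callIndirect(9, "(i32)->i32", 20)) // error: table index 9 out of range
}
```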
I don’t know what you’re talking about when you say you can overwrite constants that would be traditionally readonly. The obvious one would be global variables, but globals are marked as being mutable or immutable at individual declaration granularity - they aren’t part of the primary memory region.
In the paper they clearly show overwriting read-only data (page 11, figure 9) and they clearly show overwriting function pointers that are implemented as indirect calls (page 11, figure 8).
Figure 9) ok, having read and reread I think I understand. By the VM spec it seems like this really should be read only, however the VM spec also allows the wasm vm instantiator the ability to place data at arbitrary locations, so they can’t be made read only. Coolcoolcool. That’s before you even consider sharing the same region with other vms and js or whatever the host environment is. In an ideal world the spec would mirror at least the basic idea of having an ro text section. In an ideal world it would probably work more similarly to shader languages in which you don’t get a priori knowledge of where in your linear address space things can go - you have to query the vm. That would also allow the vm to do aslr if it did end up being necessary.
That would also let the VM arrange things more sanely and so allow multimapping, etc to work with shared memory.
Figure 8) exactly as I said, they were able to change a call from one function of a specific type, to another function of that same type
Sorry, I am a moron, for some reason I recalled being able to share typed arrays with the VM - I don’t know if that’s something that came up early on in wasm planning land, or something my memory completely screwed :D
For the second problem, your example can’t happen: the wasm vm is a Harvard architecture: program code does not exist in the data heap, and instructions in the wasm VM can only read and write to the data heap.
The WASM VM cannot do run time codegen at the moment, and any future APIs have to consider the safety of the VM which means they’re not going to allow rewriting of existing code in arbitrary ways.
Strictly of course a WASM program could in principle generate new wasm code, pass that to the host, and then have the host instantiate that in a new VM, but I suspect the end performance may be suboptimal :D
[edit: immediately having posted this I realized that I probably mixed up people historically complaining about the address space in JS, because [typed]arrays could (at the time) only have a maximum length of 2^32-1 elements. Obviously this isn’t “4gb” as it takes more than one byte per element, and of course you can have many such arrays.]
So just to be clear - can you point me to an example of a database that needs 16 or 32 gig of ram to store indices in memory - not on disk - as wasm?
I’m not a DB person, but I know many, many systems that work with large data use MMIO, which means you need a large address space even though you don’t necessarily have sufficient actual memory.
Sure - but when that pages out to disk there goes your performance. Again I’m not talking about keeping a working set in memory. I’m just talking about keeping a large index in memory. If you can’t do that then the database grinds hard.
Yes, but the databases control their access very carefully.
IIRC oracle dbs used to have their own file system so that they could get even better control of caching, etc
Whether that’s still necessary I don’t know - you can get “consumer” hardware with ~200gb+ of ram these days so maybe it’s less important?
Not sure we’re on the same page here.
I literally can’t spin up, say, a mysql database with more than 4gig of ram using wasm today and expect it to use that ram to store things like indices in memory. So if I have a small database that has 32 gig of ram, that’s a no-go for wasm today.
I agree - we aren’t
I think you missed my earlier correction (I thought memory blobs were in the spec) and I missed the “in wasm” in your subsequent request for examples :D
true threads?
This is a tricky one. There are some proposals floating around to allow fine-grained memory safety in Wasm. These are very cheap to implement if you don’t allow shared-memory concurrency, much harder if you do (unless you have CHERI hardware). Disallowing pointers in shared buffers (requiring explicit message sending of pointers between workers) would work.
64bit
As far as I’m aware, this is easy but not a priority because it lacks a motivating use case.
dynamic linking
This would bring a lot of problems. What is the scope of sharing? WebAssembly is intrinsically a sandboxed deployment model and a lot of the value comes from having a complete inventory of the code running in a given sandbox.
real sockets?
This is not a limitation of WebAssembly, it’s a bit of WASI that isn’t fully spec’d. An embedding can provide sockets.
aslr/stack canaries?
Stack canaries are less necessary in WAsm because the return address is always stored on a separate stack to any address-taken allocations. ASLR is of limited use these days (there are lots of bypasses that are integrated into exploit toolkits) and it’s also difficult to get a useful amount of entropy in a 32-bit address space. That said, the WAsm program has complete control over its 32-bit address space and so a malloc implementation can easily add randomisation without needing any extensions to WAsm.
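As a sketch of that last point (mine, not a real allocator): even inside a fixed 32-bit linear memory, a malloc can randomise placement on its own:

```go
// A toy sketch (not a real allocator) of the point above: even inside a
// fixed 32-bit linear memory, malloc can scatter allocations at random
// offsets on its own, without any Wasm extension.
package main

import (
	"fmt"
	"math/rand"
)

const heapSize = 1 << 26 // pretend linear memory: 64 MiB

// bump is a trivial bump allocator with a randomised base and random
// inter-allocation gaps.
type bump struct {
	next uint32
}

func newBump() *bump {
	return &bump{next: uint32(rand.Intn(1 << 16))} // random starting offset
}

func (b *bump) alloc(n uint32) uint32 {
	p := b.next + uint32(rand.Intn(256)) // random gap before each block
	b.next = p + n
	if b.next > heapSize {
		panic("out of linear memory")
	}
	return p
}

func main() {
	a := newBump()
	fmt.Println(a.alloc(64), a.alloc(64)) // offsets vary from run to run
}
```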
TLS?
Same answer as sockets. An embedding can expose KTLS.
64 bit
The motivating use-case for 64bit is simple. I need to load more than 4gig into memory - that’s pretty much any ‘enterprise’ software project or anything that touches a small amount of data.
dynamic linking
I agree that this is not a huge problem at all, however, I will point out that the vast majority of software (99%+?) is all dynamically linked.
real sockets/TLS
I guess my point here is that while individual programs have had this functionality tacked onto them, there is no common spec for it yet and probably won’t be? I linked directly to the WASI proposal issue that has been dead for at least a year. Until that extension exists that’s a fairly hard limitation. Most programs that expect use of native sockets, and by extension TLS, are not set up to delegate that responsibility over to a shim, which is how some people are dealing with this now.
The motivating use-case for 64bit is simple. I need to load more than 4gig into memory - that’s pretty much any ‘enterprise’ software project or anything that touches a small amount of data.
There’s a difference between more than 4 GiB of memory in a system and more than 4 GiB of memory in a single security context. I would suggest that any ‘enterprise’ software that puts 4 GiB in a single security context is something that the red team should focus on in their next audit.
I agree that this is not a huge problem at all, however, I will point out that the vast majority of software (99%+?) is all dynamically linked.
This is increasingly untrue for cloud deployments. Increasingly, VMs run a single container, which contains a single process. There is no benefit from dynamic linking and there’s a performance overhead.
? Are you saying I can map more than 4gig of memory in wasm? I was under the distinct impression that is totally impossible right now. As I’ve pointed out almost every single prod db will be doing this and many many java applications will be doing this.
As for statically linking in a container - that is still nowhere close to being prevalent. One small security benefit of dynamic linking is randomization of addresses from the libraries. I stand by my comment that over 99% of software is dynamically linked. If you have actual stats to show otherwise happy to see that.
K8s itself is an evolution on a previous architecture, OpenStack. OpenStack had each container be a full virtual machine, with a
Yeah this is WRONG. Roughly speaking, OpenStack is an open source AWS, while Kubernetes was inspired by Borg (and both came out of Google).
And OpenStack/AWS are based on VMs, while Borg/Kubernetes are based on containers.
What eyberg posted from Hashicorp (a competitor) is slightly biased but generally correct. Google advertises Kubernetes as like Borg but in my opinion it’s worse. (I used Borg for many years at Google across several domains.)
I quote a first hand and informed opinion in this blog post:
http://www.oilshell.org/blog/2021/07/blog-backlog-2.html
MetalLB taught me that it’s not possible to build robust software that integrates with Kubernetes.
GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale.
Armon isn’t stupid. He knows full well openstack is built for vms and k8s is built for containers.
What he is saying here is the vendor/marketing/kitchen sink approach is the same - not the tech.
Armon captures this common sentiment pretty well in this interview:
Armon Dadgar: For me, it comes down to three really simple things. One is just the elegance of experience. Kubernetes is OpenStack 2.0. It’s just as complicated. It’s just as vendor controlled. It’s just as foundation led.
Armon Dadgar: Let’s ignore the usability of Kubernetes, which is a mess. Let’s talk about its actual operational scalability. It’s also a joke. Borg runs on 10 million nodes, Kubernetes falls over if you have a few hundred. And so in what sense did Google learn from Borg when Kubernetes only scales to 1/1000th of what Borg does?
The sentiment is generally right, although it’s not a great source since it’s a competitor.
But a worthwhile correction: A borg cluster doesn’t span 10M nodes. When I left >5 years ago they were 10K to 100K nodes each, and they don’t really talk to each other, and there is a lot of manual “toil” to keep everything up and synchronized and manage downtime. (I think this information is readily available in the Borg paper.)
Still, I agree Kubernetes is worse than Borg along many dimensions, and I quoted another blog in a sibling comment giving lots more detail than that.
Unikernels are not a new idea. On Exokernel and Nemesis, they were called ‘processes’. Xen is a direct descendant of Nemesis, so it’s not very surprising that a lot of the unikernel ideas started there. It basically boils down to two things:
Everything running on a modern hypervisor is equivalent to an Exokernel / Nemesis process, which is now rebranded as a unikernel. If you’re using Linux / Windows / *BSD, then the library that you’re bringing along to provide high-level abstractions is providing a rich filesystem (typically, more than one), nested process isolation, multiple users, a generic network stack, and so on. This is a really crappy way of building a program.
Well, at least for this interpretation we don’t want users, multiple processes and such. Part of the argument here is that in 2021 it doesn’t make a ton of sense to be shoe-horning an extra layer of Linux on top of an already existing layer (the cloud) when you are managing thousands of VMs.
Well, at least for this interpretation we don’t want users, multiple processes and such.
I’d slightly reframe that as: we shouldn’t pay the complexity cost of users, multiple processes, and such unless we need them for our program.
Part of the argument here is that in 2021 it doesn’t make a ton of sense to be shoe-horning an extra layer of Linux on top of an already existing layer (the cloud) when you are managing thousands of VMs.
It’s also that the abstractions are just plain wrong. A modern web app does want users, but it doesn’t want UNIX users; it wants OAuth users with JWTs, not 16-bit UIDs. It does want persistent storage, but not in the form of a local filesystem; rather in the form of a database or a distributed object store. It definitely does want network access, but it could have a network stack that’s aggressively optimised for its use case (which may be TCP-only with TLS offload, or UDP+QUIC only, for example, and may be optimised for connections whose lifetime follows a predictable distribution, and so on). The only IPC it’s likely to want is networked IPC, because other components in the same distributed system may not be on the same physical machine (though it would like fast one-copy networking if they are). It wants a VCPU abstraction that maps to the available parallelism so that the language runtime / middleware can scale the available concurrency to something efficient; it doesn’t want a scheduler that is unaware of both the VCPU scheduling and the distribution of work (or anything else workload-specific) sitting between it and the hardware.
Might be worth looking at unikernels? One example being MirageOS - I think I remember a talk saying it was so quick that they could freshly boot to their app on every request. Although I’m guessing you wouldn’t want to do this for an OTP-style app, which emphasizes long running processes. Seems like there is a dead project called LING that wanted to do this for .beam files, and there’s a project called rumprun that claims to support Erlang, but activity on it seems low.
We were able to get erlang, along with elixir, working with Nanos: https://github.com/nanovms/erlang && https://github.com/nanovms/ops-examples/tree/master/elixir - and Nanos doesn’t just have daily activity, it has full time engineers working on it. (I’m with the project.)
TempleOS only runs one app at a time, the logic being that a human can only concentrate on one thing at a time. There is no need for multitasking, but sometimes an app would benefit from additional processing power, so TempleOS can do multi-core processing by having a master-slave model: the main CPU can control the other CPUs and hand out tasks to them.
And while we think about crazy OS ideas: Why not run multiple independent kernels - one on every CPU? So while a rogue process could corrupt and take down one kernel, the rest of the system would continue working.
TempleOS can do multi-core processing by having a master-slave model: the main CPU can control the other CPUs and hand out tasks to them.
Classic Mac OS did this.
And while we think about crazy OS ideas: Why not run multiple independent kernels - one on every CPU? So while a rogue process could corrupt and take down one kernel, the rest of the system would continue working.
This kinda exists already; galaxies in VMS achieve virtualization this way, I believe. Running multiple OS instances could be useful for the Erlang OS mentioned though, especially since the concepts could make it transparent.
Classic Mac OS did this.
I forgot that the MDD dual G4 models could still boot Mac OS 9. I’m reasonably certain those shipped after OS X came out of beta, though, and could only boot OS 9 because it was (only slightly) too early to stop that.
Were there other multi CPU macs that booted classic Mac OS?
They made a bunch of SMP addons and even systems in the 90s. It was pretty much entirely to speed up gaussian blurs in Photoshop.
You are describing unikernels.
From my perspective it’s less about the human and more about the fact that most companies don’t run one computer or even one database - they run thousands. We are long past the “one operating system / computer” phase. Even the smallest companies are load balancing their webservers amongst multiple vms. We need new operating systems to facilitate this.
I think on a personal level, nobody has only one device anymore (desktop/laptop/phone/tablet/smartwatch/e-reader/smart-TV/home-automation-stuff/home-server-possibly) and we need a good unified system for handling this, instead of pretending they’re all islands that just happen to communicate sometimes, with integration an afterthought.
Why not run multiple independent kernels - one on every CPU?
This has been explored in research. Check out http://www.barrelfish.org/
And while we think about crazy OS ideas: Why not run multiple independent kernels - one on every CPU? So while a rogue process could corrupt and take down one kernel, the rest of the system would continue working.
Check out rump kernels in NetBSD - probably not the same idea, but it can be achieved.
And while we think about crazy OS ideas: Why not run multiple independent kernels - one on every CPU? So while a rogue process could corrupt and take down one kernel, the rest of the system would continue working.
HydrOS did this (and it was a BEAM OS).
The first thing that comes to my mind is to use this as the substrate for a securely sandboxed, distributed social-software platform, i.e. something like Urbit without the Nazi problem and baked-in feudalism. (And, for that matter, the willfully impenetrable custom programming language.)
Web Assembly has various security issues such as the lack of read-only memory that I don’t see being fixed anytime in the near future: https://www.usenix.org/system/files/sec20-lehmann.pdf .
Interesting. But those are vulnerabilities within a WASM process, which would not be as much of a problem for a compartmentalized design like Lunatic.
That doesn’t make any sense. Of course it’s an issue and it’s not compartmentalized if the compartmentalization is itself broken because of this problem. The sand castle works until the tide washes it away.
I’m assuming the issues described in that paper would allow exploits of buggy WASM code, that could allow an attacker to run their own WASM code in the process. But that they wouldn’t let the attacker break out of the WASM sandbox itself (because that would be a huge problem that would have to be fixed.)
So: worst case is that one process within a Lunatic system is compromised. But the worst it can do is read incoming messages and send incorrect outgoing ones. Which might be a problem but not nearly as bad as in a non-compartmentalized design.
No that’s incorrect. It completely allows breakout of the “sandbox”.
How are you sandboxing that “one process”? What security controls are you relying on? Please describe in detail. You are going to need to provide some evidence on what exactly WASM is providing here. I think that linked paper is extremely clear on what is missing.
Normal linux programs have a host of security controls like ASLR and read-only memory that limit what an attacker can do with vulnerable memory. If those controls are missing, like they are in WASM, then everything is hosed.
WASM is a regressive security nightmare and is fundamentally broken.
I find Docker funny, because it’s an admission of defeat: portability is a lie, and dependencies are unmanageable. Installing dependencies on your own OS is a lost battle, so you install a whole new OS instead. The OS is too fragile to be changed, so a complete reinstall is now a natural part of the workflow. It’s “works on my machine” taken to the conclusion: you ship the machine then.
We got here because dependency management for C libraries is terrible and completely inadequate for today’s development practices. I also think Docker is a bit overkill, but I don’t think this situation can be remedied with anything short of NixOS or unikernels.
I place more of the blame on just how bad dynamic language packaging is (pip, npm), intersected with how bad most distributions butcher their native packages for those same dynamic languages. The rule of thumb in several communities seems to be a recommendation to avoid using native packages altogether.
Imagine if instead static compilation was more common (or even just better packaging norms for most languages), and if we had better OS level sandboxing support!
I don’t think npm is problematic to Docker levels. It always supported project-specific dependencies.
Python OTOH is old enough that by default (if you don’t patch it with pipenv) it expects to use a shared system-global directory for all dependencies. This setup made sense when hard drive space was precious and computers were off-line. Plus the whole v2/v3 thing happened.
by default (if you don’t patch it with pipenv)
pipenv is…controversial.
It also is not the sole way to accomplish what you want (isolated environments, which are called “virtual environments” in Python; pipenv does not provide that, it provides a hopefully-more-convenient interface to the thing that actually provides that).
Yes, unikernels and “os as static lib” seem the sensible way forward from here to me, also. I don’t know why it never caught on.
People with way more experience than me on the subject have made a strong point about debuggability. Also, existing software and libraries make assumptions about the filesystem and other things being there that are not immediately available on unikernels, and rewriting them to be reusable on unikernels is not an easy task. I’m also not sure about the state of the tooling for deploying unikernels.
Right now it’s an uphill battle, but I think we’re just a couple years away and we’ll get there eventually.
Painfully easy to debug with GDB: https://nanovms.com/dev/tutorials/debugging-nanos-unikernels-with-gdb-and-ops - Bryan is full of FUD
GDB being there is great!
Now you also might want lsof, netstat, strace, iostat, ltrace… all the tools which exist for telling you what’s going on in the application to kernel interface are now gone because the application is the kernel. Those interfaces are subroutine calls or queues instead.
It’s not insurmountable but you do need to recreate all of these things, no? And they won’t be identical to what people are used to.
I guess the upside is that making dtrace or an analogue of it in unikernel land is prolly easier than it was in split kernel userspace land: there’s only one address space in which you need to hot patch code. :)
It obviously wouldn’t be “identical to what people are used to” though, that’s kind of the point. And you don’t want a narrow slice of a full linux system with just the syscalls you use compiled in; it’d be a completely different and much simpler system designed without having to constantly jump up and down between privilege levels, which would make a regular debugger a lot more effective at tracking a wider array of things than it can now while living in the user layer of a full OS.
Perhaps some tools you’d put in as plugins but most of the output from these tools would be better off being exported through whatever you want to use for observability (such as prometheus). One thing that confuses a ton of people is that they are expecting to deal with a full blown general purpose operating system which it isn’t. For example if you take your lsof example - suppose I’m trying to figure out what port is tied to what process - well in this case you already know cause there’s only one.
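As a rough sketch of what that export-it-instead-of-ssh-ing-in approach looks like, using the Prometheus Go client (the metric name is made up for illustration):

```go
// A rough sketch of exporting the kind of numbers you'd otherwise dig out
// with lsof/netstat, using the Prometheus Go client. The metric name is
// made up for illustration.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var openSockets = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "app_open_sockets",
	Help: "Sockets currently held open by the (single) process.",
})

func main() {
	prometheus.MustRegister(openSockets)

	// The application calls openSockets.Inc()/Dec() wherever it opens or
	// closes connections; there is only one process, so no ambiguity.

	// The observability stack scrapes this endpoint, so there is nothing
	// to ssh into when something misbehaves.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```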
As for things like strace - we actually already did implement something similar a year or so ago as it was vital to figure out what applications were doing what. We also have ftrace like functionality too.
Finally, as for tool parity, you are right that if all you are using is Linux then everything should be relatively the same, but if you jump between, say, osx and linux you’ll find quite a few different flags or different names.
Can you further clarify? With your distribution’s package manager and pkg-config development in C and C++ seems fine. I could see docker being more of a thing on Windows with C libraries because package management isn’t really a thing on that OS (although msys seems like it has pacman which is nice). Also wouldn’t you use the same C library dependency management inside the container?
Funny enough, we are using docker at work for non-C languages (dotnet/mono).
That’s exactly what I said at work when we began Dockerization of our services. “We just concluded that dependency management is impossible, so we may as well hermetically seal everything into a container.” It’s sad that we’re here, but there are several reasons both technical and business related why I see containerization as being useful for us at $WORK.
Which is what we used to do back in the 70s and 80s. Then operating systems started providing a common set of interfaces so you could run multiple programs safe from each other (in theory), then too many holes started opening up and programs relying on specific global shared libs/state which would clash, and too many assumptions about global filesystem layout, and now we’ve got yet another re-implementation of the original idea, just stacked atop of and wrapped around the old, crud piling up around us comprised of yak hair, old buffer overflows, and decisions made when megabytes of storage were our most precious resource.
What if I told you that you don’t need an os at all in your docker container? You can, and probably should, strip it down to the minimal dependencies required.
I’ve seen this meme before :)
https://www.reddit.com/r/ProgrammerHumor/comments/cw58z7/it_works_on_my_machine/
This analysis is far too zealous.
They include images which openly advertise themselves as cryptocoin miners or as hacking tools, because if someone downloads one of those images and runs it without authorisation in a corporate environment, the image is being used for nefarious purposes.
Sure, companies might want to ban the kannix/monero-miner image, but that doesn’t mean the image has a vulnerability. Does Firefox have a “critical vulnerability” because some companies don’t let people install their own web browser?
Really?
We see one of these new reports almost monthly. Sometimes more than once a month.
This one includes:
“Analyzing all 6,432 malicious / potentially harmful container images is a daunting task.”
“The first example of a trojanized application can be found in a container image qiscus123/qiscus-wp-2. Built upon WordPress, the webshell is disguised under a WordPress SEO plugin Yoast: …. Upon closer inspection, it turns out to be a classic WSO web shell (Web Shell By Orb):”
“Another example of a trojanized application can be found in a container image heroicjokester/tomcat. … As seen in its code, it provides a reverse shell on port 4334:”
“In the final example, a container image adminkalhatti/kl-jenkins in … Apart from Jenkins, the image also has several instances of XMRig cryptominer pre-installed in the following “
“Container image eternity18/ez is one such example. ….. Its index file /var/www/html/index.html contains a malicious VBS script that drops Ramnit – a backdoor designed for Windows systems:”
I understand there are people that like docker and kube for various reasons, but you can’t say that ecosystem isn’t filled with malware.
This is the, imo, more interesting pdf direct link:
https://prevasio.com/static/web/viewer.html?file=/static/Red_Kangaroo.pdf
From the PDF, these claims are…something?
Security industry is already raising concerns that proliferation of GoLang, file-less code and Powershell into the world of malware is the most unwelcome development over the recent years
Later on, the explanation about why Go, .NET, etc. are a security problem is because they’re cross-platform, which allegedly makes them more attractive to attackers because of the idea of write once, run anywhere. Yet most Docker images are typically packaged for a single platform, from what I know. This feels like they’re reaching for a way to sow fear.
The proliferation of C is like a plague, transmitting via the hands of our students, viciously infesting all of our machines with Unix.
Firecracker is a fast, lightweight virtualization tool. It was open sourced by Amazon and is part of the stack that lets AWS Lambda run tons of tiny “functions” in isolation from each other on a single server.
You still need an OS underneath, since the virtual machine needs a host to provide storage, networking, etc. (Not to mention the filesystem images normally used under Firecracker are very minimal, so you can’t really develop or debug in that environment.)
The more reasonable question is probably, “why use Firecracker instead of Docker?” In that case: Firecracker gives you much stronger security boundaries between “micro VMs” than Docker does between containers. If you don’t need that (i.e., all your containers are running code you trust) then it almost certainly isn’t worth the learning curve to go with Firecracker.
Both gVisor and firecracker are “slow” but usable, and in this case, being a raspberry pi, I’m going to guess that performance is not a huge limiting factor. gVisor can be ridiculously slow in many cases - they used to talk about performance penalties for “syscall heavy applications”, which is just a ridiculous statement to make. (Looks as if they’ve updated their perf docs recently: https://gvisor.dev/docs/architecture_guide/performance/ ) Firecracker, even though billed as ‘fast’, trades a slower run-time for a faster boot time, although it appears there are plans to fix that.
Having said that docker is unsafe at any speed. The claims of isolation in that ecosystem are kinda like the marketing claims from database companies that write to /dev/null.
Interesting project, but this marketing BS is not a good link for Lobsters. Here are the very basic questions I had about Nanos:
To answer all but the last of these, I had to click around and find the github repo. The FAQ on this marketing site is pretty vacuous. “Syscalls: N/A” lol.
Anyway, I think I’ll stick with https://mirage.io/ for my unikernel needs. But thanks for sharing!
Anyway, I think I’ll stick with https://mirage.io/ for my unikernel needs.
I only just looked into this stuff but… all of MirageOS’s peers are dead. So I think I welcome anyone new in the space. We need progress and competition!!!
Good point, but it perhaps says something about the unikernel concept itself and how it’s played out in practice. Overall, I agree, and I’m happy to see Other People’s Money being spent on a technical subject I personally find interesting. I sincerely wish the nanos folks all kinds of success.
At the same time, I must admit to having some doubts about the wisdom of implementing a unikernel with a completely fresh code base in a famously unsafe language, but hey, I guess we’ll see how it plays out.
The money point is important to point out. It’s not something you can just whip together in a weekend, and not something I could see happening without lots of $$ to employ full-time engineers. Even then, the difference between hello world and production usability is a huge gulf to cross - most teams unfortunately die before they can cross it. That, in my most humble opinion, is the biggest problem in the ecosystem. Government grants can only take you so far.
I hear your complaint against c. We’ve discussed doing other complementary projects in rust, but the pkg/dep system is a major turn-off and none of our engineers speak rust beyond hello world projects. Likewise, $$ again is a prime concern for working in that ecosystem. When you have tiny little saas companies that raise tens of millions and employ tens to hundreds of software engineers, you have to ask what the true cost of a new language or a new operating system really is.
All very cogent points, and far be it from me to armchair-quarterback anyone’s business strategy. Application support is probably the first priority for a real-world unikernel, and that drags in legacy APIs and other ecosystem factors that we can’t really control. Latent security problems only become critical when you actually have some customers.
I’ve had a corner of an eye on this space for a little while, but I wouldn’t claim any expertise. To me, it seems like Mirage has carved out a bit of a niche and, like many open source projects, is sort of puttering along just fine with little in the way of sponsorship. But its adoption is of course limited to the (relatively tiny) Ocaml ecosystem. Then there’s the rump kernel approach, which largely piggybacks on NetBSD’s flexibility and mature-ish code base. Looks like there’s some consultancies promoting that. Probably works great for some applications. Doesn’t appear to need VC, at a glance.
The only unikernel I have had any real first-hand experience with is HalVM. I can assure our readers that no realistic amount of government grants or eager first customers could have saved it from drowning in its own space leaks. Haskell itself is largely subsidized by the academic tenure system, and is thus well insulated from its own practical failings, but that’s another story entirely.
Agree with most of this, however, Antti spent something like well over a decade fleshing out rump - and that was on top of netbsd which was forked 27 years ago so I wouldn’t agree that the ecosystem doesn’t need financial resources. The EU is trying to pump some money into the ecosystem thankfully. There indeed are a few consultancies but they tend to come and go (again lack of resources) - I keep a running list.
in a famously unsafe language
Oof. I only just noticed it’s in C. Okay, I’ll root for it, but hopefully they’re making up for that handicap.
This site is a WIP - literally went up yesterday.
Also, this is not intended to be a marketing site - it’s a community site. I was curious if the lobsters crowd would cast it as such but since I have seen plenty of other sites like ziglang with ‘donate now’ buttons figured one link on a community site wouldn’t be ‘marketing’.
As for the questions:
If it’s a WIP, why did you post it here?
A donate button doesn’t make a marketing site, but there’s almost no substantial content — barely a synopsis — and it is indeed mostly trying to convince you to use it. That’s marketing.
Are WIPs not allowed to be posted to Lobsters? Zig is a WIP, it’s at 0.6.0. Can we not link to a Zig release page?
No, of course it’s fine; the point is someone said “weird that this page is missing these basic things” and the author replied “the site is a WIP, it only went up yesterday” to explain why these basic things are missing. Both are talking about the site, not the software itself, because they’re talking about the submission, not the product, which is a distinction so many people in this thread can seemingly not make.
Lobsters is about discussing submissions, right? If you feel the need to defend your submission by saying “it’s a WIP, it only went up yesterday” when someone points out key information is missing, I don’t think you should have posted it yet. What’s the rush?
If it is not appropriate I apologize, although I feel lobste.rs should post a guidelines document, as I see a lot of links posted that have commercial links back, and I can’t think of a single OSS project that is commercially supported that doesn’t have a link back.
Just now I found the following on lobste.rs - all that had more than one link going to a commercial post:
https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/ https://android-developers.googleblog.com/2020/09/android11-final-release.html https://blog.ipfs.io/2020-09-08-nix-ipfs-milestone-1/
Like I said, the donate button isn’t what matters, and the examples you’ve given serve my point: they are chock full of technical content and detail, which is what we want to see here. This submission has almost no substance.
A new operating system and a new file system aren’t technical enough?
This “submission” is backed up by multiple repositories of open source code:
https://github.com/nanovms/nanos
Where is this aggression coming from?
There’s sincerely no aggression; you’re asking about what’s appropriate and I’m doing my best to help you understand why (from my point of view) I feel this isn’t an appropriate submission. I think my opinion is at least somewhat representative of community norms based on votes. I’m not attacking you, I’m sharing my context. You don’t have to agree with my assessment.
The page is light on detail and mostly serves to advertise. That’s what it comes down to. I don’t think this submission adds anything. The direct repo link would always be better, or a technical detail post.
And yes, you posted the repo a little while ago, and that’s okay. If there’s a new release, link to the patch notes instead. Reposting the same link isn’t forbidden either, especially if there’s been big changes.
I think a lot of people would find these comments aggressive - maybe you can disclose your real name and who you work for. :) (That’s ok, you don’t have to.)
You are most definitely correct that it does advertise the community aspect of the operating system. I’m sorry if you are looking for a corporate-only POV. I won’t advertise it here but you can easily find it.
I’m glad you don’t find patch notes offensive - I’d find any patch notes offensive that are offensive. Nanos.org is a brand new site so sorry no new changes.
I actually found your comments to be on the aggressive side, while others tried to explain their views on why your post is flagged. And the length of their posts implies a sincere effort.
maybe you can disclose your real name and who you work for.
This is in bad faith. Additionally, my real name and employer are trivial to locate. I really think you should reevaluate your angle here.
I’m sorry if you are looking for a corporate only POV.
I have to believe you’re being intentionally obtuse in light of my suggesting “the direct repo link” or “a technical detail post”. I’ve really tried only to be kind in my comments here and to represent my views to get to a shared understanding with you, but you seem to consistently respond to the points raised with hostility while further stirring things up.
Where is this aggression coming from?
There seems to be a misunderstanding here. I think absolutely nobody is objecting to submitting Nanos, that would be ridiculous. People are objecting (although I disagree) to submitting the URL https://nanos.org/, instead of (more relevant) URL https://github.com/nanovms/nanos.
I hope this makes things clear.
The nanos.org website is strictly for providing a place for the community to have a voice outside of the company. One of the reasons for having a community site was to move a lot of knowledge that was in engineers’ heads to the community, in the hopes that it would not be lost.
The very fact that it is a .org and not a .com should make that painfully clear, but in 2020 I realize that is just not something that registers with everyone.
Just as kernel.org is an .org, yet retains logos from Fastly, Packet, Redhat, Google and many others, nanos.org is an open source site with source code found elsewhere.
Is this such a problem?
Submitting a site structured like kernel.org would indeed meet some annoyance, because there is nothing to read on this page, and it is not clear which of the links you considered most worth following. You could pick one previously unposted page with technical substance from the many you have mentioned, and say in the comments that «We are currently working on building a community site on nanos.org and making the linked material available (or updating it) was a part of that work». That would come across differently.
Note that a page that looks like a pure advertisement for a community is still pure advertisement, even if of a slightly different kind compared to commercial advertisement.
I guess you could imagine the following imaginary (or is it?) use case for Lobste.rs: automatically download the linked articles and, separately, the comments, and read them offline. If changing what you link to and mentioning the original link in the first comment would increase the value for such a use case and not clearly decrease the value for the more typical use, it is likely that such a replacement would also improve the reception here.
Submitting a site structured like kernel.org would indeed meet some annoyance
Are you kidding me?
I guess I was wrong in posting a free/open source community based website to lobste.rs - I’ll refrain from that in the future. So much hostility.
Strictly speaking, you were submitting a page, submissions are not really read as sites — there are surely many good pages on your site to submit.
I still don’t understand the frustration here. Is the main page not the most direct way to announce a public free open source tool that did not have it before?
If you had never heard of rust before, aside from a few coworkers chatting about it, and rust-lang.org didn’t exist but then popped into existence, you wouldn’t find it appropriate to post?
rust-lang.org goes in somewhat more details than the current state of nanos.org, and I guess a more detailed page about what choices Rust makes would be a better choice than the main page.
Many of us want to click the link and find a page containing a text about something technical that directly convinces us to care. Currently nanos.org tells me «Nanos is a unikernel» and nothing else (sure, almost every unikernel will mention that avoiding context switches will improve performance in read()-bottlenecked tests), then there are some links. I can easily understand people preferring to get a submission where the main link goes to a substantial text, not to a place where you need to find out which link to click to get the text. (I personally did not flag this submission, though)
I 100% agree there should be more detailed documentation and there most definitely will be, it’s a work in progress.
I suppose all the frustration can be summed up with “come back when there is X bytes of technical documentation and link to that rather than the front page”. Noted.
Literally nothing about the link you used for the submission is technical at all. It’s a marketing splash page.
Literally? Nothing?
https://nanos.org/thebook isn’t technical at all? After one day? With links to pull requests of code? And binary serialization formats for a non-extX filesystem?
Please.
This community has become so fucking toxic! It’s really sad that I now no longer enjoy reading the comments. I used to come here for the fun technical discussions. But once again the internet has ruined another great website. jcs should have kept everything invite-only and small. No wonder he gave up and walked away…
I think what you’re witnessing is the site adjusting to a course correction toward its original purpose. A few people were making a push toward getting more ‘culture’ related content on the front page, and that’s not what this site has ever been about. Some folks have been pushing back.
Fortunately, in the last few months I’ve seen more interesting material hitting the front page than in recent years, and the comments are still well worth exploring (in those threads). Anything resembling “toxicity” comes up in ‘culture’-tagged articles.
I was surprised to see it uses lwIP as a network stack. Contrary to my preconception, maybe lwIP is fast enough?
Full disclosure: there’s (very little) code of mine in nanos, but I haven’t worked on it/used it in about 1.5 years now – I wanted to but there just weren’t enough hours in a day for it :(.
Back when I used it, lwIP wasn’t super-fast, but it was pretty easy to work with and pretty flexible. I haven’t run any serious benchmarks (i.e. on dedicated networking/serving hardware), but I did run some basic tests on some of my development machines and it wasn’t dog-slow. I don’t expect it would be useless for production workloads, although there’s likely quite some room for improvement.
Thanks for your contributions! Perf was not top of mind at that point in time (it’s actually one reason why nanos.org didn’t exist back then), but a few of the FTs have made some serious progress since. When Brian added ftrace it really revealed a lot that was hard to see before.
FWIW, I think lwIP really is the best choice here, and as I mentioned above, I don’t doubt that it’s adequate for production workloads. There’s always room for improving any networking stack, but lwIP itself is the result of a lot of super solid work. Coming up with something better than it is, in and of itself, a project of considerable magnitude, comparable to developing a unikernel itself!
I think it gets a bad reputation because it’s used in tons of bad firmware, often running on top of bad hardware. I haven’t done it myself, but I know of at least one case where poor performance was traced not to lwIP but to a buggy Ethernet MAC implementation in the FPGA below it.
wasn’t super-fast, but it was pretty easy to work with and pretty flexible.
This is a fantastic reminder that we often need to balance usability (and ease of use in general) against other things. Kudos for being honest and open about the trade-offs.
Indeed! See my other reply for some additional details – like I said, my involvement with nanos has been minimal (a few weeks?) and it was more than a year ago, so it’s not really “my” trade-off to comment on (and I know things have improved considerably – see the parent of that post of mine). But in general – and, for what it’s worth, in this case, too – I think this is a very valid trade-off to make. lwIP offers adequate performance for many (most?) of the things for which you’d use a unikernel running on virtualized hardware, and it’s pretty solid and easy to integrate. It takes years for a network stack of any performance level to reach lwIP’s maturity; doing that in the early stages of building a new kernel would be both unmanageable and pretty much useless.
lwIP was chosen for speed of adoption, not necessarily for anything else - I fully expect us to evolve to something else in the future.
I don’t really know how to classify a network stack as fast, but I’ve worked with lwIP and we’ve pushed it to more than gigabit speeds in synthetic benchmarks (although we weren’t benchmarking lwIP itself but rather some other code around it).
Nanos looks very interesting, thanks for sharing!
I don’t get why some folks are annoyed about you posting a link to the site; it gives me a good idea of what Nanos and the ops tool do, at a glance.
Really? I have literally no idea what Nanos is or does except “runs applications” on a unikernel. Would you mind sharing what it is Nanos actually does?
Mind sharing your good idea? What does Nanos do, and what does the ops tool do? I couldn’t tell, at a glance. I didn’t find “run code faster than the speed of light” to be informative, the FAQ didn’t say much, and all 600-ish words of “the book” tell me it’s a unikernel that’s opinionated about security but don’t say what it’s good for or why I might want to use it.
I could tell from “getting started” that they wanted me to pipe some script from curl to sh, but I didn’t want to do that, especially before I understood what those things do and why I might want to use them.
Nanos is not Linux (it is an independent implementation, so for example, as noted above, it uses lwIP as a network stack, not the Linux network stack), but it can run Linux binaries, by emulating Linux’s ABI and syscall interface. It claims to run Linux binaries better than Linux in some sense, for example by being faster. Nanos probably can’t ever be better in Linux binary compatibility than Linux, but it claims to be good enough.
But it claims it can’t run on bare hardware, right, so I’d need to run it on top of KVM or Xen, which means I’d need to run Linux also? Even just to fire up a VM on VMWare workstation and try it out?
And “the book” claims it has no syscalls…
How is it faster at running Linux binaries than Linux is, if I need to fire up KVM or Xen in order to bootstrap it? Why wouldn’t I just statically link my binaries and run them directly on the host, sandboxed?
I’ve spent 5 minutes looking at the github site, which is a little more helpful than the site OP linked, but I’m still not sure what my use case is for this yet, even though it does sound interesting.
Let’s say you were running Linux on AWS, virtualized (a very common scenario). Since you weren’t running on bare hardware anyway, there is no disadvantage to switching to Nanos in that scenario, for that reason.
It has no syscalls in the sense that there is no transition from user mode to kernel mode. Your code runs in kernel mode, and Linux syscalls are emulated as function calls.
I haven’t done benchmarks myself, but it is very believable that Nanos runs some Linux binaries faster than Linux, since it does less than Linux. As I noted above, there is no syscall transition, which is often a significant overhead. There are also no users and no processes: you are the only user, and your code is the only process. This greatly simplifies what the “kernel” (quoted, since technically your code is the kernel) needs to do.
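To make the syscall-transition argument concrete, here is a rough Go micro-benchmark you could run on an ordinary Linux host. It is purely illustrative: it measures the user/kernel transition that Nanos claims to avoid, not Nanos itself, and the numbers vary a lot by CPU, kernel and mitigations.

```go
// Rough micro-benchmark: cost of a real syscall (user->kernel transition)
// versus a plain function call. Illustrative only.
package main

import (
	"fmt"
	"syscall"
	"time"
)

//go:noinline
func plainCall(x int) int { return x + 1 }

func main() {
	const n = 1_000_000

	start := time.Now()
	for i := 0; i < n; i++ {
		_ = syscall.Getpid() // enters the kernel on a normal Linux host
	}
	sysElapsed := time.Since(start)

	sum := 0
	start = time.Now()
	for i := 0; i < n; i++ {
		sum = plainCall(sum) // stays entirely in user space
	}
	callElapsed := time.Since(start)

	fmt.Printf("getpid syscall: %v, plain call: %v (sum=%d)\n", sysElapsed, callElapsed, sum)
}
```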
We actually do keep CPL for no other reason than that if you have certain pages that are, say, only readable, you don’t want an attacker to be able to make them writable (and that still gives us a speed boost). Context switching can imply a few different scenarios depending on context (heh!): kernel -> user, kthread to kthread, user -> user, etc. We’ve found that a lot of the heavy switching is actually a result of the fact that modern GPOSes like Linux have hundreds of user processes upon boot, even if the intention is to only run a single application (which is very common in server-side deployments).
Thanks for the correction.
Since you are here, I want to ask: why should I run Nanos instead of OSv? Since the two seem to be in a similar niche, a good comparison would be welcome.
First off, I very much respect the OSv engineers; however, the corporate side of that has moved on to https://www.scylladb.com/ which makes governance a bit of an issue (for us). There are a handful of architectural differences. We have strived to be as binary compatible with Linux as possible (ELF loading, using whatever libc the app is linked to, etc.). We also approached the niche from a security point of view rather than the performance view, so that’s where we’ve focused. It should be noted that all of these are very much operating systems, just not general purpose operating systems, and so each one has a lot of engineer hours that go into it. So each system excels in various areas where others might be deficient. This is compounded by the fact that most unikernel enthusiasts are motivated by different goals/needs (NFV, edge, performance, security, etc.).
I think where I’m getting lost is looking at the proposed getting started workflow captured in this image:
https://nanos.org/static/img/terminal-2.png
That looks to me like it’s running on a Linux host, and I think that is contributing to my confusion, if the real idea is that I should be replacing Linux with this on some public cloud hypervisor. Which leaves me needing to go understand the ops tool, I think, in order to assess whether this might fit into my workflow or offer any benefit.
Thanks for jumping in and explaining… I hadn’t spotted where in the stack this fits, and that’s helpful.
The image indeed shows it running on a Linux host. As I understand it, that is for convenience, and it launches KVM behind the scenes.
The real idea is indeed replacing Linux with Nanos on the public cloud. This chapter on how to create an AWS image should make things clearer.
The ops tool is complementary. You can envision it as something comparable to terraform/chef/puppet (but not really). The idea is that most end users provision their web application software (e.g. all websites) to various public cloud providers, and we support every single popular one. There’s a lot of API work involved that isn’t appropriate to put into the core nanos kernel. If you dive into nanos further you’ll find things like the filesystem manifest, which for real applications can get rather large (think JVM or Python applications); ops also helps in this regard by generating that on demand.
ops can be run on Linux and Mac today, and we have someone working on Windows as we speak.
It very much does have syscalls. That section is empty because it needs to be filled out. There are some syscalls we support 100%. Some are stubbed out (on purpose). Some are implemented differently.
You are right, it most definitely is not written for bare metal. It is written for most production web application environments, which run predominantly on the public cloud (AWS, gcloud, etc.) and use two layers of Linux: one for the hypervisor and one for the guest. This replaces the guest layer.
Thanks. That clarifies it some. I think I need to understand some details about the ops tool in order to evaluate further. If you’re working on the site, that might be a good thing to expand on.
One of the whole reasons for the site is to provide more documentation that would be hard to grok from the code alone.
Nanos is meant to be a server-side-only system for running Linux applications, with a focus on performance and security. As for the install script - it points to https://github.com/nanovms/ops/blob/master/install.sh which is viewable/editable on GitHub. You don’t have to use it and can opt to build from source instead.
I think I understood that the intention of the site is to provide info that would be hard to absorb from just looking at a git repository. The feedback I was offering is that the “what” and “why” information that might make me want to try it out isn’t robust enough for me to understand yet, and the “getting started” information that is on the site leaves me needing to go pick through a git repository and read the source code for an install script/orchestration tool in order to feel comfortable that I understand what it claims it will do before I run it.
I fully agree that we need/want more documentation/information - that’s the whole purpose of the site. We have decent documentation for OPS at https://nanovms.gitbook.io/ops/ that is fully PR-able on GitHub, but that does no justice to Nanos itself, which is way more technical, and since OPS isn’t the only horse in town we wanted a central place to detail Nanos, much like rust-lang.org.
Nanos serves static content almost twice as fast as Linux
Tested with Go (net/http)
If you’re testing static content, is there a reason to use something other than nginx?
We have nginx and many, many other applications available as packages (think apt-get).
The focus on Go is because we have a lot of random infrastructure tooling written in Go, so it is probably the best supported. Both ops.city and nanos.org are Go unikernels running on Nanos; ops has been up for more than a year now?
We have a project to do automated performance regression analysis and reporting so it’s our hope that this becomes extremely transparent/reproducible in the future.
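For reference, a static-content benchmark binary of the “Go (net/http)” kind is roughly this much code; the port and directory below are illustrative, not the actual benchmark setup.

```go
// Minimal static file server using net/http, roughly what a
// "serves static content" benchmark would exercise.
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve everything under ./public on port 8080.
	log.Fatal(http.ListenAndServe(":8080", http.FileServer(http.Dir("./public"))))
}
```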
Why so narrow-minded? https://en.wikipedia.org/wiki/Quantum_dot_cellular_automaton (for the humor-deficient among us, I’d like to clarify: that’s “humor”)
The security arguments against static linking aren’t about managing vulnerabilities. They’re about things like the lack of ASLR:
https://www.leviathansecurity.com/blog/aslr-protection-for-statically-linked-executables
You can do ASLR / PIE executables with statically linked programs. According to this article, it’s statically linked glibc that’s the issue, not statically linked programs in general. Here’s a proof of concept of statically linked PIE executables with zig. It hasn’t landed upstream yet, but it works fine.
It’s silly to assume Linux.
There are great options to build unikernels on, such as NetBSD rump kernels.
It’s on tens of millions of machines, powers the clouds, and who knows how many smartphones. It makes sense to build on such momentum. In contrast, building risky projects on other projects with low popularity usually leads to the new project not going anywhere.
Partly marketing. The massive codebase, API, drivers, and Linux-compatible code are definitely technical reasons to build on it. Even vendors of separation kernels all did Linux VMs for such reasons.
I think it is worth remembering that a lot of the unikernel stuff currently in the wild is not Linux. Rump kernels, Mirage, and others are pursuing different approaches and they are being successful. Unikernels are very specialized, and sometimes the things you listed as advantages are not where a given project’s focus is. It might be the case that Mirage is more attractive for a project than some Linux-based solution.
I think this point is under-appreciated. There are something like over 10 unikernel projects out there and almost every single research paper on them (which is not a small set) forks/tweaks existing projects into new ones.
they are being successful.
I think this would be the main point of contention. The stuff I mentioned has massive uptake, reuse, contributions, and tooling. Unikernels themselves are barely a blip on the radar in comparison. The projects using them probably have nowhere near as much revenue and contributions as Linux-based products and projects. Successful seems a stretch unless they’re aiming for gradual progress in their tiny niche. It’s also possible I missed major developments while I was away from security research.
This isn’t to discount any of their work. I like each project you mentioned. Antti’s work even happened with the system incentivizing him not to do it and griping at him for it. That’s an extra-respectable sacrifice. To further illustrate: if this were languages, this would be one group saying they were building their own platform gradually vs another saying it integrated with one or more of C/C++/.NET/Java/JavaScript’s runtime, libraries, and tooling. I’d see the latter getting more uptake for technical and marketing reasons, even if there were painful consequences to that choice.
All that said, I don’t know that I’d build a unikernel out of Linux. The Poly2 and OKL4 approaches of using pieces that help in isolating ways make more sense if I were digging deep. OKL4 let you run native drivers/code, stuff in VMs, and virtual drivers connecting to actual drivers in Linux VMs. That last approach, for drivers, syscalls, or apps, is how I’d use Linux code on top of lightweight hardening for it. The unikernel or whatever could run beside it, isolated somehow. IIRC, the Poly2 folks were straight-up deleting code they didn’t need out of the OS. An eCos-like, configurable fork of Linux that made that easy always seemed like a good idea.
Too sleepy to say more. Just some random/organized thoughts on all that before I pass out. :)
Reading your first post again, I now get a better sense of what you call technical aspects.
So I stand corrected: they are not “more like economical than technical arguments”, but more akin to “technical arguments entangled with an economical factor”.
Related:
There are a few strong indicators that Linux is entering a new era and is slowly but steadily getting more aspects of a microkernel
“technical arguments entangled with an economical factor”.
You can say that. It works for unikernels a bit if you look at any startups taking VC money to make one happen.
Thanks for the link. I’ll check it out when I can.
There are a few strong indicators that Linux is entering a new era and is slowly but steadily getting more aspects of a microkernel
That talk did rub me the wrong way, for its vast ignorance of the state of the art.
They didn’t even seem to understand what a microkernel is nor what the point of the approach is, which is quite unacceptable past 1995, when the famed Microkernel Construction paper by Liedtke was published.
This is a trait I found to be very common among ignorant Linux fanbois, who mostly do believe that the so-called Tanenbaum-Torvalds Debate was somehow “won” by Linus. The reality of the situation is that, unfortunately, while technically competent in other areas, Linus himself is almost as out of touch with decades of microkernel advancement as the fanbois are.
Also at FOSDEM, and much more interesting, was this track.
I particularly enjoyed Gernot Heiser and Norman Feske’s talks.
A massive driver and code base and support for everything under the sun feels more like “economical” than technical arguments. Also, a large part of that code base is written for POSIX rather than specifically Linux.
Linux has drivers; that is a strong argument for it. You don’t seem to like it very much, I’ve noticed in earlier comments, and I don’t really see why, given that the thing is open and free and as such can be used at least as a source of inspiration, and in the end as a source of code for a myriad of applications. Nobody is forcing the use of Linux; if a BSD or one of those fancy verified systems I’ve seen you mention in earlier posts fits the bill, those can be used as well. As long as the source is available for tinkering and the licence is amenable to that, I’d say hallelujah and get hacking.
The point of unikernels is that you do not need that many drivers anyway, as most of the devices available to you will be virtualised. So the number of existing drivers is not an argument. And as you will be working at the lowest level, the lack of a stable kernel API is a stronger argument than the number of available drivers.
Good point on virtualized hardware vs drivers.
So unikernels become about changing the API with the ecosystem from syscalls to generic virtualized driver interfaces (wrapped much as a libc wraps syscalls).
There is more work to do, and more overhead to talk to neighbors (network, network everywhere, though that also makes all IPC “network transparent”), but not as much as if a whole kernel with a whole set of drivers had to be written.
This depends on programming style also: do you use multiple small programs that get composed together (no, incompatible with unikernels), or big daemons that communicate with other big daemons (yes, compatible with unikernels)?
“You” is generic here, not particularly hauleth.
Also, I do not think you can use a unikernel to act as a driver? Maybe, though, you can use a unikernel focused on hosting a driver for some hardware, then exporting a service (over the network? what else is there to communicate with a unikernel?) to the rest of the systems (thus not requiring the network).
This can also be achieved with a process, though: for instance, plan9 mostly supports one type of filesystem, 9p, and for all other fs types a daemon reads the /dev/… block device and exports a 9p filesystem stream (let’s say over a UNIX socket*) to mount as 9p.
Also, I do not think you can use an unikernel to act as a driver?
Refer to NetBSD’s rump kernels.
then exporting a service
You’ll find the more you advance in these thoughts, the closer you get to a pure microkernel multiserver design.
From my perspective, unikernels are the technical representation of realizing something is wrong with UNIX’s design then trying to reinvent the wheel, without doing the necessary prior research on what’s available out there.
Now that gets technical and interesting! :)
I’d be happy if you could point me somewhere I can read about it; otherwise I’ll end up facing it after some time…
I don’t have any link at hand to a great article about the (bad; there are also good) side effects of Linux’s lack of driver APIs, but here are some advantages of having driver APIs:
And a disadvantage that gets mentioned a lot is that having driver APIs supposedly facilitates closed drivers and removes the incentive for open ones. As BSDs have excellent hardware support, there’s much doubt about whether this is true, but it’s also possible that this is in some form facilitated by the popularity of Linux, which does not implement driver APIs.
Just because a thing is everywhere does not mean it is good. Just because it has a massive codebase does not mean it is good code.
You’re talking abstractly instead of concretely about the specific project. It runs in everything from top-of-the-line smartphones to desktops to supercomputers. It must be good for something. The alternative(s) may be better in these use cases depending on what they’re trying to achieve, though. For instance, NetBSD’s codebase is supposedly designed to be very portable and easier to understand. Pieces of its code have been added to alternative OS’s before. I think QNX used its networking stack.
On the unikernel end, one person brought up NetBSD, with several mentioning projects that build on other projects. This probably isn’t so much technical as the usual social factors. In an interview, Antti just happened to be working on stuff for NetBSD, ran into problems, and built something that helped with those problems. That led to more interesting capabilities. Others built on what he had already put out, because folks often take the easy or obvious path. Others focused on leveraging a certain language, building theirs with that front and center. And so on.
The discussion we’re having is probably a different one than the unikernel authors had when each made their design decisions. ;)
There are a lot of advantages to using a popular technology that have nothing to do with marketing.
Here is a partial list:
What’s your plan for sustainability? This is a gigantic project; how do I know it will be maintained 5-10 years from now?
I can’t know for sure, and I’ll take any tips/hints. (: I am using it for my own email, so at least there’s that incentive to keep it going. It would certainly help to be with more! I also want to keep the maintainer burden low. No separate website. Releasing is mostly just adding a tag. The tests should help keep the code base in working order. And I wondered early on how to keep all the standards/RFCs in my head, and decided to heavily cross-reference the code with the RFCs, which helped a lot. Also, email is not evolving at a high pace… Once functionality is working it may not require all that much ongoing development.
This looks really impressive. The only thing on the not-yet-implemented list that I would miss is Sieve support. Do you have some documentation on your privilege-separation model?
this is all one process. go is supposed to do a good part of the protection. i imagine resource (ab)use could be an issue: memory and file descriptors. i’m aware of openbsd privsep principles. would you have ideas on where separations would be good to have?
and about sieve: i’ve never used it. how does one use it with current mail stacks? from memory, i think it is a way to match messages and take action on them, like moving them to a mailbox, or possibly setting flags? how does one configure the rules? just editing a text file on a server, in a web interface, or in a mail client?
That protects you against most memory safety bugs (though Go is not memory safe in the presence of concurrency - data races on slices in objects shared between goroutines can break memory safety), but that doesn’t protect you against logic bugs. A lot of these can be prevented by threading a capability model through your system and respecting the principle of intentionality everywhere, but that doesn’t mean that the principle of least privilege is something to ignore. Mail servers are among the most aggressively attacked systems on the Internet so it’s a good idea to aggressively pursue both.
At a minimum, I’d consider separating the pre- and post-authentication steps. If an attacker compromises the pre-auth process but doesn’t have valid credentials then they should find that they’ve compromised a completely unprivileged process.
The authentication should also then be a separate process. This may need some restricted filesystem access (or limited database connectivity), depending on how you store credentials (or possibly they’re loaded when the process starts, before it drops privileges, and the auth process is restarted whenever they change - with a target deployment of <10 users, that should be fairly simple), but it shouldn’t be allowed to create any network or IPC connections, or access most of the local filesystem.
The post-auth process should be confined to being able to inherit the network connection created when it is started and having access only to the mail store for the specific user. This ensures that no bug in the code that communicates with a user can perform filesystem accesses for other users. If the backing store is a database, the same applies, just use the database’s ACLs instead.
Some of the other services would also benefit from being compartmentalised. For example, you have spam filtering. A significant proportion of spam emails are trying to ship malware to the user, and compromising the mail server is a great way of doing this (and may avoid the need to compromise the client). Even the simple case of exhausting resources, so that the next spam email gets through the filters or all email processing stops, needs to be in scope for this kind of threat model, so you probably want to process each inbound email in a separate process that returns a single value (spam probability) to the parent and runs with tight resource limits, so the worst that an attacker can do is push an email past the filter.
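To make that concrete, here is a rough Go sketch of scoring one message in a short-lived, resource-limited child process. It is Linux-specific and purely illustrative: the re-exec pattern, the limits, and the stdin/stdout protocol are assumptions for the example, not how mox actually works.

```go
// Illustrative sketch only: score one inbound message in a short-lived
// child process with tight resource limits, so a malformed or malicious
// message can at worst crash the child (Linux-specific).
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
	"time"
)

func scoreInChild(msg []byte) (float64, error) {
	// Re-exec ourselves in "scoring" mode; the message goes in on stdin,
	// a single number comes back on stdout.
	cmd := exec.Command(os.Args[0], "-score-child")
	cmd.Stdin = bytes.NewReader(msg)
	var out bytes.Buffer
	cmd.Stdout = &out
	// Put the child in its own process group so it can be killed as a unit.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return 0, err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		if err != nil {
			return 0, err
		}
	case <-time.After(10 * time.Second):
		// Hung or resource-exhausted child: kill the whole group.
		syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		return 0, fmt.Errorf("scoring timed out")
	}
	return strconv.ParseFloat(strings.TrimSpace(out.String()), 64)
}

func runScoreChild() {
	// Cap address space at 256 MB before parsing anything.
	lim := &syscall.Rlimit{Cur: 256 << 20, Max: 256 << 20}
	_ = syscall.Setrlimit(syscall.RLIMIT_AS, lim)
	// ... parse the message from stdin and compute a spam probability ...
	fmt.Println("0.5")
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "-score-child" {
		runScoreChild()
		return
	}
	score, err := scoreInChild([]byte("Subject: hello\r\n\r\nhi"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("spam probability:", score)
}
```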
The component that does Let’s Encrypt / ACME things almost certainly needs to be isolated - anyone who compromises it can get certificates issued for keys that they control. I don’t know how much of ACME you’re implementing; mail servers often use the DNS-based variant, since a mail server may be pointed to by MX records that it does not have an A record for. A thing that can create and update DNS records is a very high-value target.
Similarly, DKIM keys are high value (if they can be compromised then an attacker can send email that is indistinguishable from email that you sent). Signing should be done in a separate process that just does the signing, so that even an attacker who compromises a client connection can mount an online attack but can’t exfiltrate the keys.
As I recall, Ben Laurie added support for Capsicum to the Go standard library some years back, so these kinds of thing are fairly easy to add in Go programs.
This is just off the top of my head without thinking things through in too much detail. You can probably do a lot better understanding the shape of your code. I’d encourage you to think about three things:
The last high-profile Exchange bug was a violation of the Principle of Intentionality: Exchange had access to a system location, but intended to write a configuration file into the configuration-file directory and did not use anything vaguely like a capability system to prevent this. Capsicum and similar systems make this easy: you would have a directory descriptor for the configuration directory, use it with openat, and so be unable to create a config file anywhere else.

ManageSieve is a protocol for sending Sieve scripts from the client to the server. Clients expose it in different ways. I’ve seen a couple of things that look like the rule editor in Outlook or Mail.app, but I tend to use a Thunderbird plugin that exposes the scripts directly. It is a bit nicer than just editing the script on the server because ManageSieve doesn’t let you install a script with syntax errors, and it lets the server report the error to the client, so the client doesn’t need to know every extension that the server supports (and there are a lot).
Dovecot also has support for IMAPSieve, which runs sieve scripts in response to events. This is most commonly used for detecting things being added to or removed from a spam folder to trigger learning. I think you have built-in support for that? It can also be used for things like providing a virtual mailbox that auto-files email according to your latest rules if you drop mails there, or running external scripts so you can copy an email with a calendar attachment into a mail box and have the attachment passed to your calendar, and other automation workflows.
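Coming back to the openat point above: a rough Go sketch of the directory-descriptor pattern, using golang.org/x/sys/unix. The directory and file names are made up for illustration, and a full sandbox would still want something like openat2's RESOLVE_BENEATH or Capsicum to stop ".." and symlink escapes.

```go
// Sketch: hold a descriptor for the config directory and only ever open
// files relative to it with openat, instead of passing absolute paths around.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	dirfd, err := unix.Open("/etc/myapp", unix.O_DIRECTORY|unix.O_RDONLY, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open config dir:", err)
		return
	}
	defer unix.Close(dirfd)

	// Open (or create) a file relative to the directory descriptor only.
	fd, err := unix.Openat(dirfd, "app.conf", unix.O_CREAT|unix.O_WRONLY|unix.O_NOFOLLOW, 0o600)
	if err != nil {
		fmt.Fprintln(os.Stderr, "openat:", err)
		return
	}
	f := os.NewFile(uintptr(fd), "app.conf")
	defer f.Close()
	fmt.Fprintln(f, "# written via openat, relative to the config dir")
}
```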
thanks, that’s a lot of good info!
valid point about the logic bugs. i’m wondering how difficult it is to take over a go process. whatever the answer, having separated privileges as a layer of defence will certainly make it safer. pre-auth and post-auth, and per-logged-in-user-processes, and key-managing-processes all sound right. i’m going to put it on the todo list.
mox uses tls-alpn-01, which is why it needs port 443 (which it also needs for mta-sts and autoconfig).
but managing dns records is an interesting topic. i would like to be able to do that, mostly to make it easier to set up/manage mox (i believe many potential mox admins would be pasting dns records in some web interface zone import field. if they are lucky. and creating records one by one in a web interface otherwise. also, with dns management, mox could automatically rollover dkim keys in phases, update mtasts policy ids, etc). but i don’t know of a commonly implemented dns server api i would use. i don’t want to make it harder to set up mox. if anyone knows there is a way, please let me know!
makes sense, although it’s yet another protocol to implement… i personally am probably fine going to a web page and editing the script there. mox already has a web page where you can manage (some of) your account settings. it currently only has basic rules for moving messages to a mailbox when they are delivered, see Rulesets in https://pkg.go.dev/github.com/mjl-/mox/config. these can be edited in the accounts page.
yeah, i recently added a simple approach for setting (non)junk flags based on the mailbox a message is delivered/moved/copied into. see https://github.com/mjl-/mox/blob/ad51ffc3652ff19a1265fe2831c83ebf669ecdc3/config/config.go#L210. i looked at mail clients, but did not see behaviour to set those flags conveniently, e.g. “archive” in thunderbird does not mark a message as nonjunk, etc.
interesting. this is certainly not possible in mox. there isn’t even the notion of a user (uid) to run such scripts as. sounds like adding useful sieve support may need that. this is going a bit lower on the todo list. (:
I recommend not thinking in terms of “Go processes”, as a process is a process is a process.
Another good example of real world recent vulnerability in mail servers is CVE-2020-7247 and its resulting security errata: a remote code exploit in OpenSMTPD from 2020.
Priv-sep specifically didn’t mitigate against that vulnerability and it had nothing to do with memory safety, but being able to have a mental model of which parts of your program are operating with specific capabilities will make it easier to audit and easier to respond to inevitable vulnerabilities. It’s not a matter of “if”, but “when”.
It should really be underscored that without doing the extra legwork with capabilities or bubblewrap or putting different apps under different users or really anything else (there are a ton of methods here), merely deciding to have extra functionality in a separate process owned by the same user offers absolutely no extra security protection, and in the case of “process each inbound email in a separate process” it has considerable downsides. You do mention restricting fs/ipc/net in the auth paragraph, though. I guess what I’m saying is that end users need to be aware there is extra work to be done if they take that route. Sure, you get a different stack/heap so you aren’t hitting footgun issues, but once that process is owned the rest doesn’t matter. In this example mox suggests creating a mox user, with seven additional setup commands. That’s great, and I bet a lot of people ignore it and just sudo su their way to freedom because it’s not enforced. If mjl decides to break this into separate processes there is going to be a lot more setup involved. I point this out because there is a lot of language online about “just put it in another process”, and then the end user is not told, or is unaware, that they need to do all the extra work required to actually reap the benefits.
noted. ideally i would prefer to do all the separation in mox itself, not requiring extra tools like bubblewrap.
about the additional commands, i probably should validate permissions at startup. mox currently only checks that it doesn’t run as root.
Please ignore the above message. None of this needs to impact the end-user experience. Dropping privilege for a child process without creating different users is supported on all major operating systems.
Almost none of what you say is true on any vaguely modern *NIX operating system. They all provide mechanisms for a process to drop privileges and run with less than the ambient authority of the user it started as. FreeBSD has Capsicum, XNU has the sandbox framework, OpenBSD has pledge, Linux has seccomp-bpf / Drawbridge / whatever they are doing this week.
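For instance, on OpenBSD the pledge route can be taken straight from Go via golang.org/x/sys/unix. The promise strings below are illustrative; a real daemon would pick exactly what it still needs after setup.

```go
//go:build openbsd

// Sketch: shedding privileges in an already-running process with pledge(2).
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// After setup (listeners opened, config loaded), keep only basic I/O on
	// already-open descriptors, sockets and DNS. The second argument covers
	// exec'd children; empty here since this sketch doesn't exec anything.
	if err := unix.Pledge("stdio inet dns", ""); err != nil {
		log.Fatalf("pledge: %v", err)
	}
	fmt.Println("now running with reduced privileges")
}
```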
I didn’t claim that. If you look at my very first sentence, I’m pretty clear that there are plenty of methods to deal with this; however, merely spawning a new process does not give you inherent added security, which is what gets implied in many articles and discussions, and which was my whole point.
You put your skin in the game and contribute.