No mention of related work, but it’s worth noting that most other VMMs do something similar. I think KVM now uses some SELinux policies to restrict the things that the QEMU devices can use. Bhyve uses Capsicum for the management process (which contains the emulated devices), so it has no access to things that the VM shouldn’t be able to reach. For extra paranoia, the bhyve bits can also be jailed, so you need a Capsicum escape and then a jail escape to touch anything on the host. This is most useful when combined with ZFS delegated administration, because it lets you do snapshots of VM disks from within a jail.
I’m still a bit sad that emulated devices work this way. There was a prototype in Xen ages ago (2008ish?) for reflecting VM exits back into the VM to run some code injected by the hypervisor into the region reserved for the BIOS. This would let you expose a narrow paravirtualised interface to the VM from outside and do all of the device emulation ‘inside’ the VM, so an attack from the guest kernel on the device emulator gains no new privileges. You still probably want to sandbox the PV host component, but with S-IOV / Revere there’s going to be a lot less need for that part at all (it’s basically going to be just for storage, and block storage is a simple interface to secure).
…most other VMMs do something similar.
I think you’re missing the part about moving emulated devices into individual processes. Yes, modern VMMs use OS-level capabilities to harden themselves… but not all of them put the virtio network device emulation in its own process.
(I’m the author, btw.)
Xen had some options to do that around the 3.0 era, but they largely went away because they didn’t make sense in the threat model. If the device emulators are per-VM, and they’re limited to doing things that the VM is meant to do, then further isolating them doesn’t gain you anything: you assume that they are all fully compromised, from the perspective of your threat model.
I realize I am out of my depth with this discussion, but why do you assume that they are fully compromised? Suppose a remote compromise is found in the vmd network driver, such that the network driver is executing malicious code. If the network driver is properly sandboxed, then it can be limited to the basic activities of reading and forging packets. A malicious actor is certainly at the front door. But that doesn’t imply that a compromise of the network driver would result in the capability to read the contents of the block device, or arbitrary memory, or to interact with, for example, a trusted platform device that’s been passed through to the VM and may contain sensitive keys. This is just what the paper says AFAICT. It seems like that may be the difference between a SEV1 and a SEV3 event from a business standpoint? Don’t different organizations have different threat models?
Don’t confuse drivers with devices. My work was on hardening the host-side device that needs to talk to a guest-side driver and also expose itself to the network (like with a virtio network device). It’s not about protecting other guest resources, but reducing the foothold for exploiting the host.
What you say is true if the guest is a microkernel, but typically it’s a monolith or a unikernel. This means that anything that can attack the device emulator is running with ring 0 permission in the guest. Using a compromise in one emulated device to attack another isn’t useful because the attacker already has the ability to talk to that device directly. A guest can already write arbitrary blocks to the block device and so attacking the emulated NIC to gain this ability doesn’t make sense.
The place where it might make sense is if you have a bug-free block-device emulator that has access to some ioctls that can be used for local privilege elevation, and a bug in the NIC emulator that can be exploited to give arbitrary code execution. That’s the sort of thing I’m happy to see defences against if they come with no cost, but it should be a lower priority.
A vTPM or similar is an exception to this and would merit process isolation (and, ideally, isolation from the host OS as well).