1. 24
    1. 1

      A nice and thorough post. If I had read it a couple of months back, it would have saved me quite a bit of time fumbling around with this and figuring it out for myself.

      I’d like to add a few interesting anecdotes that are related to this API:

      • The userland MMIO interface can only express plain loads and stores (the access type is a boolean, not an enum) of up to 8 bytes (the data field is an 8-byte array). So PCIe atomics and transactions larger than 8 bytes are inexpressible; CPUs can initiate the latter with vector loads, for instance (see the sketch after this list).
      • Port-mapped IO is recommended over memory-mapped IO for the performance of guest<->VMM communication. I think this is because KVM has less demultiplexing to do before concluding that the VMM needs to be notified: a port access exits with the port, size and direction already in the exit information, whereas an MMIO access needs the fault handling and instruction decoding described below.
      • To provide the nice MMIO API to the VMM, KVM must disassemble the (page-)faulting instruction. Some AMD CPUs have an acceleration feature (decode assists) that copies the instruction bytes to the hypervisor on VMEXIT, saving the hypervisor the work of reading them back from guest memory.
      • This disassembler does not support SSE at present, so KVM currently doesn’t even fragment a 16-byte MMIO access into 2 x 8-byte transactions; it just injects an undefined-instruction exception into the guest. QEMU’s TCG (its userland JIT engine) does fragment such accesses.
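
      To make the first two points concrete, here is a minimal sketch (mine, not from the post) of how a VMM might dispatch these two exit reasons after KVM_RUN returns; the mmio_*/pio_* helpers are hypothetical device-model functions, not part of the KVM API:

      ```c
      #include <linux/kvm.h>
      #include <stdint.h>

      /* Hypothetical device-model dispatch functions, not part of KVM. */
      void mmio_read(uint64_t addr, uint8_t *data, uint32_t len);
      void mmio_write(uint64_t addr, const uint8_t *data, uint32_t len);
      void pio_read(uint16_t port, uint8_t *data, uint8_t size, uint32_t count);
      void pio_write(uint16_t port, const uint8_t *data, uint8_t size, uint32_t count);

      /* 'run' is the vcpu's mmap'ed struct kvm_run, inspected after
       * ioctl(vcpu_fd, KVM_RUN, 0) returns. */
      void handle_exit(struct kvm_run *run)
      {
          switch (run->exit_reason) {
          case KVM_EXIT_MMIO:
              /* Only a direction flag, a length and up to 8 bytes of data:
               * wider or more exotic PCIe transactions cannot be described. */
              if (run->mmio.is_write)
                  mmio_write(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
              else
                  mmio_read(run->mmio.phys_addr, run->mmio.data, run->mmio.len);
              break;
          case KVM_EXIT_IO: {
              /* Port IO: port, size, direction and count arrive directly in
               * the exit information; the data sits at an offset inside 'run'. */
              uint8_t *data = (uint8_t *)run + run->io.data_offset;
              if (run->io.direction == KVM_EXIT_IO_OUT)
                  pio_write(run->io.port, data, run->io.size, run->io.count);
              else
                  pio_read(run->io.port, data, run->io.size, run->io.count);
              break;
          }
          }
      }
      ```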
      1. 1

        CPUs can initiate the latter with vector loads, for instance.

        I don’t believe this is the case. Vector loads and stores are not guaranteed to be atomic; they can be split into smaller accesses, and typically are for uncached memory and on systems where the memory bus is narrower than the vector width. This is why recent versions of AArch64 and x86-64 have added dedicated instructions for 64-byte loads and stores (even if the memory bus width is narrower, these are tagged in transit and reassembled at the PCIe controller). At least the Arm versions are permitted only on uncached memory, so I’m not sure you can emulate them in a hypervisor. You probably don’t want to anyway, because you would only want them for performance, and a paravirtualised device with an in-memory descriptor ring will do better than emulating full PCIe messages.
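
        For the x86-64 side, a rough sketch of what that looks like in practice, assuming a CPU with MOVDIR64B and a 64-byte-aligned, uncached/write-combining BAR mapping (the names bar and desc are made up; the AArch64 counterpart would be the LD64B/ST64B family from FEAT_LS64):

        ```c
        #include <immintrin.h>  /* _movdir64b; build with -mmovdir64b */

        /* Posts one 64-byte descriptor to a device doorbell as a single,
         * non-torn 64-byte write. A plain AVX-512 store of the same width
         * gives no such guarantee and may be split on the bus. */
        static void post_descriptor(volatile void *bar, const void *desc)
        {
            _movdir64b((void *)bar, desc);  /* 'bar' must be 64-byte aligned */
        }
        ```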

        1. 1

          Hm, I’m certain vector accesses work, but I’m not certain if/when they get fragmented. A recording may be needed to confirm which is the case. Regarding usefulness, it would be nice to have this supported in virtualization scenarios not from a performance perspective, but from a running-unmodified-software perspective.