1. 17

  2. 1

    I know that QUIC is implemented in userspace so that they can iterate on it quickly, but is implementing it as a Linux kernel module possible or on a roadmap anywhere?

    1. 3

      > so that they can iterate on it quickly

      There are other reasons too - to be able to deploy it on a wide range of systems, to have application control of pacing, etc.

      > but is implementing it as a Linux kernel module possible or on a roadmap anywhere?

      I don’t know of anyone working on this at the moment, and I’m not sure I’d think of it in either/or terms. One example: as they mentioned in the article, QUIC ACKs are encrypted. You don’t want to stuff OpenSSL or similar into the kernel, but you also don’t want the kernel calling back to userspace for all decryption - so what would “QUIC as a kernel module” look like? One possibility is a hybrid mode, something like kTLS, where the handshake happens in userspace and the symmetric keys are then handed off to the kernel.
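      To make that concrete, here is roughly what the key handoff looks like with today’s kTLS interface on Linux (TCP only; a kernel QUIC module would presumably need an analogous interface for QUIC’s packet-protection keys):

      ```c
      /* Sketch: after a userspace TLS handshake, hand the symmetric TX key
       * to the kernel. From here on, write()/sendfile() payloads are
       * encrypted in-kernel (or on the NIC, if it supports TLS offload). */
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/tcp.h>   /* SOL_TCP, TCP_ULP */
      #include <linux/tls.h>     /* SOL_TLS, TLS_TX, crypto structs */

      int enable_ktls_tx(int sock, const unsigned char *key,
                         const unsigned char *iv, const unsigned char *salt,
                         const unsigned char *rec_seq)
      {
          struct tls12_crypto_info_aes_gcm_128 ci = {0};

          ci.info.version = TLS_1_2_VERSION;
          ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
          memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
          memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
          memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
          memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

          /* Attach the TLS upper-layer protocol, then install the TX key. */
          if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
              return -1;
          return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof(ci));
      }
      ```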

      In general, this is how I would think about QUIC’s performance: start 100% in userspace, then progressively find which parts should be offloaded to the kernel to make it fast, and develop those APIs. Right now, the most important pieces are hardware GSO and zero-copy, done in a way that leaves userspace in control of pacing. After that I expect there will be work on hardware crypto offload.
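      For the GSO piece, Linux already exposes UDP segmentation offload to userspace via UDP_SEGMENT: a QUIC stack hands the kernel one large buffer per sendmsg() and the kernel (or NIC) splits it into MTU-sized datagrams. A minimal sketch, error handling elided (the #define fallbacks cover older headers):

      ```c
      /* Sketch: UDP GSO. One sendmsg() of up to ~64KB; the kernel/NIC
       * segments it into gso_size-byte UDP packets, cutting per-packet
       * syscall overhead for a userspace QUIC sender. */
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <netinet/udp.h>

      #ifndef SOL_UDP
      #define SOL_UDP IPPROTO_UDP
      #endif
      #ifndef UDP_SEGMENT
      #define UDP_SEGMENT 103    /* from <linux/udp.h> on older systems */
      #endif

      ssize_t send_gso(int fd, const void *buf, size_t len, uint16_t gso_size)
      {
          struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
          char ctrl[CMSG_SPACE(sizeof(uint16_t))] = {0};
          struct msghdr msg = {
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
          };

          /* Per-call segment size; could instead be set once with
           * setsockopt(fd, SOL_UDP, UDP_SEGMENT, ...). */
          struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
          cm->cmsg_level = SOL_UDP;
          cm->cmsg_type = UDP_SEGMENT;
          cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
          memcpy(CMSG_DATA(cm), &gso_size, sizeof(uint16_t));

          return sendmsg(fd, &msg, 0);
      }
      ```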

      1. 2

        I’m surprised the CDN people aren’t all-in on removing the kernel boundary as much as possible (netmap, DPDK, VFIO, unikernels…)

        1. 4

          Speaking as an ex-CDN person: a little goes a long way. 🙂

          CDNs contain a lot more business logic than you’d expect, so raw network performance is important but has to be weighed against many other factors.

          My problem with some technologies like DPDK is that they can end up being “all or nothing.” Taking QUIC as an example - a QUIC client always needs the option to fall back to TCP, so what does that mean for my CDN fleet? If I write a high-performance DPDK QUIC implementation, do I also need to write or ship a custom TCP implementation? Do I need to segment my fleet between DPDK QUIC hosts and Linux TCP hosts (which is either very difficult, because predicting capacity across the two is hard, or impossible, because many CDN installations are tiny and that kind of segmentation is cost-prohibitive)? If I use DPDK’s KNI to bounce non-UDP traffic to the kernel TCP stack, what does that mean for performance - how much overhead does it introduce, and how many bugs am I going to “discover” (because how many people in the world are using that configuration)? If I instead use a bifurcated driver, what NICs does that force me into buying?

          So, using something like DPDK as a policer/traffic shaper: yes. Using it as an L4 load balancer: maybe. Using it as the underlying network stack for an L7 proxy fleet: unlikely.

          In general I don’t want to do everything. But there are a few very important things I want to do fast - and XDP/eBPF/kernel offloads generally get me what I want in a package that’s easier to build/iterate/debug than full bypass.
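          For a sense of what “easier to build/iterate/debug” means here: an XDP program that counts likely-QUIC packets (UDP to port 443) is a few dozen lines of C and loads into a running kernel, with no custom drivers or dedicated cores. An illustrative sketch (map and function names are mine):

          ```c
          /* Sketch: XDP program counting UDP/443 (likely QUIC) packets per CPU.
           * Build with clang -O2 -g -target bpf -c; attach with ip or bpftool. */
          #include <linux/bpf.h>
          #include <linux/if_ether.h>
          #include <linux/ip.h>
          #include <linux/in.h>
          #include <linux/udp.h>
          #include <bpf/bpf_helpers.h>
          #include <bpf/bpf_endian.h>

          struct {
              __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
              __uint(max_entries, 1);
              __type(key, __u32);
              __type(value, __u64);
          } quic_pkts SEC(".maps");

          SEC("xdp")
          int count_quic(struct xdp_md *ctx)
          {
              void *data = (void *)(long)ctx->data;
              void *end  = (void *)(long)ctx->data_end;

              /* Bounds-check each header before reading it; the verifier insists. */
              struct ethhdr *eth = data;
              if ((void *)(eth + 1) > end || eth->h_proto != bpf_htons(ETH_P_IP))
                  return XDP_PASS;

              struct iphdr *ip = (void *)(eth + 1);
              if ((void *)(ip + 1) > end || ip->protocol != IPPROTO_UDP)
                  return XDP_PASS;

              struct udphdr *udp = (void *)ip + ip->ihl * 4;
              if ((void *)(udp + 1) > end)
                  return XDP_PASS;

              if (udp->dest == bpf_htons(443)) {
                  __u32 key = 0;
                  __u64 *count = bpf_map_lookup_elem(&quic_pkts, &key);
                  if (count)
                      (*count)++;
              }
              return XDP_PASS;
          }

          char _license[] SEC("license") = "GPL";
          ```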

          Edit: I have more interest in DPDK for something like a VM substrate network. But in that case it’s a shame to use cores that could be rented to customers, so I’d favor smart NICs if possible.

        2. 1

          > You don’t want to stuff OpenSSL or similar into the kernel, but you also don’t want the kernel calling back to userspace for all decryption - so what would “QUIC as a kernel module” look like?

          For inspiration here, I’d look at what Netflix has done with TLS in the FreeBSD kernel. Connection establishment and key exchange are all in userspace, but the bulk crypto is in the kernel. That’s necessary to make APIs like sendfile work: the kernel needs to be able to DMA data from disk, encrypt it, and DMA it out over the network without userspace ever doing a copy.
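          Linux’s kTLS gives you the same shape: once the handshake is done in userspace and the symmetric keys are installed on the socket, the fast path is just sendfile(). A sketch, assuming a socket already set up for kTLS TX as in the snippet upthread:

          ```c
          /* Sketch: zero-copy encrypted send. `sock` must already have kTLS TX
           * enabled (TCP_ULP "tls" + TLS_TX setsockopts); the kernel reads the
           * file, encrypts, and transmits without the bytes ever touching
           * userspace. */
          #include <sys/sendfile.h>

          ssize_t serve_file(int sock, int file_fd, off_t *offset, size_t count)
          {
              return sendfile(sock, file_fd, offset, count);
          }
          ```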

          A few NICs are now able to offload the bulk-crypto parts of TLS (which are more or less the same in QUIC), so I wouldn’t be surprised to see something similar for kernel interfaces.

          Once you move the bulk crypto into the kernel, you can also segregate key management into a separate process quite easily, which has nice security properties. During session establishment (and key renegotiation), the file descriptor for the socket can be owned by a process that has access to the certificate and private key. Once the symmetric key for the connection is established, the file descriptor can be passed to another process and all of the crypto can happen in the kernel. This makes it easy for TLS/QUIC libraries to expose APIs that keep the private key entirely out of the process doing the work, without adding an extra copy or context switch on the fast path.
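          The fd handoff in that design is ordinary SCM_RIGHTS passing over a Unix domain socket. A sketch of the sending side, i.e. the privileged process that held the private key for the handshake (names are mine):

          ```c
          /* Sketch: the privileged process hands the connection's fd to an
           * unprivileged worker over a Unix socket. After this, bulk crypto
           * happens in the kernel and the worker never sees the private key. */
          #include <string.h>
          #include <sys/socket.h>

          int send_fd(int unix_sock, int conn_fd)
          {
              char byte = 0;
              struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
              char ctrl[CMSG_SPACE(sizeof(int))] = {0};
              struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
              };

              /* The kernel duplicates the descriptor into the receiver. */
              struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
              cm->cmsg_level = SOL_SOCKET;
              cm->cmsg_type = SCM_RIGHTS;
              cm->cmsg_len = CMSG_LEN(sizeof(int));
              memcpy(CMSG_DATA(cm), &conn_fd, sizeof(int));

              return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
          }
          ```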

          1. 1

            Yup! This is the type of thing I was referring to in the sentence that followed - “One possibility is a hybrid mode, something like kTLS.” kTLS is the approximate Linux equivalent of Netflix’s FreeBSD work.