1. 10
  1. 3

    Interesting stuff! Maybe my overlay networking knowledge is below the intended audience, but I wish there were more exposition or links about the “well-documented drawbacks” and the “better reliability and scalability characteristics” mentioned.

    The Kubernetes networking space is sadly filled with myths and brand-loyalty since so few of its users actually understand it (myself included, at least about the overlay part since that abstraction has proven to be non-leaky for me). So more writing like this from people who actually understand the stuff is very welcome!

    1. 4

      The Kubernetes network model requires each pod (co-located process group) to have its own IP address. The IPv4 address space just isn’t large enough to accommodate that gracefully.

      Let’s say you use 10.0.0.0/8 for internal network addresses. Each datacenter might be allocated a /16, so you can have 256 datacenters with ~65,000 machines each. Sounds great! You’ve got a lot of scaling headroom.

      Now you want to introduce Kubernetes. Each pod gets an IP, so instead of having room for 65,000 big beefy rack-mounted machines, your address space has room for 65,000 tiny ephemeral pods. The operator is forced to choose between two bad options.

      Option A: Static assignment of IPv4 prefixes to each machine. If you allocate a /17 to Kubernetes pod IPs, and have 2000 machines, then you can run at most 16 pods per machine.

      • That’s an order-of-magnitude reduction in maximum per-cluster machine count (65,000 -> 2000).
      • Utilization of machine capacity is low because 16 cpu=1 services can take all the pod IPs on a 100-core machine.
      • Utilization of IPv4 addresses is low because a single big workload might take up a whole machine (stranding 15 addresses).
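The Option A arithmetic can be checked with a quick sketch. The /17 prefix and the 2000-machine count come from the comment above; the concrete network `10.0.0.0/17` and the helper name `pods_per_machine` are illustrative assumptions, and the sketch assumes every address in a machine's block is usable as a pod IP.

```python
import ipaddress


def pods_per_machine(pod_cidr: str, machines: int) -> int:
    """Size of the largest equal, prefix-aligned block each machine can get."""
    total = ipaddress.ip_network(pod_cidr).num_addresses
    per_machine = total // machines
    if per_machine == 0:
        return 0
    # Statically assigned prefixes come in power-of-two sizes,
    # so round the per-machine share down to a power of two.
    return 1 << (per_machine.bit_length() - 1)


# A /17 holds 32,768 addresses; split across 2000 machines that is a /28 each.
print(pods_per_machine("10.0.0.0/17", 2000))  # prints 16
```

Even giving the whole /16 to pods only doubles the per-machine limit to 32, which is why static assignment trades away so much headroom.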

      Option B: An overlay network with dynamic assignment of pod IPs from a shared pool. This improves utilization because IPs can be assigned where they’re needed, but now you’ve got a bigger problem: knowing a pod IP is no longer enough information to route a packet to it!

      • Now you need some sort of proxy layer (iptables/nftables, userspace proxy, “service mesh”) running on each machine.
      • Every time a pod gets rescheduled, it gets a new IP, and every machine in the cluster needs to update its proxy configuration. The update rate scales with (cluster size * pod count * pod reschedule rate), which is too many multiplications.
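To make the scaling claim concrete, here is a rough sketch of that multiplication. The function name `proxy_updates_per_day`, the 16-pods-per-machine density, and the one-reschedule-per-pod-per-day rate are all hypothetical numbers chosen for illustration, not measurements from any real cluster.

```python
def proxy_updates_per_day(machines: int, pods: int, reschedules_per_pod: int) -> int:
    """Proxy config updates per day if every machine must learn every new pod IP."""
    # Each reschedule hands out one new pod IP...
    ip_changes = pods * reschedules_per_pod
    # ...and every machine in the cluster has to apply that change.
    return machines * ip_changes


# Assumed density: 16 pods per machine, each pod rescheduled once a day.
for machines in (100, 1000, 2000):
    pods = machines * 16
    print(machines, proxy_updates_per_day(machines, pods, 1))
```

Because pod count grows with machine count, doubling the cluster quadruples the update volume under this model.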

      When you see the dozens of startups involved in Kubernetes networking, which all promise better throughput or lower latency or whatever, they’re fighting for the market in tooling created by “Option B”.

    2. 2

      Is there a reason you couldn’t just give the pods those actual IPv6 addresses without the 6to4 overlay? That is, this boils down all the pod IPs to one IPv4 source address that can be handled by conventional ARP + switching. So is there any way to do “ARP for an IPv6 prefix” and switch on MACs resolved that way without dragging in the routing layer? You could probably run radvd on all the kubelets advertising their own personal IPv6 subnets, but I don’t know how well that routing table scales to 10k+ advertised routes. Do switches listen to those kinds of advertisements and handle routing, or is that purely in kernel on each machine participating in the network?

      1. 3

        Is there a reason you couldn’t just give the pods those actual IPv6 addresses without the 6to4 overlay?

        You’d need to be running an IPv6 network. If you’re doing that then you don’t need a 6to4 overlay[1]. There are unfortunately still many hosting providers that don’t offer IPv6 addresses on their internal networks.

        [1] Speaking generally. There are non-routing reasons you might still want a network overlay, for example QoS in a cluster shared between different customers, or enforced encryption. These would probably be 6in6, GRE, or something specialized like WireGuard.