We’ve run into at least a couple of additional cases worth considering:
1. Terminating a pod triggers various async state changes in the cluster, including propagating network config updates (each host updating its iptables and network overlay configuration), load balancer updates, and the like. It’s possible, perhaps likely in some setups, that a pod terminates well before the network changes have converged. When this happens, requests may still be forwarded to the now-terminated endpoint, resulting in failures. Using a preStop hook with a sufficiently long delay works around this. They suggest using a preStop hook for other reasons, and show a ~4 second delay, which may implicitly be sufficient to work around this on their end.
2. Your service must handle SIGTERM properly. For example, your service may be handling a long-running request. If your service tears itself down immediately upon receiving the SIGTERM, any in-flight requests will fail. Instead, you’ll need to trap the SIGTERM, wait for in-flight requests to complete, and only then shut down. Assuming you’ve addressed the networking issues in #1, you do not need to worry about handling new requests arriving after the initial SIGTERM, since at that point no new traffic is routed to the pod. Luckily, more and more frameworks these days support graceful termination. Note that there’s a configurable hard limit, terminationGracePeriodSeconds, that caps how long a service is given to finish shutting down cleanly.
Great points, and I would add that I don’t see enough of this advice out in the wild. Proper use of preStop and proper signal handling are absolutely imperative for zero-dropped-packet deploys in Kube. I expect this catches out organisations the world over.
Indeed. That’s the reason for the 2 second delay at the beginning of our preStop.
Lots of trial and error!
2 seconds ain’t all that bad! We use 15, which works, but I’d sure like a deterministic way to know when it’s safe to start terminating (I appreciate the complexity of this, as there are a lot of moving parts, even more so at scale). I’ve seen anecdotes that GKE deployments need 30+ seconds… trial and error, as you say : )