  2. 7

    Clicked, expecting to see Kubernetes mentioned.

    Was not disappointed.

    At a previous job, nodes in our EKS cluster would randomly lose DNS to other services in the cluster. AWS support couldn’t figure it out; our only solution was to terminate and replace the affected node whenever it happened.

    I’ll have to watch the video and see if their problem is anything like what we faced.
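
    In case it helps anyone scripting that workaround, here is a minimal sketch of the "cordon, then terminate and let the auto scaling group replace it" approach (nothing from the video; the node and instance IDs are made up, and it assumes the node group is backed by an ASG that will bring up a replacement):

    ```python
    # Sketch only: cordon a broken EKS node, then terminate its EC2 instance so
    # the node group's auto scaling group replaces it. Assumes kubeconfig and
    # AWS credentials are already configured; all names here are hypothetical.
    import boto3
    from kubernetes import client, config

    def replace_node(node_name: str, instance_id: str) -> None:
        config.load_kube_config()
        v1 = client.CoreV1Api()

        # Cordon: stop new pods from being scheduled onto the broken node.
        v1.patch_node(node_name, {"spec": {"unschedulable": True}})

        # Terminate the instance; the ASG brings up a fresh node to replace it.
        boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])

    # Example call (hypothetical node name and instance ID):
    # replace_node("ip-10-0-1-23.ec2.internal", "i-0123456789abcdef0")
    ```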

    1. 6

      I skipped to the end. It wasn’t about Kubernetes. I’m not sure if I should spoil it here. The journey is the bigger part of the talk, and a good lesson to take is to be willing/able to dig into how your libs work (not just YOUR code, but underlying dependencies). Implementation details can be important.

      1. 1

        > At a previous job, nodes in our EKS cluster would randomly lose DNS to other services in the cluster. AWS support couldn’t figure it out; our only solution was to terminate and replace the affected node whenever it happened.

        This has happened to me once or twice (in the past year), as well. Terminating the instance was also the only fix I could find.

      2. 5

        This is a really good debugging deep dive, so many twists and turns!

        1. 1

          I think this really comes down to the important distinction between abstraction and obfuscation. I don’t believe that RPC is an abstraction on top of message passing; I believe it is an obfuscation. Even their fix is a bit wrong, because, as they said, those same values were correct for a different use case in their code. If they were simply passing protobuf messages back and forth rather than using an RPC obfuscation they don’t fully understand, this bug would have been much easier to find, and probably would never have occurred at all. The weird thing is that RPC generally doesn’t even reduce the LOC count versus simple message passing: you still have a caller and a receiver, and you have to deal with async just the same…
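
          To make the LOC point concrete, here is a toy illustration (nothing to do with the talk, and no real gRPC or protobuf involved): explicit message passing between a caller and a receiver, plus a stub-style wrapper over the exact same plumbing. The wrapper is what an RPC layer hands you, and it isn’t meaningfully shorter than talking to the queues directly.

          ```python
          # Toy sketch: the same caller/receiver pair, once as explicit message
          # passing and once hidden behind an RPC-style stub. (A real system would
          # also need request/response correlation, timeouts, and serialization.)
          import queue
          import threading

          requests: queue.Queue = queue.Queue()
          replies: queue.Queue = queue.Queue()

          def receiver() -> None:
              while True:
                  msg = requests.get()            # receive a request message
                  if msg is None:                 # shutdown sentinel
                      break
                  replies.put({"echo": msg})      # send back a reply message

          def rpc_echo(payload):
              """Stub-style wrapper: hides the queues, but the caller, the
              receiver, and the waiting are all still there underneath."""
              requests.put(payload)
              return replies.get()

          threading.Thread(target=receiver, daemon=True).start()
          print(rpc_echo({"ping": 1}))            # -> {'echo': {'ping': 1}}
          requests.put(None)                      # stop the receiver
          ```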

          1. 1

            I don’t think the higher-level model is the problem in this case; rather, it’s a combination of a couple of things:

            • gRPC includes fairly complicated service discovery and failure recovery machinery; they fixed their problem by reconfiguring it. A message-passing system would still need service discovery and failure recovery, so it could still cause overloads in the same way.

            • Their orchestration around restarts is ungraceful: they tear down the old service before traffic has moved over to the new service, and they rely heavily on their ingress layer to hold requests while the pods restart. They could also have solved the problem by starting the new pods first, reconfiguring service discovery, waiting for traffic to migrate, and finally tearing down the old pods. (A rough gRPC-side sketch of the graceful-drain idea follows this list.)
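
            As a rough illustration of the graceful-drain side of that second point (a sketch under assumed defaults, not the setup from the talk), a Python gRPC server can cap connection age so clients periodically reconnect and re-resolve their backends, and can drain in-flight RPCs on SIGTERM instead of dropping them:

            ```python
            # Sketch only: a gRPC Python server that (a) caps connection age so
            # clients periodically reconnect and re-resolve, and (b) drains
            # in-flight RPCs on SIGTERM instead of dropping them. Servicer
            # registration is omitted to keep the sketch short.
            import signal
            import threading
            from concurrent import futures

            import grpc

            server = grpc.server(
                futures.ThreadPoolExecutor(max_workers=10),
                options=[
                    ("grpc.max_connection_age_ms", 60_000),        # force periodic client reconnects
                    ("grpc.max_connection_age_grace_ms", 10_000),  # let in-flight RPCs finish first
                ],
            )
            server.add_insecure_port("[::]:50051")
            server.start()

            stop = threading.Event()
            signal.signal(signal.SIGTERM, lambda *_: stop.set())
            stop.wait()

            # Refuse new RPCs, give existing ones up to 15 seconds to finish, then exit.
            server.stop(grace=15).wait()
            ```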

            (There is, in fact, a small DNS issue, in that gRPC ignores the DNS TTL and has its own configurable DNS cache refresh timer; in this case the gRPC timer was longer than the DNS TTL, which made restarts slower and exacerbated the problem.)
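
            For what it’s worth, that timer and the load-balancing behavior are both plain channel options on a Python gRPC client. A minimal sketch (the target name and the values are placeholders, not their actual fix):

            ```python
            # Sketch only: client-side channel options that affect how gRPC
            # discovers backends and spreads load across them. The target name
            # and the numbers are made up for illustration.
            import json

            import grpc

            service_config = {
                # Spread RPCs across every resolved address instead of pinning to one.
                "loadBalancingConfig": [{"round_robin": {}}],
            }

            channel = grpc.insecure_channel(
                "dns:///my-service.default.svc.cluster.local:50051",  # hypothetical target
                options=[
                    ("grpc.service_config", json.dumps(service_config)),
                    # gRPC re-resolves DNS on its own schedule rather than honoring
                    # the record's TTL; this arg is the floor on how often it will
                    # re-resolve.
                    ("grpc.dns_min_time_between_resolutions_ms", 5_000),
                ],
            )
            ```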