1. 35
  1.  

  2. 15

    Interesting takeaway: if you go several years without rebooting, when you do it will take longer than expected. Reboot early and often!

    1. 5

      A lot of people will stop debugging a distributed system as soon as it’s up and running again. Once stability is reached, I like to review the forensic information available and then trigger another failover, to make sure it works correctly. If it doesn’t work, the system was never fixed to begin with, and at least there are people awake and ready to intervene immediately, rather than finding out by being woken up.

    2. 9

      Takeaways for me:

      • Even if you have scripts and checklists, human errors will still happen, as in this case.
      • To prevent human errors, tools have to be made more complex (e.g. here they talk about introducing the notion of minimum capacity and rate limiting on changes).
      • Maybe it’s necessary to spell out or visualise the expected result of each command or action (e.g. “Servers to be taken offline: 12 in the billing subsystem, 3500 in the index subsystem. Proceed?”)? Perhaps if the person got enough feedback about the number/type of servers that would be removed, they would have backtracked? I know people ignore a lot of feedback, but hopefully not in case of operations like this.
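
A rough sketch of what that kind of pre-flight summary plus a minimum-capacity check could look like. Everything here is hypothetical: the `Subsystem` fields, the `min_capacity` floor, and the function names are made up for illustration and are not taken from the tool described in the postmortem.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    total: int         # servers currently in service (illustrative value)
    min_capacity: int  # hypothetical floor the tool refuses to go below


def summarize_removal(plan, subsystems):
    """Describe a removal plan {subsystem name: servers to remove}.

    Returns (summary_lines, violations), where violations lists the
    subsystems that would drop below their minimum capacity.
    """
    lines, violations = [], []
    for name, count in plan.items():
        sub = subsystems[name]
        remaining = sub.total - count
        pct = 100 * count / sub.total
        lines.append(
            f"{name}: remove {count} of {sub.total} ({pct:.0f}%), {remaining} remain"
        )
        if remaining < sub.min_capacity:
            violations.append(name)
    return lines, violations


def confirm_removal(plan, subsystems, ask=input):
    """Print the expected result and only then ask the operator to proceed."""
    lines, violations = summarize_removal(plan, subsystems)
    print("Servers to be taken offline:")
    for line in lines:
        print("  " + line)
    if violations:
        # Hard stop instead of a prompt: the operator cannot confirm past it.
        print("Refusing: below minimum capacity for " + ", ".join(violations))
        return False
    return ask("Proceed? [y/N] ").strip().lower() == "y"
```

Calling `confirm_removal({"billing": 12, "index": 3500}, subsystems)` with a made-up fleet would print one line per subsystem before anything is touched, and would refuse outright if any subsystem ended up below its floor. Showing the count next to the total is deliberate, since a bare number says little without knowing how big the subsystem is.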
      1. 2

        The only issue I have with the feedback is that the numbers might not mean much to some people. I’ve been on my team for two years now and big numbers don’t scare or impress me as much as when I started working. Yet, when I tell some of my colleagues the number of devices in my fleet, their jaws drop. I guess I mean 12 servers in billing could be a lot, or none at all, I’m not sure. Maybe it’s a blip, or maybe that’s all of production, or maybe it’s enough to cause network partitioning or to disestablish a quorum.

        Moreover, you’re right that feedback is often ignored, but usually because feedback becomes a habit. Instead, elect to only provide feedback on bad input or input that will cause issues,