Interesting takeaway: if you go several years without rebooting, when you do it will longer than expected. Reboot early and often!
A lot of people will stop debugging a distributed system as soon as it’s up and running again. Once stability is reached, I like to review the forensic information available and then trigger another failover, to make sure it works correctly. If it doesn’t work, the system was never fixed to begin with, and at least there are people awake and ready to intervene immediately, rather than finding out by being woken up.
Takeaways for me:
The only issue I have with the feedback is that the numbers might not mean much to some people. I’ve been on my team for two years now and big numbers don’t scare or impress me as much as when I started working. Yet, when I tell some of my colleagues the number of devices in my fleet, their jaws drop. I guess I mean 12 servers in billing could be a lot, or none at all, I’m not sure. Maybe it’s a blip, or maybe that’s all of production, or maybe it’s enough to cause network partitioning or to disestablish a quorum.
Moreover, you’re right in that feedback in often ignored, but usually because feedback becomes a habit. Instead, elect to only provide feedback on bad input or input that will cause issues,