1. 5
  1. 5

    I don’t think anything can teach you the value of operability (stuff like good logs, metrics, dashboards, exception tracking…) better than being on call. When something is going wrong, you need insight into what is happening in the system.

    1. 1

      Inspiration to find a better position. Most things on this list I can learn about without being on call. A common thing in safety-critical field to learn how to handle what shows up during emergencies is to study what already has plus the response and fix. Lots of write-ups on stuff like this out there. I do respect the author doing a positive spin on it if only to show potential benefits of the position and maybe give people in it an optimistic way of looking at the headaches they’ll endure. As noted in the article, doing it for a while can also make some people stronger or better able to handle emergencies.

      1. 7

        You do have to go on call to learn the most important lesson though:

        “Being responsible for my programs’ operations makes me a better developer”

        The output of a developer directly contributes to the the quality of life of the poor soul who has to maintain it.

        Sysadmin’s often are not empowered, skillset wise or with the authority, to make their lives better by changing the software so all they can do open some godawful Jira ticket and live at the bottom of someones sprint queue.

        Developer’s tend to think perfection is a unit test, but their priorities somewhat change after they have had to babysit the fruits of their labour for a few weeks :-)

        The fact is, no one cannot really understand the problems you cause others till you directly experience them yourself. It changes you and stops you siding with “well Brian is a sysadmin, so of course he is always grumpy” and thinking he might be exaggerating.

        1. 1

          The fact is, no one cannot really understand the problems you cause others till you directly experience them yourself.

          I understand most of them by reading them. Then, when I experienced them, it was much like what I read about. This works so long as your prior experience has something similar enough to draw on, esp on feelings. For instance, a person whose jobs have been pretty casual with steady hours will probably get the impact you describe when losing sleep, fighting against the clock, and so on. Whereas, someone who had already lost sleep or fought against the clock would be able to empathize with what a writer is going through just reading their disaster response report.

          Now, whether they then care enough to prevent that when programming is another matter entirely. The developer might go all out on that, half-ass it, or change positions to where they can ignore it.

      2. 1

        One thing I found helpful was an SRE-ism I’ll paraphrase here: “you should only be paged for things that require human-level intelligence.” The system should handle as many ‘routine’ tasks as possible, and only notify you when the answer isn’t clear-cut. If your response to a particular page is always “click this button,” that’s a prime candidate for automation.

        There are two implications to worry about here: suddenly all pages become hard and if pages become rare enough the humans will get out of practice.