1. 44
  1.  

  2. 9

    I’m also going to argue that focusing on the SEV count pushes you towards an MTBF (mean time between failures) mindset. I think historically we’ve seen this mindset encourage deploying really complicated things that, on paper, look like they have fault tolerance.

    Remember the source control company with the really complicated giant SQL database with automated failover? The failover tripped when it shouldn’t have and turned what should have been an hour-or-two outage into a multi-day one.

    If you focus on the impact metrics instead, you can get a healthier mindset going. For example, if you want the “people woken up in the middle of the night” number to go down, you can do that by hiring ops people in multiple time zones and giving them tools that let them mitigate problems.

    1. 8

      I think this article is definitely on the right track but I also think some of the prose needs sharpening.

      To be more specific, counting outages is actually SUPER important. You can’t know where your system or service is going wrong, or where to look for the land mines, unless you collect seething gobs of fresh, wriggling data.

      So, the counting isn’t the problem, it’s using the metrics you gather on outages as effective punishment for teams or employees that kills morale and creates all kinds of unintended side effects, like making it attractive for people either not to report accurately or to lie about the severity of outages.

      This gets to the real meat of the problem: creating a healthy culture around assuming best intentions and bringing science to bear on learning from mistakes as a team and as an institution, rather than at the individual level.

      1. 17

        So, the counting isn’t the problem, it’s using the metrics you gather on outages as effective punishment for teams or employees that kills morale and…

        I strongly disagree.

        It’s human nature that as soon as management starts counting something, it will start acting on it whether it means to or not. Goodhart’s Law sets in.

        RBTB suggests much better things to track in her article, namely the user-visible impact of SEVs: failed requests, latency, lost data, users having unusually bad days. These are a bit harder to game (obvs not impossible, ofc) and don’t have quite the same deleterious side effects. You could also track the direct toll on your organisation: people woken up, person-hours spent.
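        As a very rough sketch of what tracking impact instead of count might look like (the field names here are made up for illustration, not taken from the article):

        ```python
        # Sketch: summarise the impact of incidents instead of counting them.
        # All field names are hypothetical.
        from dataclasses import dataclass

        @dataclass
        class Incident:
            failed_requests: int   # user-visible requests that errored out
            people_paged: int      # humans woken up or interrupted
            person_hours: float    # time spent mitigating and cleaning up

        def impact_summary(incidents: list[Incident]) -> dict:
            return {
                "failed_requests": sum(i.failed_requests for i in incidents),
                "people_paged": sum(i.people_paged for i in incidents),
                "person_hours": sum(i.person_hours for i in incidents),
                # Deliberately no "number of incidents" key: that count is the
                # metric we're trying not to optimise for.
            }
        ```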

        One thing I think is important to consider is what happens if the bar for what counts as a SEV moves. For example, if my monitoring gets better, I push the SEV button earlier, resulting in more SEVs with lower individual and total impact on users, because problems are being found faster. This is going to make a SEV-counting metric look worse, even though actual user service got better.

        RBTB’s article explains already why the converse has bad consequences.

        1. 7

          You’re right that the hours spent on incident-handling are more important than the total number of incidents. However, I must disagree with you and the author both, by making an analogy to workers’ compensation for injuries sustained on the job. Yes, Goodhart’s Law implies that for any metric which properly correlates with the amount of compensation paid, the organization will try to minimize that metric, including distortions and inefficient practices. However, we must still try to report the metrics honestly, because it is unjust to laborers to lie about these metrics! Management should be obligated to honestly compensate injured workers, and similarly, they should also be obligated to honestly account for how toilsome their services are to maintain.

          Maybe this analogy is too on-the-nose. It is literally injurious to a person to be woken up in the middle of the night due to an outage.

          1. 5

            I don’t think that analogy makes sense. Workplace injury and getting paged for an outage are two completely different things. In many jurisdictions, employers are legally required to handle workplace injuries. Workers have rights, etc., when it comes to workplace injury. So there’s an actual legal reason for employers and employees to accurately report this stuff, even if some might choose not to. I’d argue the motivation for doing so far outweighs any motivation not to, at least in the US.

            On the other hand, no one, other than employer management, requires employers to care about fires within the company. Governments/courts don’t care about any implicit “harm” done to employees as a result of these things, except perhaps “mental health”, but it’s rarely due to any single event.

            1. 2

              Waking people up in the middle of the night materially harms them, especially if you do it repeatedly. This might not be widely recognised legally at the moment but it’s a thing that happens.

              1. 1

                Well, sure. But that’s not the point… The point is that there are different motivations for reporting recognized workplace-injury situations than there are for reporting “omg our work/project/whatever is on fire” situations. So the analogy is not a good one.

                1. 1

                  Legally it’s different right now. Morally I don’t think there’s a difference.

                  Eventually someone will win a lawsuit in which they claim that being woken up in the middle of the night repeatedly by pagers caused them cardiac distress.

                  1. 1

                    You shouldn’t have to treat this as a “legal” workplace issue. Good engineers are a scarce resource, and they can get other jobs easily. If a company misuses a scarce resource by allowing pagers to go off all the time instead of actually fixing the underlying issue, they will lose good employees and go out of business.

            2. 1

              Btw while I did get a bit defensive in the other comment about i-already-mentioned-that, I am extremely glad that other people feel the way you do. ❤️

              1. 3

                No worries. I’m not great at using my words to say, “thank you for your thoughtful contributions,” instead of a mere upvote. Still, thank you for your thoughtful contributions.

              2. 1

                This is why I was advocating for tracking how many times you wake a person up, too.

                (Briefly in the comment you replied to, and at a little more length in the other comment.)

                FWIW I have personally burned out hard from a previous job at which I was the only person on-call for more than a year. I happen to intensely hate Australian government DNS server admins (specifically). :P

            3. 3

              If you’ve got actual service outages all the time, your business is not ready for prime time.

              If you’ve got repeated performance degradation, you have problems to solve. Counting incidents is the least interesting metric, and not one which will help you solve the problem.

              Knowing what time they occur might be interesting. Is there a correlation?

              Knowing how long they last might be interesting, or it might not. If they solve themselves, it’s definitely interesting.

              Knowing what actually happened: that’s the information you need. What do you need to be more resilient against? Is your load balancing skewed? Is your database slow? Does your application ask for the same data repeatedly and throw away different parts each time? Are you being attacked? Is a customer making weird requests that are effectively an unintentional attack?

              1. 1

                Knowing what actually happened: that’s the information you need. What do you need to be more resilient against? Is your load balancing skewed? Is your database slow? Does your application ask for the same data repeatedly and throw away different parts each time? Are you being attacked? Is a customer making weird requests that are effectively an unintentional attack?

                Precisely. You MUST capture a fairly rich data set in order to be able to intelligently answer these questions, and again, your company/team/institution has to prioritize asking them and encoding those learnings into improved process or automation to keep the same problems from happening again.

                Amazon’s culture has the notion of the COE (Correction of Error): a document that asks the writer to answer a series of very incisive questions, which help you collect the most important data and use that data to drive change.

                There are myriad variations on this theme and as long as they’re the same basic shape you’re doing it right IMO.
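                For concreteness, the basic shape might look something like the record below; the field names are illustrative guesses, not Amazon’s actual COE template.

                ```python
                # Rough sketch of a structured incident record; the fields are
                # illustrative, not a real COE template.
                from dataclasses import dataclass, field

                @dataclass
                class IncidentRecord:
                    started_at: str      # when the problem actually began
                    detected_at: str     # when monitoring (or a human) noticed it
                    resolved_at: str     # when user impact ended
                    what_happened: str   # plain-language narrative of the failure
                    contributing_factors: list[str] = field(default_factory=list)  # e.g. skewed load balancing, slow DB, attack
                    action_items: list[str] = field(default_factory=list)          # process/automation changes to prevent recurrence
                ```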

                1. 7

                  Nobody’s disagreeing with that. What Rachel is saying is “don’t maintain a count of incidents” or anything similar. That particular metric is anti-useful, even if anybody with six brain cells to rub together can recover the data by counting the incident reports.

            4. 1

              Nitpicky, but please don’t use mean/sigma to find outliers in your metrics. Use a robust indicator like median/MAD or, if your distribution is asymmetric (and it probably is), look at a histogram.
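              For example, a quick median/MAD version using the usual modified z-score (the 3.5 cutoff is just a common default, not gospel):

              ```python
              # Robust outlier detection with median/MAD instead of mean/sigma.
              import numpy as np

              def mad_outliers(values, threshold=3.5):
                  """Flag points whose modified z-score exceeds the threshold."""
                  x = np.asarray(values, dtype=float)
                  med = np.median(x)
                  mad = np.median(np.abs(x - med))
                  if mad == 0:
                      return np.zeros_like(x, dtype=bool)   # no spread at all
                  modified_z = 0.6745 * (x - med) / mad     # 0.6745 rescales MAD to sigma for normal data
                  return np.abs(modified_z) > threshold
              ```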