1. 9
  1.  

    1. 8

      As an engineering leader ultimately responsible for site-up, you need to be able to succinctly communicate incident impact to sales, support, marketing, legal, and other executive leaders. This typically necessitates some formulation of severity. A useful severity summary should track how many (and, if possible, which) customers are impacted, which parts of the service are unavailable, whether data has been lost, and whether data has been leaked.
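
      For concreteness, a minimal sketch of what such a summary could capture as a data structure (the field names are mine, not from the thread):

      ```python
      # Hypothetical severity summary for communicating incident impact
      # outside engineering (sales, support, legal, execs).
      from dataclasses import dataclass, field

      @dataclass
      class SeveritySummary:
          customers_impacted: int            # how many customers are affected
          known_customers: list = field(default_factory=list)       # which ones, if known
          unavailable_services: list = field(default_factory=list)  # parts of the service that are down
          data_lost: bool = False
          data_leaked: bool = False
      ```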

      If you’re in control of your company’s incident response policy, consider whether a severity scale is going to be helpful to responders in the moment… [W]ould you rather use a system that asks “How bad is it?,” or one that asks “What’s our plan?”

      Not everything an engineering team does is strictly for itself. You need to be able to tell the rest of the company “how bad it is” so that they can inform customers (and sometimes regulators and lawyers).

      1. 2

        The first thing I do when I get paged is assess “how bad is it”. I can’t imagine working through an incident without some idea of the severity. It informs not only the investigation (a wide-scale issue suggests different likely causes than a one-off issue) but also what actions to take (does it require immediate but messy workarounds, or can it wait for a more careful fix?).

        1. 2

          My last decade has been in security and fintech; lots of time with lawyers and regulators.

          I’ve often seen teams avoid raising incidents because the process had been co-opted as an organisational reporting tool. IMHO, telling the rest of the company “something bad happened” is low-context and should be detached from how a team itself learns “something happened, maybe good, maybe bad, maybe a near miss.”

          See:

        2. 3

          There’s certainly a lot to be said about the value of incident severity; however, the title leads with a strong stance that severities are a waste of time, and then the article proposes another subjective scale, one based on complexity instead of impact.

          I like John Allspaw’s dive into this topic, where they spoke to a bunch of companies about what incident severity meant to them and how it was being used: https://www.adaptivecapacitylabs.com/2019/05/20/the-negotiability-of-severity-levels/

          1. 1

            I love how this addresses the behavior that an incident and its severity can drive! I alluded to this in sibling comments, but this article is a rich exploration of the topic.

            Please submit it as a story! 🙏🏾

          2. 3

            There are really only two severities: page someone, or don’t page someone. Fix it now, or fix it soon. 24/7 or 8/5. All attempts at more granular classification (at least during an incident) are a misuse of time.

            1. 2

              I agree with that part, but I don’t think the scale ends at “page someone”. At least in larger teams, there are one or more steps of escalation beyond paging someone.

              1. 1

                Yeah, but that escalation is usually done by the person paged. The way I read the article, they’re arguing that those documents should be written for the handler. Severity doesn’t map well to complexity.

              2. 2

                I like to call that “urgency” because it’s affected by non-technical factors. You might have a small cosmetic bug but it’s urgent if it impacts an important sales call or a strategic customer.

                1. 1

                  Ahh, the Eisenhower Method:

                  |           | Notify now (page)            | Notify later (log)              |
                  |-----------|------------------------------|---------------------------------|
                  | Fix now   | Revenue is zero              | System is almost over-capacity  |
                  | Fix later | Revenue is lower than normal | System is almost under-capacity |
                  

                  For incidents, IMHO, “who and when to page” is crucial. That is, we’re talking about a decision which an entity makes for itself. But there are many entities inside an organisation, and they’ll make different decisions, e.g. the point at which remaining capacity on a non-critical reporting database becomes important differs between whoever owns the ETLs, the infrastructure, and the reports.
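
                  A rough sketch of that quadrant logic as code (purely illustrative; each team would plug in its own definitions of “fix now” and “notify now”):

                  ```python
                  # Hypothetical Eisenhower-style triage: two independent questions, four outcomes.
                  def triage(fix_now: bool, notify_now: bool) -> str:
                      if fix_now and notify_now:
                          return "page on-call and start mitigation"      # e.g. revenue is zero
                      if fix_now:
                          return "start mitigation, log for the morning"  # e.g. almost over-capacity
                      if notify_now:
                          return "page on-call, schedule the fix"         # e.g. revenue lower than normal
                      return "log it and fix it in working hours"         # e.g. almost under-capacity
                  ```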

                2. 3

                  I like the idea of challenging norms, but it’s very smart to assess how critical something is before working on it.

                  Yes, it consumes time, but engineering resources are typically limited (even more so after hours ;)).

                  Triage that can be done by less qualified automation, or even people, definitely aids in mobilising an appropriate effort.

                  Your “SEV” levels just need simple definitions:

                  Blackout vs Brownout; User-facing/External vs Company-facing/Internal. Your issue severities should be a combination of those. Though this assumes your internal systems can survive a night or five of being unavailable: that database backup that’s been failing for 4 days could suddenly become incredibly important.
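
                  As a sketch, those two dimensions give you a four-entry lookup with no judgement calls (the level numbers here are made up):

                  ```python
                  # Hypothetical SEV lookup built from two objective dimensions.
                  SEV_MATRIX = {
                      # (outage type, audience): severity
                      ("blackout", "external"): "SEV-1",
                      ("brownout", "external"): "SEV-2",
                      ("blackout", "internal"): "SEV-3",
                      ("brownout", "internal"): "SEV-4",
                  }

                  def severity(outage_type: str, audience: str) -> str:
                      return SEV_MATRIX[(outage_type, audience)]
                  ```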

                  1. 3

                    What about severities that are a lookup of t-shirt-sized levels of bad?

                    First digit: how broad the care group is. IT, engineering, company, some customers, most customers, all customers.

                    Second digit: data-integrity risk. None, low, high, already lost some. Maybe this should actually be the risk of altered performance after this incident.

                    Third digit: how long it can stay stable before the first digits get worse. A month, a day, an hour, it can’t get worse.

                    This would be a lot for an engineering team to manage, but not for an incident manager who’s primarily communicating to others. And it also lets others see: we have a SEV-1, but will I care about it tomorrow?

                    Color code each column for maximum communication.
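
                    A sketch of that three-digit lookup, assuming the scales described above (the indices and formatting are my own invention):

                    ```python
                    # Hypothetical three-digit severity code:
                    # breadth of care group / data-integrity risk / time before it gets worse.
                    BREADTH = ["IT", "engineering", "company",
                               "some customers", "most customers", "all customers"]
                    DATA_RISK = ["none", "low", "high", "already lost some"]
                    TIME_TO_WORSE = ["a month", "a day", "an hour", "can't get worse"]

                    def severity_code(breadth: str, data_risk: str, time_to_worse: str) -> str:
                        """e.g. ('most customers', 'low', 'an hour') -> '4-1-2'"""
                        return "-".join(str(scale.index(value)) for scale, value in
                                        [(BREADTH, breadth), (DATA_RISK, data_risk),
                                         (TIME_TO_WORSE, time_to_worse)])
                    ```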

                    1. 2

                      I think that’s a pretty solid approach, actually; if your company is Disney-sized or something, I can see a policy like this working extremely well.

                      Though you’d need punitive measures in place to prevent the stealing of resources: as soon as subjectivity enters something like this, managers game it to get more attention than they should and pad their numbers.

                  2. 1

                    Which incident should get the resources? A simple high-impact incident or a complex low-impact incident?

                    The impact is all that matters. If there is a system that is very complex and isn’t having much impact, the next “incident” should be turning it off.