1. 30
  1. 11

    Something like “this SSL cert is about to expire in a week” would, prior to Let’s Encrypt, make sense as a thing that automatically files a ticket which must be acknowledged and marked complete. “This RAID set is down one disk” is another, less dated example.

    Maybe a viable solution is to replace low-priority alerts with automatically generated tickets. The goal would be to make them way more inconvenient, so there would be more pressure to turn off their sources rather than letting a deluge go through ignored.
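
    A minimal sketch of that routing in Alertmanager terms, assuming a hypothetical ticket-bridge service that turns webhook posts into tickets:

    ```yaml
    # Sketch only: route anything labelled severity=ticket to a ticketing
    # webhook instead of a pager. ticket-bridge.internal is an assumption,
    # i.e. some service that opens a ticket per webhook payload.
    route:
      receiver: pager
      routes:
        - matchers:
            - severity="ticket"
          receiver: ticket-system
          # Re-notify daily so an open ticket stays inconvenient until
          # someone turns off its source.
          repeat_interval: 24h

    receivers:
      - name: pager
        pagerduty_configs:
          - routing_key: '<pagerduty-key>'
      - name: ticket-system
        webhook_configs:
          - url: 'http://ticket-bridge.internal:9097/alerts'
    ```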

    1. 2

      > Maybe a viable solution is to replace low-priority alerts with automatically generated tickets.

      This is indeed exactly what was done at GitHub while I was there. The opened issues would CC relevant teams (which were defined in the same place as the condition that caused the alert), and closed themselves automatically once the underlying condition was no longer marked as failing.

      It worked okay, but a lot would just get ignored as noise still.

      1. 6

        Closing themselves automatically sounds like it would remove the pressure to quench spurious ticket sources. :/

        I guess I’d be fine with tickets automatically closing but only if they’re guaranteed to need human interaction by definition - e.g. if the ticket is “RAID set degraded” then I know it’s not going to get un-degraded unless someone shows up with a spare disk to solve it, so that one’s okay.

        My worry is that a thing like “median site response >35ms” would easily come and go by itself and train operators to ignore it. :(
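
        For what it’s worth, Prometheus’s `for:` clause is one partial mitigation for exactly that kind of flapping: the expression has to stay true for the whole window before the alert fires at all. A sketch, against a hypothetical latency histogram:

        ```yaml
        groups:
          - name: latency
            rules:
              - alert: MedianResponseTimeHigh
                # The metric name is a placeholder for whatever latency
                # histogram you actually record.
                expr: histogram_quantile(0.5, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.035
                # Must hold for 30 straight minutes; a blip that comes
                # and goes by itself never fires at all.
                for: 30m
                labels:
                  severity: ticket
        ```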

        1. 3

          > My worry is that a thing like “median site response >35ms” would easily come and go by itself and train operators to ignore it. :(

          That is pretty much 100% what happened, yeah. I suspect autoclose was introduced because it got too noisy at some point and competing priorities got in the way of trying to resolve things at their root.

          1. 6

            That’s what almost always happens. My view is that if auto-closing an alert is acceptable in practice, then that alert shouldn’t have fired in the first place.

            I wrote up some thoughts on alert resolution a few years back: https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

            1. 2

              For the “this SSL cert is about to expire in a week” case, do you think it’s a good idea to use a prometheus rule and alertmanager and push it to a ticket system?

              1. 2

                That seems reasonable to me, presuming that it won’t take you a whole week to get a new cert.

                When you’ve only a day of wiggle room left, a page would probably be appropriate.
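
                Roughly, with the blackbox exporter’s probe_ssl_earliest_cert_expiry metric (the severity values are just whatever your Alertmanager routing expects):

                ```yaml
                groups:
                  - name: ssl-expiry
                    rules:
                      # A week out: low urgency, route to the ticket queue.
                      - alert: CertExpiresWithinWeek
                        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 7
                        labels:
                          severity: ticket
                        annotations:
                          summary: 'Cert for {{ $labels.instance }} expires in under 7 days'
                      # Only a day of wiggle room left: page someone.
                      - alert: CertExpiresWithinDay
                        expr: probe_ssl_earliest_cert_expiry - time() < 86400
                        labels:
                          severity: page
                ```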

                1. 2

                  Thanks!

      2. 2

        This is basically how it works at Amazon. A ticket is always created for an issue: severities 1 and 2 page you, 2.5 pages you during work hours, and 3-5 never page you. The oncall is generally expected to work on the tickets in the queue while oncall, though depending on the team the backlog stuff may get ignored. The thing is that tickets generally don’t auto-resolve; they usually require at least one human interaction.

        At my current job we use Opsgenie, and sometimes I get paged by a blip in the metrics. Since there’s no ticket generated and it usually doesn’t happen at a convenient time, I don’t have the same tendency to follow up on issues unless they’re bad.
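
        That 2.5 tier maps fairly directly onto Alertmanager’s time intervals, for what it’s worth; a fragment, with assumed hours and label values:

        ```yaml
        # Fragment: assumes the pager and ticket receivers are defined
        # elsewhere, and that alerts carry Amazon-style severity labels.
        time_intervals:
          - name: work-hours
            time_intervals:
              - weekdays: ['monday:friday']
                times:
                  - start_time: '09:00'
                    end_time: '17:00'

        route:
          receiver: ticket-system
          routes:
            - matchers:
                - severity="2.5"
              receiver: pager
              # Outside work hours the page is held, and is delivered when
              # the interval next becomes active if the alert still fires.
              active_time_intervals:
                - work-hours
        ```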

        1. 5

          > 2.5

          that must have been a fun meeting. “we’re out of numbers.” “let’s go for 2.5!”

          1. 2

            Shame they didn’t go for 2.71828… instead. Then they’d have alert levels 1, 2, e, 3, 4 and 5. Tremendously confusing and borderline useless, but it would never have stopped being funny.

      3. 7

        Low-priority and unnecessary alerts have been repeatedly demonstrated to be harmful in almost every environment - why would tech be any better? They’ve (indirectly) caused airline crashes and industrial accidents, including deaths.

        You get things like dialog fatigue, subconscious dismissal, and people actively disabling all alerts (as there is seldom an “important alerts only” option).

        Alerts should be contextually aware: if I’m not working directly on server things, then telling me about cert issues a week ahead of time is useless and encourages me to disable cert warnings altogether.

        1. 4

          To recall an SRE slogan, “alerts should always be actionable.”

          1. 2

            We do alert review every week. Among other things, we delete any alert for which we can’t think of an actual action to take within a few minutes of discussion.