1. 6

  2. 5

    AOL keyword: Duhem-Quine

    1. 1

      Thanks for this!

    2. 2

      For an incident to happen, multiple factors must have contributed to penetrate those layers of defenses that have evolved. I say that with confidence, because if a single event could take your system down, then it never would have made it this far to begin with. That’s why, when you dig into an incident, you’ll always find those multiple contributors.

      I don’t buy this. My system can go down when AWS has an outage. From my perspective it’s a monocausal event. It doesn’t happen often, and it doesn’t seem to be hindering company growth. You can have monocausal events, and they’re not death knells for a business.

      1. 2

        When your system goes down with AWS, the obvious second cause is that you have no fallback system outside of AWS. Not that it helps anyone if it’s impractical to fix, but technically it’s there.

        1. 1

          Yeah, that argument occurred to me, and I was going to mention it in my original comment. If there was a mitigation in place that failed, then I would think of it as a part of the cause of the system outage event. But I wouldn’t consider things that don’t exist as causes of an event. So I do think it’s possible to have monocausal incidents.

          1. 3

            Absence of a safeguard against something is the ultimate case of safeguard failure. In situations where having a safeguard against a known failure mode is standard, neglecting to implement that safeguard at all would definitely be considered a cause.

            1. 1

              I don’t think people tend to think of things that didn’t happen as causing an event, but maybe they should in these kinds of situations. We might get fewer failures :)

              Going back to the blog post, the argument (which I think is not that clearly expressed, tbh) actually seems to be that successful organisations can only experience multi-causal incidents because if monocausal incidents were possible they would happen all the time, and the organisation wouldn’t survive, and therefore they must all be multi-causal. It seems like circular reasoning, and it also seems to assume that

              1. If monocausal incidents were possible there would therefore necessarily be a lot of them. I’m not sure this is true.
              2. All incidents are extremely severe, potentially life-threatening to the organisation.

              1. 1

                That part I definitely agree with.

                Most incidents are not nearly significant enough to justify fixing the “phantom” second cause, since the exact same situation may never happen again, or the impact is so low that adding a safeguard just makes no sense.

            2. 1

              Something that doesn’t exist is the cause of an event, because you

              • did not think about it (why not)
              • chose not to implement it (why did you make that choice)

              Ultimately constraints in e.g. money, time, knowledge or similar fundamental things are root causes.

              To take an extreme example: if someone dies of hunger, would you say the ‘absence of food’ is not something you can reasonably call a ‘cause’ of the event?