1. 9
  1. 3

    I love this kind of post.

    *As for the underlying cause of the incident (or the “root cause” if you insist on using such language), that has to be the fact that our assumptions as teams or individuals are ultimately formed by our past experiences.*

    I didn’t know “root cause” is deprecated. Here’s some links I tracked down about it:



    1. 4

      Thanks! Yeah, I’m not a big fan of “root cause analysis” and “the five whys” as post-incident review processes. A frank and blameless discussion is easily the best way to understand failure, assuming your team has the necessary levels of psychological safety to allow such conversations to happen.

      1. 1

        In my experience, those are usually useful with third parties that don’t have the same culture.

        For example, your hosting provider is down for several hours without HA kicking in (hopefully outside of business hours), and the explanation you get is “it was a network issue”.

        In this case, you need those processes to get to the bottom of things.

    2. 2


      The Prometheus documentation suggests that exporters should do the work of gathering metrics on every scrape, so the way the query exporter works would be in line with ‘best practice’. However, there is a caveat: if the metrics are expensive to gather, the exporter should do that work separately and cache the results, serving only the cached values on scrape. This case would likely fall into the ‘expensive’ camp, although that might not be obvious if you’re testing on smaller databases.

      1. 3

        I’ve seen that in the docs and I think it’s a bad idea: you should never create an exporter that could by itself, with the default Prometheus setup, hug your application to death. Or, if you really do follow that guidance, make sure you cache the results internally to avoid this.

        1. 2

          Yes, you’re right. It doesn’t help that it’s all closed source as well.
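
For anyone wondering what the "cache it internally" approach from the thread above might look like, here is a minimal sketch in plain Python. It is not the query exporter's code (which isn't available) and it doesn't use any Prometheus client library; `CachedMetrics`, `collect`, and `interval_s` are hypothetical names made up for illustration. The idea is just that the expensive query runs on a timer in the background, and a scrape only ever reads the last cached result.

```python
import threading
from typing import Callable, Dict


class CachedMetrics:
    """Runs an expensive collector on a timer; scrapes only read the cache.

    Hypothetical helper for illustration, not part of any Prometheus library.
    """

    def __init__(self, collect: Callable[[], Dict[str, float]], interval_s: float):
        self._collect = collect
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._cache: Dict[str, float] = {}
        self._stop = threading.Event()

    def refresh(self) -> None:
        # The expensive work (e.g. a heavy database query) happens here,
        # off the scrape path.
        values = self._collect()
        with self._lock:
            self._cache = values

    def start(self) -> None:
        # Refresh in a daemon thread until stop() is called.
        def loop() -> None:
            while not self._stop.is_set():
                self.refresh()
                self._stop.wait(self._interval_s)

        threading.Thread(target=loop, daemon=True).start()

    def stop(self) -> None:
        self._stop.set()

    def render(self) -> str:
        # Called on every scrape: copies the cache and formats it as
        # "name value" lines; it never calls the collector.
        with self._lock:
            snapshot = dict(self._cache)
        return "".join(f"{name} {value}\n" for name, value in sorted(snapshot.items()))
```

With something like this, the scrape handler would return `render()`, so scraping more often (or from several Prometheus servers) wouldn't add load on the database. The trade-off is that values can be up to `interval_s` seconds stale.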