1. 21

  2. 3

    I looked at that and went “oh, that’s weird, must be a graphite bug”, and continued on.

    That’s an odd perspective…

    1. 7

      Graphite is flaky as hell, in my experience. I have some charts in production right now that are very wonky: they look like randomly scattered pixels. The same data goes to New Relic and the graphs there look fine.

      1. 4

        Yeah. I was in the middle of other things at the time (as always), and just glanced over it. I’ve had metrics sit at 0 latency before for other reasons (often in the metrics stack), and it was a dumb mistake ;) That’s most of the point of the post though: missing simple things like that can lead to nasty problems down the line. These days I keep a list in a text file under “incuriosity killed the infrastructure”, and try to write things down in there if I notice something even slightly odd.

        1. 2

          It’s extremely healthy to question the accuracy of your time-series data, and try to prove it wrong. 0 values are not indicative of anything other than the fact that the front-end has not been able to draw any pixels higher than that. This may be because the backing DB is having transient issues with write volume. This may be because you value reliability of your metrics over best-case latency and your machines buffer outbound requests in a local persistent queue before forwarding them to the metrics system, and your implementation adds a bit of latency. Maybe you have an aggregation layer before it hits the metrics system, and your aggregation window isn’t up yet.

          It’s not uncommon for users of metrics systems to learn to ignore the leading edge. How long that edge lasts varies a ton from one setup to another. I guess it becomes “an odd perspective” when it’s been 0 for longer than the window you’ve learned to discount, and the author definitely discounts the system pretty heavily.

          1. 1

            Yeah, the story seemed mostly reasonable until he ignored the flat 0 successful hits to the database.