This is what my teammates have been working on for the past few months, and it’s really neat. The short pitch is that it lets you aggregate statsd metrics with an agent running on localhost (so you have no UDP packet drops), while still merging timer histograms meaningfully in a global context: you can answer questions like “what is the p90 latency of an API method across the entire fleet of API servers?”
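To make the “merge meaningfully” part concrete: you can’t just average per-host percentiles; you have to combine the underlying distributions and take the percentile of the whole. A toy Go illustration (plain sorted samples stand in for real mergeable histograms; the function names are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// p90 returns the 90th percentile of a sorted sample
// (nearest-rank method; good enough for illustration).
func p90(sorted []float64) float64 {
	idx := int(0.9*float64(len(sorted))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

// globalP90 merges per-host samples and takes the p90 of the combined
// distribution -- what a mergeable histogram gives you without having
// to ship every raw sample to one place.
func globalP90(hosts ...[]float64) float64 {
	var all []float64
	for _, h := range hosts {
		all = append(all, h...)
	}
	sort.Float64s(all)
	return p90(all)
}

func main() {
	// Two API servers with very different latency profiles (ms).
	fast := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	slow := []float64{100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}

	// Naive: average the per-host p90s. Wrong in general.
	fmt.Println((p90(fast) + p90(slow)) / 2) // 454.5

	// Correct: merge the distributions, then take the p90.
	fmt.Println(globalP90(fast, slow)) // 800
}
```

A t-digest gives you the second computation without collecting raw samples: each host ships a fixed-size sketch and the aggregator merges them.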


    @antifuchs any reason you went with t-digests instead of HDR histograms? I wrote a biased-quantiles implementation a while back for some stuff we were doing, but at our throughput it sucked horribly due to the dynamic memory use. Are t-digests a fixed size no matter what?

    edit: it appears so. The only `make` I see is in the constructor. What advantages does this have over HDR histograms?

      That’s a great question - I asked my co-worker, who implemented the t-digest/merging feature, and this is what he had to say:

      the main reasons were basically:

      1) needed fixed memory regardless of the number of samples, since we’re shipping the whole histogram around
      2) our previous histogram library didn’t support merges and the one before that didn’t support floats; we needed both of those features (and coda’s implementation of hdrhistogram in go also does not support floats)
      3) lower error at extreme quantiles is a nice feature since those are the ones we really care about (but i admit that my survey of the field was fairly brief)
      4) i figured if we were going to write our own histogram implementation we might as well choose a fun one

      and let’s be honest here… #4 was the most important reason of all

      You can see item 3 analyzed in https://github.com/stripe/veneur/tree/master/tdigest/analysis (provided you have R installed (-:). I think the higher fidelity at the 90/95/99 end is pretty cool (also that you have 100% fidelity for the maxima and minima).
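      Reasons 1 and 2 are the structural heart of it: a digest is a bounded list of (mean, count) centroids, so adding and merging never grow memory. Here is a deliberately simplified Go sketch of that shape; it is not the actual t-digest compression (no quantile-dependent centroid sizing, so its tail accuracy is much worse), and all names are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// centroid summarizes a cluster of nearby samples as (mean, count).
type centroid struct {
	mean  float64
	count float64
}

// digest is a toy bounded sketch: at most maxCentroids centroids, no
// matter how many samples are added. Unlike a real t-digest, it always
// merges the closest pair of centroids rather than sizing them by
// quantile, so its tail accuracy is much worse.
type digest struct {
	maxCentroids int
	centroids    []centroid
}

func newDigest(max int) *digest { return &digest{maxCentroids: max} }

// Add inserts one sample and re-compresses to stay within budget.
func (d *digest) Add(x float64) {
	d.centroids = append(d.centroids, centroid{mean: x, count: 1})
	d.compress()
}

// Merge folds another digest into this one -- this is what lets every
// host ship a fixed-size sketch to a central aggregator.
func (d *digest) Merge(other *digest) {
	d.centroids = append(d.centroids, other.centroids...)
	d.compress()
}

// compress repeatedly merges the two closest centroids (weighted by
// count, so total mass is preserved) until within budget.
func (d *digest) compress() {
	sort.Slice(d.centroids, func(i, j int) bool {
		return d.centroids[i].mean < d.centroids[j].mean
	})
	for len(d.centroids) > d.maxCentroids {
		best := 0
		for i := 1; i < len(d.centroids)-1; i++ {
			if d.centroids[i+1].mean-d.centroids[i].mean <
				d.centroids[best+1].mean-d.centroids[best].mean {
				best = i
			}
		}
		a, b := d.centroids[best], d.centroids[best+1]
		n := a.count + b.count
		m := centroid{mean: (a.mean*a.count + b.mean*b.count) / n, count: n}
		d.centroids = append(d.centroids[:best],
			append([]centroid{m}, d.centroids[best+2:]...)...)
	}
}

// Quantile walks the cumulative counts and returns the mean of the
// centroid containing the q-th quantile (a crude approximation).
func (d *digest) Quantile(q float64) float64 {
	var total float64
	for _, c := range d.centroids {
		total += c.count
	}
	target := q * total
	var seen float64
	for _, c := range d.centroids {
		if seen+c.count >= target {
			return c.mean
		}
		seen += c.count
	}
	return d.centroids[len(d.centroids)-1].mean
}

func main() {
	d := newDigest(20)
	for i := 0; i < 1000; i++ {
		d.Add(float64(i))
	}
	fmt.Println(len(d.centroids)) // 20: memory stays fixed
	fmt.Println(d.Quantile(0.5))  // approximate median

	d2 := newDigest(20)
	for i := 1000; i < 2000; i++ {
		d2.Add(float64(i))
	}
	d.Merge(d2)
	fmt.Println(len(d.centroids)) // still 20 after merging
}
```

      The real t-digest keeps centroids small near q=0 and q=1, which is where the lower-error-at-extreme-quantiles property in reason 3 comes from.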

        Ah! Nice. I hadn’t considered the floating-point problem. We’re only doing histograms on ints (milliseconds), so that hasn’t been problematic. We, too, ship histograms around, and store them at 1m, 10m, and 60m (rolled up, of course). Accuracy, though, is probably the thing that suffers most for us. I don’t remember exactly, but my guess is that we’ve reduced the precision a bit to keep the size reasonable.

        For our use case, I actually advocated for doing log-scale response time buckets, but that got vetoed. Thanks for going the extra mile to share the details of the decision!
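        For what it’s worth, log-scale response time buckets are easy to sketch: memory is fixed, and the error is relative (a fixed percentage of the value) rather than absolute, which is in the same spirit as what HDR histograms do. A hypothetical Go version (the names and the 1.1 growth factor are my own; base 1.1 gives ~10% relative error and covers 1 ms to 1 hour in about 160 buckets):

```go
package main

import (
	"fmt"
	"math"
)

// growth is the ratio between consecutive bucket boundaries; every
// bucket spans ~10% of its lower bound, so the error is proportional
// to the value rather than absolute.
const growth = 1.1

// logBucket maps a latency (in ms) to its bucket index.
func logBucket(ms float64) int {
	return int(math.Floor(math.Log(ms) / math.Log(growth)))
}

// bucketBounds returns the [lo, hi) value range of a bucket index.
func bucketBounds(i int) (lo, hi float64) {
	return math.Pow(growth, float64(i)), math.Pow(growth, float64(i+1))
}

func main() {
	// 1 ms .. 1 hour fits in a fixed, modest number of buckets.
	fmt.Println(logBucket(1), logBucket(3600000)) // 0 158
	lo, hi := bucketBounds(logBucket(250))
	fmt.Printf("250ms falls in [%.1f, %.1f)\n", lo, hi)
}
```

        Merging is trivial with this layout, too: buckets with the same index just add their counts.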