This is what my teammates have been working on for the past few months, and it’s really neat. The short pitch is that it lets you aggregate statsd metrics with an agent running on localhost (so you have no UDP packet drops), while still merging timer histograms meaningfully in a global context: you can answer questions like “what is the p90 latency of an API method across the entire fleet of API servers?”
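To make the “merge meaningfully” part concrete: you can’t just average per-host percentiles; you have to combine the underlying distributions and take the percentile of the whole. A toy Go illustration (plain sorted samples stand in for real mergeable histograms; the function names are made up):

```go
package main

import (
	"fmt"
	"sort"
)

// p90 returns the 90th percentile of a sorted sample
// (nearest-rank method; good enough for illustration).
func p90(sorted []float64) float64 {
	idx := int(0.9*float64(len(sorted))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

// globalP90 merges per-host samples and takes the p90 of the combined
// distribution -- what a mergeable histogram gives you without having
// to ship every raw sample to one place.
func globalP90(hosts ...[]float64) float64 {
	var all []float64
	for _, h := range hosts {
		all = append(all, h...)
	}
	sort.Float64s(all)
	return p90(all)
}

func main() {
	// Two API servers with very different latency profiles (ms).
	fast := []float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	slow := []float64{100, 200, 300, 400, 500, 600, 700, 800, 900, 1000}

	// Naive: average the per-host p90s. Wrong in general.
	fmt.Println((p90(fast) + p90(slow)) / 2) // 454.5

	// Correct: merge the distributions, then take the p90.
	fmt.Println(globalP90(fast, slow)) // 800
}
```

A t-digest gives you the second computation without collecting raw samples: each host ships a fixed-size sketch and the aggregator merges them.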


    @antifuchs any reason you went with t-digests instead of HDR histograms? I wrote a biased-quantiles implementation a while back for some stuff we were doing, but at our throughput it sucked horribly due to the dynamic memory use. Are t-digests a fixed size no matter what?

    edit: it appears so. The only `make` I see is in the constructor. What advantages does this have over HDR histograms?

      That’s a great question - I asked my co-worker, who implemented the t-digest/merging feature, and this is what he had to say:

      the main reasons were basically:

      1) needed fixed memory regardless of the number of samples, since we’re shipping the whole histogram around
      2) our previous histogram library didn’t support merges and the one before that didn’t support floats; we needed both of those features (and coda’s implementation of hdrhistogram in go also does not support floats)
      3) lower error at extreme quantiles is a nice feature since those are the ones we really care about (but i admit that my survey of the field was fairly brief)
      4) i figured if we were going to write our own histogram implementation we might as well choose a fun one

      and let’s be honest here… #4 was the most important reason of all

      You can see item 3 analyzed in https://github.com/stripe/veneur/tree/master/tdigest/analysis (provided you have R installed (-:). I think the higher fidelity at the 90/95/99 end is pretty cool (also that you have 100% fidelity for the maxima and minima).
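      Reasons 1 and 2 are the structural heart of it: a digest is a bounded list of (mean, count) centroids, so adding and merging never grow memory. Here is a deliberately simplified Go sketch of that shape; it is not the actual t-digest compression (no quantile-dependent centroid sizing, so its tail accuracy is much worse), and all names are invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
)

// centroid summarizes a cluster of nearby samples as (mean, count).
type centroid struct {
	mean  float64
	count float64
}

// digest is a toy bounded sketch: at most maxCentroids centroids, no
// matter how many samples are added. Unlike a real t-digest, it always
// merges the closest pair of centroids rather than sizing them by
// quantile, so its tail accuracy is much worse.
type digest struct {
	maxCentroids int
	centroids    []centroid
}

func newDigest(max int) *digest { return &digest{maxCentroids: max} }

// Add inserts one sample and re-compresses to stay within budget.
func (d *digest) Add(x float64) {
	d.centroids = append(d.centroids, centroid{mean: x, count: 1})
	d.compress()
}

// Merge folds another digest into this one -- this is what lets every
// host ship a fixed-size sketch to a central aggregator.
func (d *digest) Merge(other *digest) {
	d.centroids = append(d.centroids, other.centroids...)
	d.compress()
}

// compress repeatedly merges the two closest centroids (weighted by
// count, so total mass is preserved) until within budget.
func (d *digest) compress() {
	sort.Slice(d.centroids, func(i, j int) bool {
		return d.centroids[i].mean < d.centroids[j].mean
	})
	for len(d.centroids) > d.maxCentroids {
		best := 0
		for i := 1; i < len(d.centroids)-1; i++ {
			if d.centroids[i+1].mean-d.centroids[i].mean <
				d.centroids[best+1].mean-d.centroids[best].mean {
				best = i
			}
		}
		a, b := d.centroids[best], d.centroids[best+1]
		n := a.count + b.count
		m := centroid{mean: (a.mean*a.count + b.mean*b.count) / n, count: n}
		d.centroids = append(d.centroids[:best],
			append([]centroid{m}, d.centroids[best+2:]...)...)
	}
}

// Quantile walks the cumulative counts and returns the mean of the
// centroid containing the q-th quantile (a crude approximation).
func (d *digest) Quantile(q float64) float64 {
	var total float64
	for _, c := range d.centroids {
		total += c.count
	}
	target := q * total
	var seen float64
	for _, c := range d.centroids {
		if seen+c.count >= target {
			return c.mean
		}
		seen += c.count
	}
	return d.centroids[len(d.centroids)-1].mean
}

func main() {
	d := newDigest(20)
	for i := 0; i < 1000; i++ {
		d.Add(float64(i))
	}
	fmt.Println(len(d.centroids)) // 20: memory stays fixed
	fmt.Println(d.Quantile(0.5))  // approximate median

	d2 := newDigest(20)
	for i := 1000; i < 2000; i++ {
		d2.Add(float64(i))
	}
	d.Merge(d2)
	fmt.Println(len(d.centroids)) // still 20 after merging
}
```

      The real t-digest keeps centroids small near q=0 and q=1, which is where the lower-error-at-extreme-quantiles property in reason 3 comes from.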

        Ah! Nice. I hadn’t considered the floating-point problem. We’re only doing histograms on ints (milliseconds), so that hasn’t been problematic. We, too, ship histograms around, and store them at 1m, 10m, and 60m (rolled up, of course). Accuracy, though, is probably the thing that suffers most for us. I don’t remember exactly, but my guess is that we’ve reduced the precision a bit to keep the size reasonable.

        For our use case, I actually advocated for doing log-scale response time buckets, but that got vetoed. Thanks for going the extra mile to share the details of the decision!
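        For what it’s worth, log-scale response time buckets are easy to sketch: memory is fixed, and the error is relative (a fixed percentage of the value) rather than absolute, which is in the same spirit as what HDR histograms do. A hypothetical Go version (the names and the 1.1 growth factor are my own; base 1.1 gives ~10% relative error and covers 1 ms to 1 hour in about 160 buckets):

```go
package main

import (
	"fmt"
	"math"
)

// growth is the ratio between consecutive bucket boundaries; every
// bucket spans ~10% of its lower bound, so the error is proportional
// to the value rather than absolute.
const growth = 1.1

// logBucket maps a latency (in ms) to its bucket index.
func logBucket(ms float64) int {
	return int(math.Floor(math.Log(ms) / math.Log(growth)))
}

// bucketBounds returns the [lo, hi) value range of a bucket index.
func bucketBounds(i int) (lo, hi float64) {
	return math.Pow(growth, float64(i)), math.Pow(growth, float64(i+1))
}

func main() {
	// 1 ms .. 1 hour fits in a fixed, modest number of buckets.
	fmt.Println(logBucket(1), logBucket(3600000)) // 0 158
	lo, hi := bucketBounds(logBucket(250))
	fmt.Printf("250ms falls in [%.1f, %.1f)\n", lo, hi)
}
```

        Merging is trivial with this layout, too: buckets with the same index just add their counts.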