Great article. Percentiles are often misunderstood, and on top of that most client-side libraries report decaying percentiles, which are also not what users would expect.
I find that percentiles are highly overrated when you consider the cost versus the benefit. On the client side, percentiles are relatively expensive (e.g. ~600ms under contention with Codahale compared to the ~20ms it has elsewhere). If you’re using the histogram approach, lots of bins means lots of data: for example, 1000 buckets on 100 servers once a second is going to use up most of the capacity of a single beefy Prometheus server, and Prometheus is quite efficient as monitoring systems go.
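The arithmetic behind that example can be sketched roughly as follows. The variable names are only illustrative, the figures are the ones quoted above, and how much of a server this consumes depends on the hardware.

```python
# Back-of-the-envelope sample rate for the histogram example above.
# Names are illustrative; the numbers come from the comment.
buckets_per_histogram = 1000
servers = 100
scrape_interval_seconds = 1

# Every bucket on every server produces one sample per scrape.
samples_per_second = buckets_per_histogram * servers / scrape_interval_seconds
samples_per_day = samples_per_second * 86_400

print(f"{samples_per_second:,.0f} samples/s, {samples_per_day:,.0f} samples/day")
```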
For real-time monitoring I generally advise having histograms with 10-20 bins in a handful of places in the entire system, with no further breakouts. For everything else use averages. This lets you see any oddness in latency distribution, while still being cost effective. The averages are good enough to pinpoint most performance problems, and as they’re far cheaper you can have them at every level of the stack.
The averages are good enough to pinpoint most performance problems, and as they’re far cheaper you can have them at every level of the stack.
While averages are cheaper, they lie like crazy. You really need something else to help you interpret them. For the new version of some stuff I’m doing, I’m storing:
That at least gives me some ability to understand the mean a bit better. One caveat is that I can’t make generalizations about the data (as I don’t know the distribution).
I rely on total&count, the rest share the problems discussed in the article (over what time period are they, and aggregation is not really possible).
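A minimal sketch of why total and count sidestep both problems, assuming both are exported as ever-increasing counters. `Snapshot`, `window_mean`, and `merge` are hypothetical names made up for illustration, not any monitoring system's API.

```python
from typing import NamedTuple, Optional


class Snapshot(NamedTuple):
    """Hypothetical reading of two cumulative counters at one instant."""
    total: float  # cumulative sum of observed latencies
    count: int    # cumulative number of observations


def window_mean(start: Snapshot, end: Snapshot) -> Optional[float]:
    """Mean latency of the observations made between two snapshots."""
    n = end.count - start.count
    if n <= 0:
        return None  # no new observations, or the counters were reset
    return (end.total - start.total) / n


def merge(a: Snapshot, b: Snapshot) -> Snapshot:
    """Combine counters from two servers; plain addition stays exact."""
    return Snapshot(a.total + b.total, a.count + b.count)
```

The time period is whatever pair of snapshots you pick, and aggregation across servers is just addition before dividing.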
When debugging a problem I usually don’t care too much about the distribution, I care that there’s been a 20ms increase in latency in one metric that’s matched by an 18ms increase in another metric. Averages work quite well in that case as a summary statistic as they mean I don’t have to care about the distribution.
All of these can be aggregated though (unless I’m really mistaken)!
Min and max are simple enough. Mean requires that you do a weighted average, factoring in the counts for each time frame. Variance is similar to mean here, but basically:
total = sum(f.count for f in timeFrames)
variance = 0.0
for frame in timeFrames:
    v = frame.variance
    c = frame.count
    # weight each frame's variance by its share of all samples
    variance += v * c / total
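For min, max, and mean, the merge described above might look like the following sketch. `Frame` and `merge_frames` are hypothetical names, and it assumes each frame carries its own count. Variance is left out on purpose, since it needs more than a count-weighted average whenever the frame means differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Frame:
    """Hypothetical per-time-frame summary."""
    count: int
    mean: float
    minimum: float
    maximum: float


def merge_frames(frames: Iterable[Frame]) -> Frame:
    """Merge summaries: min of mins, max of maxes, count-weighted mean."""
    frames = [f for f in frames if f.count > 0]
    if not frames:
        raise ValueError("no non-empty frames to merge")
    total = sum(f.count for f in frames)
    return Frame(
        count=total,
        mean=sum(f.mean * f.count for f in frames) / total,
        minimum=min(f.minimum for f in frames),
        maximum=max(f.maximum for f in frames),
    )
```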
Most monitoring systems will either average or choose one point, so min and max will be incorrect generally.
My stats in this area are rusty, but you seem to be assuming independence when combining the variances, which I don’t think is a safe assumption when applied to different chunks of one time series.
Advantage me, since I’m building this one. But, yes, in general, you are totally correct. Unless the system supports first class max and min, then they’ll average, or pick one, or something.
I suck at stats, so you may be completely right. But, I don’t see why you couldn’t assume they are independent, as they are two different time frames (samples) from a larger bucket of data. I actually struggled with this question a few weeks ago, but think I convinced myself that it’s OK. Would love to be shown wrong though.
I don’t see why you couldn’t assume they are independent, as they are two different time frames (samples) from a larger bucket of data.
That would imply to me that they aren’t independent as they’re from the same population.
I’d suggest recording the sum of squares of your data instead of the variance. From that, the count, and the sum, you can calculate the variance over any period.
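A minimal sketch of that suggestion, assuming population variance is what is wanted. `Moments` is a hypothetical name, and the check against `statistics.pvariance` is only there to show that the merged result matches the variance of the combined data.

```python
from dataclasses import dataclass
import statistics


@dataclass
class Moments:
    """Hypothetical accumulator: count, sum, and sum of squares."""
    count: int = 0
    total: float = 0.0     # sum of x
    total_sq: float = 0.0  # sum of x * x

    def add(self, x: float) -> None:
        self.count += 1
        self.total += x
        self.total_sq += x * x

    def merge(self, other: "Moments") -> "Moments":
        # All three fields are plain sums, so merging is exact addition.
        return Moments(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def variance(self) -> float:
        """Population variance: E[x^2] - (E[x])^2."""
        if self.count == 0:
            raise ValueError("variance of an empty sample is undefined")
        mean = self.total / self.count
        return self.total_sq / self.count - mean * mean


# Two time frames with very different means.
frame_a, frame_b = [1.0, 2.0, 3.0], [10.0, 20.0]
a, b = Moments(), Moments()
for x in frame_a:
    a.add(x)
for x in frame_b:
    b.add(x)

combined = a.merge(b)
assert abs(combined.variance() - statistics.pvariance(frame_a + frame_b)) < 1e-9
```

One caveat: this formula can lose precision when the mean is large compared to the spread, so it is a sketch rather than something to rely on for data like that.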
Hmm!!! Ok, that makes sense! Thanks for bringing this to my attention. It’s very much appreciated!
While we’re talking about statistics and outliers, could I attempt to persuade someone to implement the fast medcouple for Python?
I wrote most of that Wikipedia article in hopes of getting the medcouple into statsmodels.
I don’t understand. If you wrote the fast version in Python already, and licensed it under the GPL, then as the copyright holder you could license the same code to statsmodels under whatever license they wish or you choose. That’s your choice.
I’m guessing you are resisting because you, like me, want to keep the GPL alive, and if there’s a BSD licensed copy of your work, why would anyone contribute to the GPL?
But, asking someone to do the work again, while you are sitting on yours… well, I don’t know how I feel about it (I could go either way, honestly :) ).
My work is derived from GPL'ed code. It is not up to me to write a version of the same code under a different license. And the way copyright works, it’s probably not safe to trust me to write the same algorithm under a different license. There’s a reason why clean-room reverse engineering is done. If you were to just read someone else’s source code and try to replicate it from memory, you’re not safe.
Essentially, I did that. I studied someone else’s description of the algorithm and from that I wrote a spec. Now it is up to someone else to implement the spec I wrote. This is what we know legally works. This is also why we demand that Octave contributors do not read Matlab source code.
While I do favour the GPL, at the moment I want to get across statsmodels’ pigheadedness against copyleft.
My work is derived from GPL'ed code.
Ah. I thought you wrote a clean room version and released it under the GPL. If that were the case, you could say, “hey! I’m releasing this to statsmodels under the terms of the BSD license.” As the copyright holder, you can do that. But, if you are not the original copyright holder, then that of course doesn’t work.
Sorry for my confusion.
As an aside, there’s no legal reason (though, IANAL) to believe that you can’t write another version under a different license, just because you’ve read GPL’d code before. If that were the case, then it would be impossible for me to take another job in the same industry after leaving one. Imagine I work for company X writing monitoring solutions internally. This is proprietary code. I leave and go to work at another company Y building monitoring systems. Obviously I’m “tainted” now, but it’s my responsibility, legally, to attempt to separate myself from the previous proprietary code. I can’t use code I wrote directly at the previous job (so long as it is proprietary), but I can reimplement algorithms and techniques that I used, so long as they are not patented, or trade secrets. This gets a little fuzzy depending on the industry.
This gets a little fuzzy depending on the industry.
We may be a tad overzealous about how cautious we are with this and Octave (and that’s good, because a single lawsuit against the Mathworks could destroy us), but I, personally, don’t feel like it’s fair to refer to GPL'ed R code, reimplement it in Python or C++, and then put that into statsmodels without the GPL.
personally, don’t feel like it’s fair to refer to GPL'ed R code,
Nor do I! But, I do think it’s fair to have looked at that code previously, and without referring to it, reimplement it based on the whitepaper, or a description of the algorithm from a book, or what not. This is my core point.
(Aside: It’s amazing how many different ways we can agree)
to have looked at that code previously, and without referring to it,
That’s the thing, I don’t think this is possible. My mind is already deeply affected by what I read there. The structure of the code, the variables; whatever I write afterwards would look a lot like that version of the code. I have also quite publicly declared that I’ve read the R code. I’d rather just play it safe and be able to say that whatever version ends up in statsmodels was cleanly reverse engineered.