In another lifetime, I implemented a similar idea in an embedded computing project. I used the approach from a 1985 paper, “The P² Algorithm for Dynamic Calculation of Quantiles and Histograms Without Storing Observations” by R. Jain and I. Chlamtac, which tracks only a few “markers” for different quantiles and updates them as new data comes in. It worked nicely on an embedded device with very little memory.
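For anyone curious what those markers look like, here is a minimal single-quantile sketch of the P² update in Python. The class and variable names are mine, and this is a simplification of the paper's presentation (which also covers histograms), not production code:

```python
import random


class P2Quantile:
    """Streaming estimate of one quantile via the P^2 algorithm.

    Keeps five markers: the minimum, the p/2, p, and (1+p)/2 quantile
    estimates, and the maximum. Memory use is constant regardless of
    how many observations arrive.
    """

    def __init__(self, p):
        self.p = p
        self.q = []                                   # marker heights
        self.n = [0, 1, 2, 3, 4]                      # actual marker positions
        self.np = [0, 2 * p, 4 * p, 2 + 2 * p, 4]     # desired positions
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]       # desired-position increments

    def add(self, x):
        if len(self.q) < 5:
            # Bootstrap: the first five observations become the markers.
            self.q.append(x)
            self.q.sort()
            return
        # Find the cell k with q[k] <= x < q[k+1], clamping at the extremes.
        if x < self.q[0]:
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:
            self.q[4] = x
            k = 3
        else:
            k = 0
            while x >= self.q[k + 1]:
                k += 1
        for i in range(k + 1, 5):
            self.n[i] += 1
        for i in range(5):
            self.np[i] += self.dn[i]
        # Nudge interior markers that drifted off their desired positions,
        # preferring the paper's parabolic (piecewise-quadratic) prediction.
        for i in (1, 2, 3):
            d = self.np[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = 1 if d > 0 else -1
                qn = self._parabolic(i, d)
                if not (self.q[i - 1] < qn < self.q[i + 1]):
                    qn = self._linear(i, d)
                self.q[i] = qn
                self.n[i] += d

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def _linear(self, i, d):
        q, n = self.q, self.n
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    def estimate(self):
        # The middle marker tracks the target quantile.
        return self.q[2] if len(self.q) == 5 else sorted(self.q)[len(self.q) // 2]


rng = random.Random(42)
median = P2Quantile(0.5)
for _ in range(10000):
    median.add(rng.random())
print(round(median.estimate(), 3))   # close to 0.5, the true median of Uniform(0, 1)
```

The whole state is five heights plus a few counters, which is why it suits devices with almost no memory.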
Correct me if I’m wrong, but the main reason for keeping track of percentiles is that the metrics programs typically collect are hardly ever normally distributed. If you have a normal distribution, it is completely described by its mean and variance, so there’s not much point in storing more. If it’s not normal (and in metric collection it rarely is), then you need to keep quantiles (medians, percentiles) to understand the shape and properties of the distribution.
Assuming the data is normally distributed, and using the variance to store percentiles efficiently, seems unlikely to help in that scenario.
You’re very right. If you know everything about the distribution, you do not need to “collect” the percentiles, as you can easily calculate them. However, in our case the distribution is unknown. It merely looks normal, which is why we used this technique.
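As a concrete illustration of that point (using Python’s standard library, not anything from the article): for a fully known normal distribution, any percentile follows directly from the mean and standard deviation via the inverse CDF.

```python
from statistics import NormalDist

# A hypothetical latency metric that happens to be normal:
# mean 120 ms, standard deviation 15 ms (made-up numbers for illustration).
latency = NormalDist(mu=120, sigma=15)

# Percentiles fall straight out of the inverse CDF -- nothing to "collect".
p50 = latency.inv_cdf(0.50)   # the mean, by symmetry
p99 = latency.inv_cdf(0.99)   # mean + ~2.33 standard deviations
print(round(p50, 1), round(p99, 1))
```

The catch, as noted above, is that real metric streams only *look* normal at best, so the two parameters never quite tell the whole story.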
I liked the article! I did think the discussion of decaying maxima got a bit dodgy, though. The decaying maximum looks kludgey, and while it may be perfectly good for some purposes, it is unlikely to have nice general properties (say your data is very bursty) and may not interface nicely with other statistics you may want to compute on the stream.
Indeed, it’s very “experimental” and probably quite scary to any “real” mathematician. However, we found that it’s a good enough approximation for the cases where we want to observe how a data stream evolves.
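For concreteness, here is one common shape such a decaying maximum can take. The multiplicative decay and the factor below are my assumptions for illustration; the exact formula discussed above may differ:

```python
def decaying_max(stream, decay=0.99):
    """Running maximum that gradually 'forgets' old peaks.

    Each step shrinks the stored maximum by `decay` (an illustrative
    choice, not a parameter from the article) before comparing it with
    the new sample, so a burst's peak fades instead of sticking forever.
    """
    m = float("-inf")   # -inf * decay stays -inf, so the first sample always wins
    history = []
    for x in stream:
        m = max(x, m * decay)
        history.append(m)
    return history


print(decaying_max([10, 0, 0, 0], decay=0.5))   # [10, 5.0, 2.5, 1.25]
```

The bursty-data objection is easy to see here: a single spike dominates the estimate for many steps afterwards, and how many depends entirely on the arbitrary decay factor.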