I used this paper’s technique in a wireless sensor network research project (embedded hardware, very limited RAM) back in like 2006 or so. So, a fun trip down memory lane seeing it posted here! :)
Is there nothing that builds upon this idea since? Surely stats dashboards (StatsD / Graphite, etc.) would want to use something like this I would think.
We show that a very good estimate of any quantile can be obtained by locating points at the minimum, the maximum, and at the current estimates of (p/2)-, p-, and (1 + p)/2-quantiles. This divides observations into four cells whose boundaries (called marker heights) are adjusted if necessary using a Piecewise-Parabolic (PP or P²) formula.
This algorithm lets you estimate percentiles (or any quantile) in a constant amount of memory. The naive approach, by contrast, is to capture every data point and compute the percentiles from the full set, which means high-throughput streams of data require a lot of memory. This seems to be an old algorithm that’s unfortunately been forgotten.
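For anyone curious what "constant memory" looks like in practice, here is a rough sketch of the P² algorithm from the paper in Python. The five markers (min, max, and the three intermediate quantile estimates the abstract mentions) are the entire state; the class and method names are my own, and this is a best-effort reading of the paper rather than a vetted implementation:

```python
import random


class P2Quantile:
    """Streaming p-quantile estimator per the P^2 (piecewise-parabolic)
    algorithm of Jain & Chlamtac (1985). Keeps only five markers, so
    memory use is constant regardless of stream length."""

    def __init__(self, p):
        self.p = p
        self.q = []                                 # marker heights (quantile estimates)
        self.n = [0, 1, 2, 3, 4]                    # actual marker positions
        self.np = [0, 2 * p, 4 * p, 2 + 2 * p, 4]   # desired marker positions
        self.dn = [0, p / 2, p, (1 + p) / 2, 1]     # desired-position increments
        self.count = 0

    def add(self, x):
        self.count += 1
        if self.count <= 5:
            # Bootstrap: the first five observations become the markers.
            self.q.append(x)
            self.q.sort()
            return
        # Find the cell k that contains x, widening min/max markers if needed.
        if x < self.q[0]:
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:
            self.q[4] = x
            k = 3
        else:
            k = next(i for i in range(4) if self.q[i] <= x < self.q[i + 1])
        # Markers above the cell shift right by one; desired positions advance.
        for i in range(k + 1, 5):
            self.n[i] += 1
        for i in range(5):
            self.np[i] += self.dn[i]
        # Nudge each interior marker toward its desired position.
        for i in range(1, 4):
            d = self.np[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = 1 if d > 0 else -1
                qn = self._parabolic(i, d)
                if not (self.q[i - 1] < qn < self.q[i + 1]):
                    qn = self._linear(i, d)  # fall back to linear adjustment
                self.q[i] = qn
                self.n[i] += d

    def _parabolic(self, i, d):
        n, q = self.n, self.q
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def _linear(self, i, d):
        return self.q[i] + d * (self.q[i + d] - self.q[i]) / (self.n[i + d] - self.n[i])

    def value(self):
        return self.q[2]  # the middle marker is the p-quantile estimate
```

Feeding it, say, 10,000 uniform random samples with `p=0.5` should yield an estimate close to the true median (~0.5), while holding only a handful of floats in memory the whole time.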
I haven’t had a chance to look it over carefully yet, but this one seems very similar; I’m not sure exactly how it compares.
For estimating histograms:
I assure you that these algorithms are alive and well in databases.