There’s an opposite “paradox” that sometimes comes into play when benchmarking, say, web site performance. Suppose you run a minute-long test: for the first 15 seconds, each request comes back in 5 milliseconds; then on the 3000th request the site freezes for 30 seconds, so that request takes 30 seconds to complete; and then for the remaining 15 seconds, all 3000 requests come back in 5 milliseconds each. The average time of your test requests will be 60 seconds / 6000 = 10 milliseconds. But the average response time of the site during your test, sampled by time instead of by requests, was actually more like 7500 milliseconds.
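A rough sketch of the arithmetic above, in Python (the request counts and timings are the hypothetical scenario’s, not measured data):

```python
# Reconstruct the minute-long test: ~3000 fast requests (5 ms each),
# one request stalled behind a 30-second freeze, then ~3000 more fast ones.
samples_ms = [5.0] * 3000 + [30_000.0] + [5.0] * 3000

# Request-weighted mean: total waiting time divided by request count (~10 ms).
request_mean = sum(samples_ms) / len(samples_ms)

# Time-weighted view: once per millisecond of the 60-second window, ask
# "how long would a request issued right now take?" During the freeze,
# a request issued t ms into it would wait roughly (30_000 - t) ms.
timeline_ms = (
    [5.0] * 15_000                            # first 15 s: fast
    + [30_000.0 - t for t in range(30_000)]   # freeze: ramps from 30 s to 0
    + [5.0] * 15_000                          # last 15 s: fast
)
time_mean = sum(timeline_ms) / len(timeline_ms)   # ~7500 ms
```

The two means differ by almost three orders of magnitude on the same minute of behaviour, which is the whole point.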
I forget what this phenomenon is called, but it actually happens commonly in real life.
In this situation where you have the raw timing data, though, you could just as easily generate percentiles that more clearly depict the nuanced behaviour of the server. In cases like these, the mean often produces a nearly meaningless number that conveys neither the general trend nor the interesting outliers.
I think you may not have understood the problem I was talking about. It isn’t the use of the mean.
The problem in this case is that you don’t actually know whether requests you could have made during that 30-second window (during which you weren’t making any requests) would have returned immediately, or would all have been delayed as I hypothesized. Indeed, it’s easy to write a system that fails to shed load and will never recover from an overload condition like that once the requests start piling up.
Simply taking percentiles of the raw data doesn’t help. In this case, 5999 out of 6000 requests came back within 5 milliseconds. The .999 quantile of the response time is still 5 milliseconds. This paints an even more unjustifiably rosy picture than the mean!
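To make that concrete, here’s a small sketch using nearest-rank quantiles over the same hypothetical counts; with 6000 of 6001 samples fast, the stall only surfaces above roughly the 99.98th percentile:

```python
import math

# Raw per-request samples from the scenario: 6000 fast, one 30-second stall.
samples_ms = sorted([5.0] * 6000 + [30_000.0])

def quantile(sorted_xs, q):
    # Nearest-rank quantile: smallest value with at least a fraction q
    # of the samples at or below it.
    k = max(1, math.ceil(q * len(sorted_xs)))
    return sorted_xs[k - 1]

p999 = quantile(samples_ms, 0.999)      # 5.0 ms -- the stall is invisible
# The stall only appears for q > 6000/6001, i.e. above about 0.99983.
p_extreme = quantile(samples_ms, 0.99984)   # 30_000.0 ms
```

So unless you monitor a quantile beyond the fraction of requests you actually managed to issue during the outage, the percentile view is just as rosy as the mean.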
Ah - indeed. Yes, an important scenario. Thank you for clarifying.
The classic lesson here is not to look only at the average: draw a histogram of latency by percentage of requests, and decide which percentiles to monitor based on what they mean for the user of your particular product.
I think my grandparent comment was super unclear, because both you and @ajisaiko didn’t understand what I was getting at; see my attempted clarification.
Hah, wow. I can’t believe I never thought about this before. Which is the author’s point. :)