Warming up your VM and establishing statistical significance are definitely important. Most Haskell implementations have a runtime and produce statically compiled and optimized binaries in lieu of a VM. Criterion is how Haskellers benchmark stuff.
Introductory blog post: http://www.serpentine.com/blog/2009/09/29/criterion-a-new-benchmarking-library-for-haskell/
dons’s post about it: http://donsbot.wordpress.com/2010/02/23/modern-benchmarking-in-haskell/
Using criterion to benchmark HTTP parsers: http://www.serpentine.com/blog/2014/05/31/attoparsec/
I think bos recently put up a talk about criterion on YouTube, but I couldn’t find it. The closest I have is this one:
https://skillsmatter.com/skillscasts/5466-bryan-o-sullivan (requires sign-up, otherwise free)
This is a good survey talk and a good explanation of how to do benchmarking, especially for the JVM. I think she avoided making it too runtime-specific, which is probably a good tack for this kind of talk.
On the JVM, I’ve found that jmh works quite well. There’s a little more support in general for caliper, but jmh is catching up, and it is quite good at forcing you to write good benchmarks. Aleksey Shipilev (admittedly biased) gives a good explanation of what jmh has learned from the mistakes of its predecessors in the “JMH vs Caliper” thread on mechanical sympathy. Caliper also doesn’t seem to be actively maintained in OSS (the last OSS change was in January), so in general I think it makes more sense to use jmh over caliper, although caliper used to be the industry standard.
She only pointed in general terms at the problem at 128K keys: caching was probably not effective, because she wasn’t trying to take advantage of cache locality. I wonder if the problem was that Riak started swapping. Presumably that would have been ameliorated because M3 uses local SSDs and not EBS for persistent storage, but it could still be quite painful, given the 50% random reads. It should still generally look like a cache hierarchy slowdown (persistent state can be thought of as just another cache), but the constant-factor penalty for going to persistent storage is much higher than for RAM. It would be interesting to hear from one of the Riak folks on what they think was going on. (I think a few of you are here: @cmeiklejohn?)
I wonder if the problem was that Riak started swapping.
With the maximum number of keys = 1,024,000, we get 1,024,000 * ((22+4) overhead bytes + 4 key size bytes + 10000 value size bytes) * (3/5) = 5877 MB total memory used per machine, which is smaller than the RAM size of 8GB. Even if we take into account the expected number of machines accessed per operation (to include temporary copies in the coordinators), that still gives us 6600 MB. So from this I think we can safely conclude that we were not swapping.
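For anyone who wants to check that arithmetic, here is a small Python sketch of the per-machine estimate. The variable names are mine, reading the 3/5 factor as three copies spread over five machines is my assumption, and so is treating "MB" as 2^20 bytes (which is what reproduces the 5877 figure). It does not try to reproduce the 6600 MB number, since the coordinator factor is not given here.

```python
# Rough per-machine memory estimate, using only the figures quoted above.
KEYS = 1_024_000
OVERHEAD_BYTES = 22 + 4      # per-object overhead, as quoted
KEY_BYTES = 4
VALUE_BYTES = 10_000
COPIES, MACHINES = 3, 5      # assumption: this is what the 3/5 factor means


def per_machine_mib(keys=KEYS):
    """Estimated memory per machine, in units of 2**20 bytes."""
    per_object = OVERHEAD_BYTES + KEY_BYTES + VALUE_BYTES
    total_bytes = keys * per_object * COPIES / MACHINES
    return total_bytes / 2**20


if __name__ == "__main__":
    estimate = per_machine_mib()
    ram = 8 * 1024  # 8GB of RAM in the same units
    print(f"{estimate:.0f} MB used, {ram} MB available")
```

Dropping the key count to 4,000 or 128,000 shows how far below RAM the smaller runs were under the same assumptions.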
Notice the scale on the graph: the axis does not start at 0, and I observed an increase of only 200 usec from 4k to 1,024k keys.
I used Bitcask in this benchmark, so I’m inclined to think that this was probably due to block cache churn.