I was fascinated by this article’s charts; they provide a rare insight into VM warmup, including some surprising discoveries. It’s a great example of a beautiful use of data visualization to illustrate technical data with simple, elegant, and, most importantly, insightful charts. More of that, please!
Holy misleading axes. The post-warmup improvements look impressive and significant until you realize that the y-axis range is chosen to make them look so. The LuaJIT improvement is about 7 milliseconds, or around 1.2%. The other improvements are on a similar scale.
I don’t think differences of this magnitude even come close to supporting the thesis of the article.
I don’t think this is entirely fair - yes, the differences are not dramatic, but as the article says “We fairly frequently see performance get 5% or more worse over time in a single process execution. 5% might not sound like much, but it’s a huge figure when you consider that many VM optimisations aim to speed things up by 1% at most. It means that many optimisations that VM developers have slaved away on may have been incorrectly judged to speed things up or slow things down, because the optimisation is well within the variance that VMs exhibit.”
I might be a bit more generous if the article called out the actual differences, rather than just pointing at misleading graphs – seriously, with an honest y-axis it’d be hard to even notice them.
Regardless, while a 5% improvement is not trivial, it is trivial in the context of the article, which is people complaining about slow VMs. That’s in the noise as far as general programming language performance goes.
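For what it’s worth, here’s a tiny matplotlib sketch (with invented numbers on roughly the scale we’re talking about, ~7ms on a ~580ms baseline) of why the axis range matters so much:

```python
# Illustrative only: invented timings on roughly the scale under discussion
# (~7ms of improvement on a ~580ms baseline, i.e. about 1.2%).
import matplotlib.pyplot as plt

iterations = list(range(1, 11))
secs = [0.583, 0.582, 0.582, 0.581, 0.580, 0.579, 0.578, 0.577, 0.576, 0.576]

fig, (ax_zoom, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

# Truncated y axis: the ~1% drop fills the whole plot and looks dramatic.
ax_zoom.plot(iterations, secs)
ax_zoom.set_title("truncated y axis")

# Zero-based y axis: the same data, and the drop is barely visible.
ax_zero.plot(iterations, secs)
ax_zero.set_ylim(0, max(secs) * 1.1)
ax_zero.set_title("zero-based y axis")

for ax in (ax_zoom, ax_zero):
    ax.set_xlabel("in-process iteration")
    ax.set_ylabel("time (s)")

plt.tight_layout()
plt.show()
```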
Side note, Unipycation - A Case Study in Cross-Language Tracing looks pretty sweet!
❤️
I guess real world input is more prone to predictability than input designed to not stabilize.
I’m not sure what you mean. Before I answer, can you clarify what you mean by “input” here? Input to the benchmarking programs (e.g. CLI arguments), or the benchmark programs as inputs to the VMs?
I refer to the work put in to make sure the VMs didn’t optimize code away into constants or other such “cheating” optimizations. I suspect most real programs do stumble into readily optimized paths, then don’t change over the course of running.
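To make the concern concrete, here’s a hypothetical sketch of the kind of thing I mean (whether any particular VM actually eliminates the first loop is, of course, VM-dependent):

```python
# Hypothetical micro-benchmark sketch. The first loop's result is never
# observed, so a sufficiently clever JIT could constant-fold or
# dead-code-eliminate it; the second feeds an accumulator that the caller
# must consume, which is the kind of "anti-cheating" precaution I mean.
import time

def naive_bench(n):
    for _ in range(n):
        x = 2 * 3 + 4            # result discarded: eligible for elimination

def guarded_bench(n):
    acc = 0
    for i in range(n):
        acc += (i * 3 + 4) % 7   # result is data-dependent and observed
    return acc

for bench in (naive_bench, guarded_bench):
    start = time.monotonic()
    result = bench(1_000_000)
    print(bench.__name__, time.monotonic() - start, result)
```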
That’s an open question which I wouldn’t be able to answer without doing another (lengthy) experiment.
But if you were to ask for my gut feeling: yeah, depending on the program, some code paths might be more readily optimisable. But on the other hand I suspect that real programs:
Are much larger than your typical CLBG benchmark.
Are less deterministic than our (deterministic) benchmarks, and the path taken through the CFG is likely to depend upon (e.g.) RNGs seeded with the time, or stuff read from outside the program itself, like the environment and file descriptors, etc.
I think those factors would negatively impact JIT compilation, and for this reason I suspect that most real-world programs are less likely to stabilise, and when they do, would take much longer.
But, like I said, this is pure speculation and my co-authors may even disagree with me on this.
In any case, more research needed!
Thank you for your work on this! I especially appreciate the effort you went through to ensure a clean environment for each run.
Could you create an instrumented build of Chromium and measure this in the wild? Perhaps with the Alexa top 1k? The problem is getting deep application usage, which would most likely require a human to drive the application.
I guess it’s possible (you could use selenium to drive the browser).
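Something along those lines might look like the sketch below; it only drives the pages, the URLs are placeholders, and the actual warmup data would still have to come from the instrumented build:

```python
# Rough sketch of the driving side only; the URLs stand in for a top-1k list
# and the real warmup measurements would come from the instrumented VM.
from selenium import webdriver

SITES = ["https://example.com", "https://example.org"]  # placeholder site list

driver = webdriver.Chrome()  # assumes a chromedriver on PATH
try:
    for url in SITES:
        driver.get(url)
        # Navigation Timing gives only a coarse per-page proxy for load time.
        elapsed_ms = driver.execute_script(
            "var t = performance.timing;"
            "return t.loadEventEnd - t.navigationStart;"
        )
        print(url, elapsed_ms, "ms")
finally:
    driver.quit()
```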
“prone”?
Having a tendency.
“likely or liable to suffer from, do, or experience something unpleasant or regrettable”
The word is quite negative.
Mildly negative at most, according to Webster’s.