I would urge people who care about performance to still be considerate of caching issues. For a great example of how caring about instruction cache can actually help a lot in database workloads, I really enjoy this Scylla post on using top-down analysis to guide effort toward batching certain activities, which reduces instruction cache misses and results in significant performance gains.
There are also several great Mechanical Sympathy posts on possible costs like these, and the underlying architecture hasn’t changed much since those articles were written 9 years ago.
Personally, I’ve measured massive CPU utilization gains in specific components while building sled by adding batching to background processes like cache maintenance, inspired by Ben Manes’s excellent Caffeine library and his high-level performance mantras. Batching LRU maintenance tasks produced a near-linear reduction in overall CPU cost in proportion to batch size, due primarily to instruction cache friendliness combined with a prefetcher-friendly batch layout. Overall this saved 8% of sled’s CPU cycles on a 95% random read / 5% random write workload with 8-byte keys and 64-byte values, according to perf.
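To make the batching idea concrete, here is a minimal Rust sketch (not sled’s or Caffeine’s actual code; the type and the numbers are made up for illustration) of an LRU cache that records accesses cheaply on the hot path and only updates its recency structure in batches, so the maintenance code runs over many entries while it is hot in the instruction cache:

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical sketch of batched LRU maintenance: lookups only append to a
/// small buffer; the recency list is updated in batches so the maintenance
/// code stays resident in the instruction cache while it runs.
struct BatchedLru {
    map: HashMap<u64, Vec<u8>>, // key -> value
    order: VecDeque<u64>,       // recency order, least recently used at the front
    pending: Vec<u64>,          // accesses recorded on the hot path, applied later
    batch_size: usize,
    capacity: usize,
}

impl BatchedLru {
    fn new(capacity: usize, batch_size: usize) -> Self {
        Self {
            map: HashMap::new(),
            order: VecDeque::new(),
            pending: Vec::with_capacity(batch_size),
            batch_size,
            capacity,
        }
    }

    fn get(&mut self, key: u64) -> Option<&Vec<u8>> {
        if !self.map.contains_key(&key) {
            return None;
        }
        // Hot path: just record the access instead of touching the LRU list.
        self.pending.push(key);
        if self.pending.len() >= self.batch_size {
            self.apply_pending();
        }
        self.map.get(&key)
    }

    fn insert(&mut self, key: u64, value: Vec<u8>) {
        if self.map.insert(key, value).is_none() {
            self.order.push_back(key);
        }
        while self.map.len() > self.capacity {
            match self.order.pop_front() {
                Some(victim) => {
                    self.map.remove(&victim);
                }
                None => break,
            }
        }
    }

    /// Cold(er) path: apply all recorded accesses in one pass. A real
    /// implementation would use an intrusive list instead of this O(n) scan.
    fn apply_pending(&mut self) {
        for key in self.pending.drain(..) {
            if let Some(pos) = self.order.iter().position(|&k| k == key) {
                self.order.remove(pos);
                self.order.push_back(key);
            }
        }
    }
}

fn main() {
    let mut cache = BatchedLru::new(1024, 64);
    cache.insert(1, b"value".to_vec());
    assert!(cache.get(1).is_some());
}
```

Caffeine goes much further (roughly: striped, lossy read buffers and scheduled maintenance), but the shape is the same: pay a cheap append on the hot path and amortize the expensive pointer-chasing over a batch.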
Another thing is that cache friendliness has a lot of benefits that are invisible when benchmarking on your own local system. Real programs run in mixed environments where other programs, and generally non-core code, compete with the core system logic for cache residency. We have a tendency to maximize system utilization when looking at a specific component’s performance, because it improves the numbers we happen to be looking at, but it comes at the cost of other components’ performance. This is a big part of why complex systems tend to get crappier and crappier over time - people will “improve” small aspects by increasing resource utilization rather than improving efficiency to get the desired gains, leaving fewer resources available to other consumers of compute, and making the overall system worse in ways that are difficult to detect unless you monitor realistic overall workloads on realistic hardware. Aiming for better efficiency instead of higher utilization tends to lead to actually faster real-world systems. Even so, it’s hard to predict how changes to cache access patterns will impact real-world workloads: it’s possible to reduce CPU cycles and cache misses and still get something slower, because the specific memory access patterns reduce the actually-deployed system’s prefetcher effectiveness, write combining, or contention avoidance.
Edit: oh yeah, and here’s an amazing paper that goes into when high-branch access patterns (nominally worse for cache) actually behave better than seemingly cache-friendlier patterns, accounting for things like cache size etc… using search layouts as the example. This is a really cool paper.
Related recent post: https://matklad.github.io/2021/07/10/its-not-always-icache.html

It’s not an issue I have much direct experience with, but I know it’s complicated :) One thing I will say is that source code tends to live a lot longer than CPU microarchitectures. Which is why this kind of thing isn’t exposed at all in C or Rust, etc.
Another example of the benefits of cache optimization is HHVM. The following excerpt is from Chapter 3 of The Mature Optimization Handbook.

HHVM today is about three times faster than it was a year ago. Then, as now, it spent about 20% of time in the JIT output, and about 80% in the C++ runtime. The great mystery for you to ponder, and I would hope for your book to explain, is that we got three times faster by focusing our optimization efforts on the code that was executing for 20% of the time, not 80% of the time. Go back and reread that.
When you internalize that this sort of thing really happens, in real programs that you’re responsible for, and that we’re not talking about pocket change, but a 3x difference in performance, it is scary.
Learning which functions the CPU cycles are being spent on can actively deceive you about what to optimize. The advice we give intermediate programmers about “premature optimization” and allowing profiling to drive your optimization efforts is, well, untrue. Or rather, it’s a heuristic that helps compensate for even more dangerously wrong optimization impulses.
– Keith Adams
Just got around to reading this chapter. Remarkable! And quite short, in case anyone’s hesitant about reading.
Worse was that normal profiling, even advanced stuff like VTune and the Linux “perf” kernel module, weren’t very useful. They will happily tell you what functions suffer the most cache misses. But they don’t tell you why the data wasn’t there.
After giving me a bit of a scare, this seems to validate my use of perf to figure out whether instruction caching is an issue. Now of course L1 is as big as it is for a reason and it’s possible to overflow it. What I argued and tested was that for interpreters this doesn’t happen. Other kinds of programs could be different. And we’re using JIT compilation already in BQN so it’s good to be warned that this could become such a case in the future. Don’t guess when you can measure. Learn to use perf!
I wish the author had focused a bit more on the realization that a language can be implemented in many different ways. Many of their critiques are rooted in the false belief that languages must map certain syntactic features to implementation details.
[K is] an interpreter (if a good one) for a dynamically-typed language, and will be slower than compiled languages like C and Go, or JIT-compiled ones like Javascript and Java.
The paragraph containing this sentence is a hash, because this sentence is bogus. To see why, take Python and ask which group it belongs to. Is it with K and BQN, because it interprets dynamically-typed programs? Yes; CPython does that. Is it compiled like C and Go? Yes, Nuitka does that. Is it JIT-compiled like JavaScript and Java? Yes, PyPy does that.
Note that language comparisons aren’t “one-dimensional”; they don’t have the Archimedean property. This means that we can’t take comparisons between CPython, Nuitka, and PyPy, and use them as gauges to determine which language compilation strategies are best for K.
Yes, I’ve used the names of languages to refer to their best known implementations: CPython, gcc/clang, gc, V8/SpiderMonkey, and OpenJDK or similar. Many readers aren’t going to know all these names and the ones that do should (I thought?) know what’s meant. Is there anything other than this use to suggest I don’t understand the difference? I do, so I’d like to clear this up if possible.
I think I’ve been careful to distinguish between implementations of the languages in question, such as ngn/k versus Whitney’s K. While I can’t check the source code, I say that Whitney’s Ks use interpreters (now updated to bytecode VM specifically) because users and implementers consistently refer to them in this way, and if they were JITting it would be very surprising not to hear anything about it. Documents like this should confirm that I understand the wide range of implementation possibilities.
I stand corrected and humbled. I appreciate this clarification, and apologize for wrongly saying that you did not understand the details of the topic.

I appreciate the apology! And at least you prompted me to add a little note to the article about this issue. Language versus implementation is such a common misunderstanding, and I really should be pointing it out in anything written for a general audience like this one.
The reason I have no measurements is that every contract for a commercial K includes an anti-benchmark clause. For example, Shakti’s license says users cannot “distribute or otherwise make available to any third party any report regarding the performance of the Software benchmarks or any information from such a report”.
So, it’s automatically not worth using and the creators are gigantic tools. Gotcha.
I agree, but then again .NET and Oracle Java used to come with the same clause (maybe they still do? I don’t know¹). While for me this is yet another reason for not using these technologies, the vast majority of the world is running on .NET/Java and doesn’t seem to care.
¹ Oracle Java and Microsoft SQL Server seem to still have this clause. Unsure about .NET.
Azul is not shy about Zing and Zulu benchmarks. I think that this is a dichotomy between performance-oriented design and performance-chasing design, including in sales and marketing. Oracle forbids benchmarks of their products because of a fundamental market insecurity created by inferior products; they have a mismatch between how they boast in advertisements about their performance, and what they can actually deliver, so they use anti-benchmarking clauses to make up the difference.
To understand the present you must look at the past. .NET Core (the latest incarnation of the .NET platform) is MIT-licensed, with no performance clause. It’s hard to find old licenses for Java, but apparently the JDK was released under the GPL in 2006, and then relicensed(?) after Oracle acquired Sun.