I know it’s a bit unconventional to link to an issue comment, but this was a fantastic and insightful write-up. Hat tip to Joe Cutler on bsky for making me aware of it.
What’s a non-temporal store?
The article says:
Typically, the memory hierarchy is organised as a series of different-sized caches with an LRU or LRU-like replacement policy. LRU caches have a well-known performance issue with sequential scans: a single large memset or similar can flush everything out of the cache.
To help with this, some architectures support “non-temporal stores”, which are store instructions that bypass the first few levels of cache, intended for streaming writes (e.g. initialisation) to memory that is not expected to be accessed again soon.
I saw that, but I think I’m just out of my depth.
So don’t cache the writes to some LRU cache in L1. But it sounds like it’s actually slower to do that, though maybe not if it’s actually streaming writes. I think I’m just out of my depth on this one.
CPU caches are designed to optimise for spatial and temporal locality of reference. If every load had to wait for main memory, this would be very slow (stalling the pipeline for at least a hundred cycles even with very fast DRAM). Caches take advantage of two things.
First, that if you’ve accessed some memory, you will (with high probability) access adjacent memory (spatial locality). On Intel chips, for example, even though a cache line is 64 bytes, they fill pairs of cache lines because 128 bytes is the read width of most DRAM and you’ll probably use the other half.
Second, that if you’ve accessed memory recently you will probably access the same memory again soon (temporal locality). Last time I actually ran some traces, I think over 90% of memory accesses were to the stack (which is why most modern architectures aggressively special case stack accesses). Stacks are the extreme case of temporal locality because the frames at the top of the stack are constantly reused. Even if a cache just works for the stack, it will avoid 90% of memory accesses going to DRAM. Fortunately, these patterns show up in a lot of other places too. If you access an object, you’re likely to access it more than once.
Importantly for this discussion, temporal locality often spans loads and stores. If you’ve written some memory, you will often read it again soon. If you’ve read some memory, you will often write it again soon.
Most of the time, these are good heuristics. If you assume that most memory accesses exhibit spatial and temporal locality then you’re going to be right most of the time. Most importantly, you’re going to be right often enough that you’ll see a big speedup.
But sometimes you’ll be wrong. And when you’re wrong, you can be very wrong. For example, if you do a 1 MiB memset and the CPU assumes temporal locality, it will evict 1 MiB of other things from the cache to keep the memset’d range in cache. If you then access some of the objects that were evicted from the cache, it will be slow.
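A quick way to see this effect, as a sketch I’m adding for illustration (sizes and names are made up, not from the linked issue): warm up a small working set, do a 1 MiB memset, then time the working set again. On a machine where the fill really does displace the hot data, the second timing jumps.

```c
/* Illustrative sketch only: time a hot working set before and after a
 * 1 MiB memset. On a CPU that treats the memset as non-temporal, the
 * second timing may barely change. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define HOT_SIZE  (256 * 1024)    /* working set small enough to stay cached */
#define FILL_SIZE (1024 * 1024)   /* the 1 MiB fill from the example above */

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static long touch(volatile unsigned char *buf, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i += 64)   /* one load per cache line */
        sum += buf[i];
    return sum;
}

int main(void)
{
    volatile unsigned char *hot = malloc(HOT_SIZE);
    unsigned char *fill = malloc(FILL_SIZE);

    touch(hot, HOT_SIZE);                /* warm the cache */
    double t0 = now_ns();
    long a = touch(hot, HOT_SIZE);       /* should mostly hit in cache */
    double t1 = now_ns();

    memset(fill, 0, FILL_SIZE);          /* may evict the hot working set */

    double t2 = now_ns();
    long b = touch(hot, HOT_SIZE);       /* slower if the hot set was evicted */
    double t3 = now_ns();

    printf("warm: %.0f ns, after memset: %.0f ns (%ld, %ld)\n",
           t1 - t0, t3 - t2, a, b);
    free((void *)hot);
    free(fill);
    return 0;
}
```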
To avoid this, modern ISAs let you provide a non-temporal hint to the store that says ‘this one probably doesn’t exhibit temporal locality’. The simplest implementation of this is to just bypass the cache for these stores. That’s usually a bad idea, so a slightly more complex version puts them in the cache but moves the cache line to the head of the queue for eviction, so the next thing to go into that cache associativity set will evict the non-temporal store. You can also turn on store-through behaviour, so the write out to DRAM has already happened by the time you evict the cache line (this can be complex, depending on the cache structure).
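On AArch64 the explicit hint is stnp; on x86-64 it’s the movnt family. Here’s a rough sketch of a non-temporal fill using the SSE2 intrinsics (my own illustration, not from the article; nt_memset is just a name I made up, and it assumes dst is 16-byte aligned):

```c
/* A minimal sketch of a non-temporal memset on x86-64 using SSE2
 * streaming stores, which bypass the normal cache-fill path so a large
 * fill doesn't evict the rest of the working set. */
#include <emmintrin.h>  /* _mm_stream_si128, _mm_set1_epi8, _mm_sfence */
#include <stddef.h>

void nt_memset(void *dst, int value, size_t len)
{
    unsigned char *p = dst;
    __m128i v = _mm_set1_epi8((char)value);

    /* Bulk of the buffer (assumed 16-byte aligned) with streaming stores. */
    size_t chunks = len / 16;
    for (size_t i = 0; i < chunks; i++)
        _mm_stream_si128((__m128i *)(p + 16 * i), v);

    /* Tail bytes fall back to ordinary stores. */
    for (size_t i = chunks * 16; i < len; i++)
        p[i] = (unsigned char)value;

    /* Streaming stores are weakly ordered; fence before anyone reads. */
    _mm_sfence();
}
```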
It looks as if the Apple CPUs are detecting memset-like patterns and applying the non-temporal heuristics. This is probably a good idea most of the time but, as the comment points out, a very bad idea sometimes.
Hopefully the M5 will include the memset and memcpy instructions from recent Arm specs and so not need this kind of heuristic anymore (except perhaps for legacy code).
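For reference, that’s FEAT_MOPS from Armv8.8-A: memset becomes a three-instruction prologue/main/epilogue sequence, so the CPU sees the whole length up front instead of having to guess from a run of ordinary stores. A rough inline-asm sketch, written from memory, so treat the operand syntax and the -march string as assumptions and check the Arm ARM:

```c
/* Sketch only: the Armv8.8-A FEAT_MOPS memset instructions wrapped in
 * inline asm. Needs a toolchain new enough to accept something like
 * -march=armv8.8-a (or +mops). */
#include <stddef.h>

static void mops_memset(void *dst, int value, size_t len)
{
    __asm__ volatile(
        "setp [%0]!, %1!, %2\n\t"   /* prologue: setup and alignment      */
        "setm [%0]!, %1!, %2\n\t"   /* main part, implementation's choice */
        "sete [%0]!, %1!, %2"       /* epilogue: remaining tail bytes     */
        : "+r"(dst), "+r"(len)
        : "r"((unsigned long)(unsigned char)value)
        : "cc", "memory");
}
```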
If I had to speculate, I would guess that this particular heuristic is tuned for something that Rosetta2 emits for some common x86 sequence, so maybe the memset instructions wouldn’t help.
It’s slower to do, but if you’re not reading any of those values back soon, the entire process is faster (because you don’t have to wait until any of them are done) and doesn’t turn your cache over (and so doesn’t slow down other ops as collateral). If you do read one of those values, though, it sucks.
Oh!
They were reading them back. Ok, I guess the example implies that, but I missed it.
In this case, it’s a barrier instruction, but similar effect. From the article:
However, certain instructions such as stlr and dmb wait for prior stores to complete before retiring. Since a non-temporal store takes much longer than a normal store to fully complete (it needs to access faraway memory, not just L1 cache), such instructions stall for much longer in a loop containing non-temporal stores.
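The shape being described is roughly this (a sketch, not the actual code from the issue): ordinary stores in a loop followed by a release store, which becomes stlr on AArch64 and can’t retire until the earlier stores have drained.

```c
/* Illustrative only: each iteration does plain stores and then a release
 * store (stlr on AArch64). If the plain stores are treated as
 * non-temporal, the stlr has to wait for them to complete far beyond L1,
 * so the loop stalls. The names here are made up for the example. */
#include <stdatomic.h>
#include <stddef.h>

void publish_items(int *items, size_t n, atomic_size_t *ready_count)
{
    for (size_t i = 0; i < n; i++) {
        items[i] = (int)i;   /* plain stores: the memset-like stream */

        /* Compiles to stlr on AArch64; it waits for the stores above. */
        atomic_store_explicit(ready_count, i + 1, memory_order_release);
    }
}
```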