I’m not sure I 100% agree with all of the analogies, but they’re all quite plausible. I think the more useful comparison is to think of a CPU’s ISA the same way that you think of a compiler IR: it is not the thing that’s executed, it’s the thing that you lower to the thing that is executed.
In particular, architectural registers are just a dense encoding of data flow over small blocks. Most CPUs now do register renaming (low-end microcontrollers don’t, big ones do; all of Arm’s A-profile cores now do), so their internal representation is SSA form: every rename register is assigned to once and then read multiple times, and the identifier is reused once you reach a point where there are no outstanding references to an SSA value. This happens when a write to the same architectural register leaves speculation, and is why xoring a register with itself before a jump can improve performance on x86 (you mark the SSA value as dead). I find it slightly depressing that we use a fixed register set as an IR in between two SSA IRs, but none of the projects that I’m aware of to fix this (including two that I worked on) have managed to come up with anything more efficient.
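A minimal sketch of that zeroing idiom (the surrounding instructions are made up for illustration):

    imul eax, edi, 37   ; defines one rename (SSA) value for eax
    mov  [rsi], eax     ; last real use of that value
    xor  eax, eax       ; recognised at rename as a zeroing idiom: the old value is dead,
                        ; and the new eax carries no dependency on the multiply
    jmp  next           ; whatever runs at next starts from a fresh, known-zero eax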
Some CPUs are a lot more compiler-like. NetBurst and newer Intel chips have loop caches that let them compile a hot loop to micro-ops (not sure if they optimise this) and then execute it directly. Transmeta CPUs (x86) and NVIDIA’s Project Denver cores (AArch64) expose a conventional ISA but internally JIT it to a VLIW architecture (the Denver architecture is really fun: it has multiple pipelines offset by one cycle so each instruction in a VLIW bundle can feed the result to the next one without allocating a rename register). The big change from Crusoe to Denver was adding a simple map that translated one Arm instruction to one VLIW instruction so that the JIT didn’t have to warm up to execute, only to execute fast.
In particular, architectural registers are just a dense encoding of data flow over small blocks.
This really blew my mind when I first understood this. I’ll unpack this statement a bit more for those interested (might not be the best explanation as I just came up with a random example, but hopefully it helps someone):
Modern CPUs internally have many more registers than the ISA specifies. The ISA registers are just used to encode the dependencies; internally, different registers may be used if the CPU detects that operations on the same register are independent.
For example, if you do something like (Intel syntax):
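(A representative sequence; the exact instructions are assumed here for illustration.)

    add rax, 15        ; first flow: rax = rax + 15
    mov [edi], rax     ; first flow: store the new rax
    mov rax, rdx       ; second flow: rax is overwritten with rdx, so it no longer depends on the add
    mov [esi], rax     ; second flow: store the copied value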
My understanding is that modern CPUs interpret this as two “flows”.
First one:
Add 15 to the internal register that rax refers to (this is a dynamic mapping)
Store the contents of the internal register that rax refers to at the address of the internal register that edi refers to
Second one:
Re-assign the label rax to the internal register that has the label rdx
Store the contents of the internal register that rax refers to at the address of the internal register that esi refers to
So, the nice thing is that once the CPU detects that these are independent (since the contents of rax in the second flow do not depend on the first flow – rax is overwritten by the mov), the second flow can actually be executed before the first flow.
the second flow can actually be executed before the first flow
Sort of—the notion of ‘before’ is a bit tenuous when a CPU is massively concurrent on many levels. To begin with, the writes might in fact be executed concurrently, neither strictly before nor after the other (they almost certainly will be—latency of memory accesses is pretty high; but even ignoring that, on a modern uarch, they could both retire on the same cycle). Another issue is that edi and esi might be equal, denoting the same location (or they might partially overlap); in this case, you would have to make sure that the last write the program performs really does come last. (In particular, if you already know what esi is, but you don’t yet know what edi is, then you want to start the write to esi right away, but you need to be able to back out later if edi turns out to have been equal to it.) Finally, there are what we more usually think of as concurrency issues: multiprocessing/cache coherency. What if we write to esi ‘first’, and then another processor reads from esi, and from edi, and it sees the write to esi but not to edi? Under some architectures, that is legal, but not x86.
Right, good points! So indeed it’s probably not a good example. I hope it still illustrates the point somewhat.
The entire chain was useful, and there’s at least one appreciative thank you here :)
newer Intel chips have loop caches that let them compile a hot loop to micro-ops (not sure if they optimise this) and then execute it directly
What do you mean by ‘optimise’?
the Denver architecture is really fun, it has multiple pipelines offset by one cycle so each instruction in a VLIW bundle can feed the result to the next one without allocating a rename register
Cuute—no bypass network?
Aside: now I wonder if there’s any parallel to be drawn between tricolour and MESI.
At the very least, I wouldn’t be surprised if they did some micro-op fusion before caching the loops.
Intel does both “macro” and “micro” uop fusion. Macro fusion is merging multiple instructions into a single uop. This is done for cmp+jcc combos. Micro uop fusion stores multiple uops together as one until the allocation stage; this is usually done for instructions with memory sources, where the read-from-memory uop and the operation uop are kept together throughout the frontend. This saves space and bandwidth in the cache.
But there’s yet another shortcut for short loops (up to ~50 uops), the loop stream detector, where the uops are directly recycled from the uop queue, bypassing even the cache. This has mostly the same throughput as the uop cache, though; the main purpose here is to save power by turning the decoders off when in a short hot loop.
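To make those two fusion cases concrete, a small sketch (the registers and the label are made up; exactly which pairs fuse varies by microarchitecture):

    cmp rax, rbx        ; macro-fusion: this compare and the following conditional jump
    jne retry           ; can be merged into a single uop at decode
    add rcx, [rdi+8]    ; micro-fusion: the load and the add travel the frontend as one
                        ; fused uop and are only split out at the allocation stage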
This intuitively seems like it would be a bad thing for power efficiency, particularly to the extent that the CPU spends energy on speculation that turns out to be unneeded. To what extent is that the case?
Power and throughput are always a tradeoff, made more complicated by the fact that the best thing for power is often to run quickly and then enter a low-power state. In-order cores with no speculation don’t spend power doing the wrong thing, but they do spend power doing nothing. Whenever a pipeline is stalled doing nothing, you’re still powering the registers and you often don’t have long enough to power gate it, so it’s still consuming power. Architectures like the UltraSPARC Tx series avoid some of this by having multiple threads running, so each one can issue an instruction and when it’s stalled (waiting for memory or results) another runs, but this increases the power that you need for register files.
If you wanted to design a CPU purely for efficiency (with no constraints on programming model) then you would probably build a multithreaded stack machine, so you have a small amount of per-thread state and can always issue one instruction from one of your threads. Such a machine would be a good fit for Erlang-like languages but would be really bad at running C.
I wrote a bit about this almost a decade ago: There’s no such thing as a general-purpose processor
This reminded me of Forth and GreenArrays, which then reminded me of this post by Yossi Kreinin about Forth and stack machines, and the contortions needed to implement MD5 on a GreenArrays chip.
If you wanted to design a CPU purely for efficiency (with no constraints on programming model) then you would probably build a multithreaded stack machine
Not being very close to that level of programming, the way I understood this was:
The compiler has an SSA-like representation, and then does a topological sort to emit, say, valid linear x86 code
The CPU takes the linear x86 code, and basically tries to do a reverse topological sort (at runtime) to recover the parallelism
This seems crazily inefficient, but not too surprising when you’re talking about the software/hardware boundary
I had read about the 1980s dataflow architectures many years ago … which use data dependencies rather than a linear ordering.
It seems like an obviously good idea (?), but narrow waist inertia means that it’s more economical to create this “compatibility layer” than to try to rewrite the interface, like Mill was trying to do
Dynamic scheduling is not a worse-is-better concession to backwards compatibility; it is a way of dealing with the fact that the behaviour of software is dynamic. Particularly along two axes: memory access and control. A priori optimal scheduling is not possible given the unpredictable latency of a memory access. And whereas, at the source level, some term may join the two arms of a branch, at runtime only one arm will be taken at any given time, and the CPU may simply attach the dataflow edge to its final destination. Not that it mightn’t be worthwhile to devise an architecture which is easier to decode and schedule, but it simply wouldn’t be a very big deal.
In other words: you’re not simply recovering the parallelism as it was at compile time; you’re finding parallelism where it exists only at runtime.
Languages like C are far more actively harmful, performance-wise, because at the level of applications (rather than the concern of a single CPU), you want to parallelise along other axes—particularly multiprocessing, but also vectorisation—at which scale it becomes untenable to do things speculatively or dynamically, but the implementation in your source language is (of necessity) missing all the parallelism inherent to the algorithm it implements.
Yes, some things you only know at runtime in the CPU. But I’d still be surprised if the CPU doesn’t have to recover info that the compiler already has computed.
And ditto for C being mostly unable to let you encode data dependencies / partial order, and instead forcing a total order.
I’m sure someone has tried to measure this at some point – how much parallelism in an x86 or C program is static vs. how much is dynamic. I’d be very interested in that analysis.
I have worked on more than one EDGE architecture in the last few years. The concept is beautiful, but there were sufficient problems that neither ended up shipping (which means I probably can’t say much about them in public, unfortunately: Doug Burger gave an ISCA keynote about one of them a few years back, so at least its existence isn’t secret). The problems come down to modelling conditional control flow. In C code, there’s an average of one branch every seven instructions. An efficient encoding needs to handle dataflow between basic blocks well. If you handle dataflow within and between basic blocks with different mechanisms then you end up needing more hardware than just doing register rename.
It is kinda, but one lesson of the past couple of decades has been that compact code often wins over orthogonal / “makes sense”, because bigger code blows out the cache and runs into fetch bottlenecks. Pretending you have fewer registers than you do saves bits over making the true number of registers addressable, even counting the movs that would otherwise be unnecessary. And pretending everything is linear and recovering the parallelism in hardware saves bits over actually encoding the parallelism.
CPUs definitely have compiler-like functionality built in now, but the first line of the post is more accurate:
Strictly speaking, a CPU is an interpreter
In every sense of the word, a CPU is an interpreter of machine code. That’s why operational semantics is one of the most successful ways of describing a language’s semantics; it’s sometimes even referred to as defining the semantics via an interpreter.
Starting with x86, we might argue that CPUs are JIT compilers from a register machine to a RAM machine; operations on virtual registers are transformed into commands for a memory controller.
Sure, but the RAM machine has to execute instructions at some point, and that “execution” is interpretation.
“Strictly speaking, a CPU is an interpreter” is a part that would have blown my mind growing up with in-order CPUs. It appears to be a fundamental thing, not just a historical quirk: processors rely on a JIT-like process of observing actual execution, rather than an up-front analysis of the code itself.
I’m almost sure that if you’d asked me to predict what a highly parallel processor would look like before I knew anything about OoO, I’d’ve imagined something closer to Itanium: a compiler analyzes code a ton to find what instructions can run together. Maybe I’d’ve speculated about some other approach (SIMD?), but definitely not today’s implied, dynamically-figured-out parallelism within one stream of instructions.
And then it turns out there are tons of performance challenges that are hard to solve by up-front analysis but feasible to tackle by watching execution (including values and the paths taken through the code), often making a guess and being able to recover when it turns out you were wrong:
To figure out if two instructions with memory references depend on each other, you might need to know if two pointer values turned out to be the same; that might be very hard to figure out up front, but you can observe whether the pointers are the same if you’re deciding parallelism at runtime (see the short sketch after this list).
Bounds checking involves a ton of branches that are almost never (or always) taken, but that completely explode your security model if you don’t have them; branch prediction, even versions of it much simpler than you’d find in a fast chip today, makes that relatively cheap by just watching if the branch is taken a lot.
Indirect branches, needed for any sort of dynamic language where you can call obj.foo() on any kind of obj, would normally have to stall the processor ’til it can be completely sure where that pointer will lead and what instructions it will run; indirect branch prediction can often help a lot.
It’s not OoO-specific, but even cache hierarchies have a little bit of this dynamic flavor: deal with the slowness of main memory, without complicating the code with explicit data movement between types of RAM, by observing patterns code typically follows.
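On the first point about memory references, a minimal sketch of the kind of pair the hardware has to disambiguate (the registers are arbitrary):

    mov [rdi], rax     ; store through one pointer
    mov rbx, [rsi]     ; a later load through another pointer: it can only safely run
                       ; ahead of the store if rsi != rdi, which may only be knowable at runtime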
I agree, and had some conversations years ago to the effect of “the CPU has a JIT compiler” (or at least Intel CPUs do)
I think what gave me that viewpoint was watching some talks about the Mill CPU architecture many years ago (an ambitious clean slate project which seems to have run out of steam)
https://www.youtube.com/channel/UCKdGg6hZoUYnjyRUb08Kjbg
As far as I remember, the observation was that most of the power on the chip is taken up by the “JIT compiler” in the CPU. It’s not executing your instructions, it’s all this “meta” work of figuring out what to execute.
They wanted to drastically reduce power usage of CPUs, like 50% or 90%
(Reminds me that I saw a talk that said on Apple M1, like 99% of instructions executed are purely speculative??? Seems incredible, but makes sense, given memory latency.)
So Mill wanted to move most of that into LLVM, doing it more statically. And they had a big fork of LLVM for that reason. I think they had just as many software/toolchain engineers as CPU engineers, or more.
In other words, the project was straddling the compiler / CPU boundary and trying to rewrite it.
I really appreciate clean slate ideas like that … Every few decades, it might be a good idea to clean up a lot of accumulated crap.
But I’m not surprised that it took decades and didn’t pan out, because of the inertia of the narrow waist.
Also I think there were some pretty big flaws with regard to compatibility … like it wasn’t a single ISA but one with a ton of parameters? They had low-end and high-end models, but you couldn’t take a binary from a low-end machine and put it on the high-end one? You had to recompile it. I don’t remember all the details.
As of two months ago, the news is that the Mill team is seeking funding as they pivot from a bootstrapped R&D firm to a productive, salary-paying startup. Rumor is that they generally needed to secure patents first, and funding second, before making any announcements.
Rumor is that they generally needed to secure patents first, and funding second, before making any announcements.
That’s not a rumor, it was explicitly stated in their videos and forums multiple times. The whole reason they announced anything was due to changes in patent law that gave them an advantage in doing so. Their “funding” model up until the patents went through had been largely sweat equity. That meant things went slow for a long time, but it maximized their bargaining position later.