Krste is a strong believer in macro-op fusion but I remain unconvinced. It requires extra decode logic (power and complexity), more i-cache space (power), trace caches if you want to keep the fusion off the hot path in loops (power and complexity), and weird performance anomalies when the macro-ops span a fetch granule so the fusion doesn’t happen (software pain). And, in exchange for all of this, you get something that you could have got for free from a well-designed instruction set.
Why is more instruction cache needed? Is this lumping in the uop cache with L1i, or…?

Imagine you need to clear the upper word of a variable (var &= 0xFFFFFFFF). This is very common when doing arithmetic on types narrower than the register width (e.g. 32-bit values on a 64-bit machine). On RISC-V it is done with two instructions: a shift left followed by a logical shift right. For example, consider a 16-bit unsigned addition x8 += x9 (both registers holding zero-extended 16-bit values) on RV64:
add  x8, x8, x9    # 64-bit add
slli x8, x8, 48    # move the low 16 bits to the top of the register
srli x8, x8, 48    # logical shift back down, zeroing the upper 48 bits
In the base instruction set, each instruction would require 4 bytes, for a total of 12 bytes. Of course, the compressed instruction set allows these instructions to be encoded as 2 bytes each, for a total of 6 bytes. As a point of comparison, on x86 this would be ADD ax, bx, which is encoded as 3 bytes.
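To ground where this pattern comes from, here is a minimal C sketch (the function name add_u16 is made up); the comment about what a compiler emits is an expectation for RV64 without the bit-manipulation extensions, not a guarantee:

#include <stdint.h>

// 16-bit unsigned addition must wrap modulo 2^16. On RV64 without a
// dedicated zero-extend instruction, compilers typically emit the
// add / slli 48 / srli 48 sequence shown above to do the masking.
uint16_t add_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(a + b);
}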
Separate from the encoding-size issue is that, on a typical implementation, these instructions will take at least 3 cycles to complete. They cannot be reordered or executed in parallel, because each instruction depends on the result of the previous one. One way around this is to add a dedicated instruction that zero-extends the low 16 bits of a register. However, the RISC-V people decided against doing this. Instead, they suggest that CPUs implement macro-op fusion: the instruction decoder has to recognize that a left shift followed by a right shift by the same amount can be executed as a single operation. This is tricky to get right (and no one has done it in silicon AFAIK). It also introduces a lot of complexity to optimization, similar to in-order superscalar cores (see e.g. the Pentium 1 section of this guide).
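To illustrate what that recognition involves, here is a rough C sketch of the check a decoder would need for a pair of adjacent uncompressed instructions; the field offsets follow the base RISC-V encoding, but the function names are made up and a real implementation would also have to handle the compressed forms and the case where the pair spans a fetch boundary:

#include <stdbool.h>
#include <stdint.h>

// True if insn is an RV64 SLLI/SRLI (OP-IMM opcode, funct3 selects the
// shift, upper immediate bits zero so it is not SRAI).
static bool is_shift_imm(uint32_t insn, uint32_t funct3) {
    return (insn & 0x7f) == 0x13 &&
           ((insn >> 12) & 0x7) == funct3 &&
           ((insn >> 26) & 0x3f) == 0;
}

// Does "slli rd, rs1, N ; srli rd, rd, N" form a fusible zero-extension?
bool fusible_zext_pair(uint32_t first, uint32_t second) {
    if (!is_shift_imm(first, 0x1) || !is_shift_imm(second, 0x5))
        return false;                               // SLLI then SRLI
    uint32_t rd1 = (first >> 7) & 0x1f;
    uint32_t rd2 = (second >> 7) & 0x1f;
    uint32_t rs2 = (second >> 15) & 0x1f;           // rs1 field of the SRLI
    uint32_t sh1 = (first >> 20) & 0x3f;
    uint32_t sh2 = (second >> 20) & 0x3f;
    return rd1 == rd2 && rs2 == rd1 && sh1 == sh2;  // same register, same shift
}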
However, there is also the problem that, while the add/shift/shift sequence could execute in 2 cycles with macro-op fusion, it still takes 6 bytes to encode. This takes up extra space in the instruction cache compared to a dedicated zero-extension instruction (4 bytes) or a more compact/irregular instruction set like x86 (3 bytes).
(A way around this could be to cache the decoded instructions. This is not an uncommon technique, but decoded instructions are usually larger than encoded instructions. So it’s unlikely that this can be used to improve instruction cache density.)
Thanks for the elaboration. I have been enlightened. :)
So that particular sequence of instructions would be longer, but assuming that macro-op fusion can be made to work well, is it possible to increase the code density this way overall? For example, although there is not a specific instruction to clear the upper bits, since there are fewer instructions to encode it potentially allows more common instructions to be shorter?
So that particular sequence of instructions would be longer, but assuming that macro-op fusion can be made to work well, is it possible to increase the code density this way overall?
Yes (hopefully). In general, the RISC-V authors have chosen to make RISC-V very RISC-y compared to other RISC ISAs. This manifests as a small number of instructions, three-operand forms for everything, and a very regular encoding in the base ISA. This has been criticized as wasting encoding space in the base ISA, though I don’t have a link to such criticism on hand.
For example, although there is not a specific instruction to clear the upper bits, since there are fewer instructions to encode it potentially allows more common instructions to be shorter?
This is effectively what compressed instructions are for.
The paper linked in the article appears to show RV64GC, the compressed variant of RV64G, results in smaller program sizes than x86_64. If that’s true, wouldn’t that mean you would need less i-cache space? This isn’t my area of expertise, but I find it fascinating.
There are a lot of variables here. One is the input corpus. As I recall, that particular paper evaluated almost exclusively C code. The generated code for C++ will use a slightly different instruction mix, and for other languages the difference is even greater. To give a concrete example, C and C++ do not have (in the standard) any checking for integer overflow: arithmetic either wraps for unsigned types or is undefined for signed ones. This means that a+b on any C integer type up to [u]int64_t is a single RISC-V instruction. A lot of other languages (including Rust, I believe, and the implementations of most dynamic languages) depend on overflow-checked arithmetic on their fast paths. With Arm or x86 (32- or 64-bit variants), the add instructions set a condition code that you can then branch on, accumulate in a GPR, or use in a conditional move instruction. If you want a good fast path, you accumulate the condition code after each arithmetic op in the hot path and then branch at the end, hitting a slow path if any of the calculations overflowed. This is very dense on x86 or Arm.
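For concreteness, here is a hedged C sketch of that “accumulate, then branch once” pattern, using the GCC/Clang __builtin_add_overflow builtin instead of hand-written assembly (the function name and signature are illustrative):

#include <stdbool.h>
#include <stdint.h>

// Compute a + b + c with overflow checking. Each __builtin_add_overflow
// returns an overflow flag; the flags are OR-ed together so the hot path
// has a single, well-predicted branch at the end rather than one per add.
bool checked_sum3(int64_t a, int64_t b, int64_t c, int64_t *out) {
    int64_t t;
    bool ovf = __builtin_add_overflow(a, b, &t);
    ovf |= __builtin_add_overflow(t, c, out);
    return !ovf;   // false: caller takes the slow path
}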
RISC-V does not have condition codes. This is great for microarchitects. Condition code registers are somewhat painful because they’re an implicit data dependency from any arithmetic instruction to a load of others. In spite of this, Arm kept them with AArch64 (though dramatically reduced the number of predicated instructions, to simplify the microarchitecture) because they did a lot of measurement and found that a carefully optimised compiler made significant use of them.
RISC-V also doesn’t have a conditional move instruction. Krste likes to cite a paper by the Alpha authors regretting their choice of a conditional move, because it required one extra read port on the register file. These days, conditional moves are typically folded into the register rename engine and so are quite cheap in the microarchitecture of anything doing a non-trivial amount of register rename (they’re just an update in the rename directory telling subsequent instructions which value to use). Compilers have become really good at if-conversion, turning small if blocks into a path that does both versions and selects the result. This is so common that LLVM has a select instruction in the IR. To do the equivalent on RISC-V, you need logic in decode that recognises a small forward branch and converts it into a predicated sequence. That’s a lot more difficult to do than simply having a conditional move instruction, and the branchy code is also less dense.
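As a tiny, generic example of the kind of branch that if-conversion removes (not code from the article):

#include <stdint.h>

// Compilers routinely if-convert this: it becomes a cmov on x86-64, a
// csel on AArch64, and a select in LLVM IR. On base RISC-V (with no
// conditional move) it is a branch or a short mask-and-arithmetic trick.
int64_t clamp_at_zero(int64_t x) {
    return x < 0 ? 0 : x;
}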
I had a student try adding a conditional move to a small RISC-V processor a few years ago and they reproduced the result that Arm used in making this decision: Without conditional moves, you need roughly four times as much branch predictor state to get the same overall performance.
Note, also, that these results predate any vector extensions for RISC-V. They are not comparing autovectorised code with SSE, AVX, Neon, or SVE. RISC-V has used up all of its 16-bit and most of its 32-bit instruction space and so can’t add the instructions that other architectures have introduced to improve code density without going into the larger 48-bit encoding space.
The paper linked in the article appears to show RV64GC, the compressed variant of RV64G, results in smaller program sizes than x86_64
x86-64 has pretty large binary sizes; if your compressed instruction set doesn’t produce smaller binaries than that, it should be redesigned.
I would take the measurements in that paper with a grain of salt; they aren’t comparing like-to-like.
The Cortex-A15 core benchmarks, for example, should also have been run in Thumb-2 mode.
Thumb-2 causes substantial reductions in code size; it’s dubious to compare your compressed ISA to a competitor’s uncompressed ISA.
This paper has some size comparisons against Thumb (along with Huawei’s custom extensions). This page has some as well.