1. 25
    1. 18

      32 registers is an odd thing to focus on. In the analysis that Arm did, there was no difference between 16 and 32 for most code, but some crypto routines benefitted a lot from 32. Now that crypto is either done in fixed-function units or in vector registers, I’m less convinced by that argument. In addition, there is a large cost to 32 registers: every three-operand instruction costs 3 bits more, which is a big size increase in a 32-bit encoding. You can fit 8 times as many instructions in the same encoding space with 16 registers.

      3-address encoding is also interesting because around 2/3 of values are consumed by a single instruction. This was one of the motivations behind EDGE architectures. That implies that two-operand (destructive) versions would work most of the time and that a small encoding for a register-to-register copy might be a bigger win. This is especially true on microarchitectures with register rename.
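
      A small source-level illustration of the single-consumer case (only a sketch of what the compiler sees):

      ```c
      // t is used exactly once, so a destructive two-operand encoding can overwrite
      // one of its inputs and no extra copy is needed; the same holds for most
      // temporaries in straight-line code.
      int dot2(int a, int b, int c, int d) {
          int t = a * b;      // single consumer: the add below
          return t + c * d;   // a 2-operand ISA can compute this as a *= b; c *= d; a += c;
      }
      ```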

      On the encoding, it’s worth noting that AArch64 optimises for very different microarchitectures than AArch32 and RISC-V. It is very much a big-core ISA: it is easy to make efficient on out-of-order processors with register rename. The key here is to minimise the number of short-lived temporary values that, in the best case, need to be forwarded between pipelines and, in the worst case, need to be kept live across basic blocks until an instruction that invalidates them has moved out of speculation. Things like rich addressing modes are really important here. The goal is to maximise the work done per instruction, not to minimise the encoding space for a single instruction. RISC-V optimises the other way, which is why it’s a good fit for microcontrollers but not as good for laptop- or server-class systems.
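
      As a concrete (and hedged: exact compiler output varies) illustration of the work-per-instruction point, indexed array access:

      ```c
      #include <stdint.h>

      // A rich addressing mode folds the scale-and-add into the load itself.
      int32_t load_elem(const int32_t *base, uint64_t i) {
          return base[i];
          // AArch64 can do this in one instruction:
          //     ldr w0, [x0, x1, lsl #2]
          // A minimal load/store encoding such as RV64I needs the address arithmetic spelled out:
          //     slli a1, a1, 2
          //     add  a0, a0, a1
          //     lw   a0, 0(a0)
          // The extra instructions are exactly the short-lived temporaries described above.
      }
      ```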

      Weak memory models are not about simpler caches, they’re about simpler speculation. In a TSO architecture, you have to keep things in store queues for a lot longer because more things that happened on other cores can invalidate your memory access.
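
      A standard message-passing sketch shows where that ordering burden lands: on a weak model the explicit release/acquire pair is what constrains reordering, while TSO effectively provides that ordering for every store (and the core has to hold stores in its queues longer to preserve it):

      ```c
      #include <stdatomic.h>
      #include <stdbool.h>

      int payload;
      atomic_bool ready = false;

      void produce(int v) {
          payload = v;
          // Release store: on a weakly ordered core this is the only thing stopping
          // the payload write from being reordered past the flag write.
          atomic_store_explicit(&ready, true, memory_order_release);
      }

      int consume(void) {
          // Acquire load pairs with the release store above.
          while (!atomic_load_explicit(&ready, memory_order_acquire))
              ;   // spin until the flag is published
          return payload;
      }
      ```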

      The ‘reduced’ thing is more about orthogonality than absolute small size. Early RISC chips had a small number of opcodes because transistors were expensive, but the later ones kept the core idea that there should be one way of doing a thing.

      Flags are still there on everything except RISC-V, and even there they exist in the floating-point ISA. They are annoying to implement but they’re a huge win for software. Compilers are really good at if-conversion now, and an architecture with conditional moves can match the performance of one without them while using less than half as much branch predictor state. The saving in the branch predictor more than offsets the cost of flags. They’re also essential for any kind of constant-time code to be fast, which is increasingly important with modern systems and side channels.
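
      For the constant-time point, the usual idiom is a branch-free select; with flags and a conditional select the compiler can lower it to a single data-dependent instruction (a sketch — whether a given compiler emits csel/cmov depends on the target and optimisation level):

      ```c
      #include <stdint.h>

      // Returns a if pick_a is non-zero, otherwise b, without a data-dependent branch.
      uint32_t ct_select(uint32_t a, uint32_t b, uint32_t pick_a) {
          uint32_t mask = (uint32_t)0 - (uint32_t)(pick_a != 0); // all-ones or all-zeros
          return b ^ ((a ^ b) & mask);
      }
      ```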

      The MIPS branch delay slot was still a win with longer single-issue pipelines because it gave an extra cycle to get the branch predictor result, but aside from that I agree that it’s not a great idea. Some DSPs have a lot more delay slots. It’s easy to build a pipeline that fetches and executes 8 instructions per cycle if you have 7 delay slots. It’s hard to target for arbitrary code but great for a load of CODEC-like things.

      Register windows remain great if you have a SPARC-like circular view of them and a microarchitecture that does asynchronous background spilling. Oh, and a memory model weak enough that nothing on the spill stack is guaranteed visible to anything other than the spill / reload logic without an explicit barrier.

      Loading or storing multiple values is still a win and modern vector ISAs all now have scatter/gather loads. The AArch32 version had some exciting consequences if you took a fault in the middle, which caused some pain. The other thing here that the article misses that made this painful: PC was a GPR. This was a huge mistake because you needed to decode enough of the instruction to know its destination register to know whether it would be a jump.

      RISC-V is the only ISA that has a link register and doesn’t special case it and this was a huge mistake. The JAL and JALR instructions are used almost exclusively with either the zero register or the link register as targets. RISC-V wastes a full 5-bit target operand on these (including in their 16-bit C extension variants). This is a massive waste, given that you can always materialise the target address in another register and do a normal jump for the few places where you want something else. It’s sometimes useful for dynamic patching, but that’s not a use case you should optimise your ISA for. RISC-V does use the flexibility for its outlined register spill / reload functions, but that is mostly necessary because they wasted too much of the encoding space to have efficient prologues. There’s a proposal for a 16-bit spill-multiple instruction, which will fix this.

      Treating the stack pointer as special is a good idea because the set of common operations that apply to the stack pointer is different from other things. There are also some store-forwarding optimisations that are easier if it’s special.

      Having separate floating-point registers avoids wires. It also lets you do different register rename for integer and floating-point workloads. It’s common for the live set of values in FPU code to fit in the architectural registers, so you can get away with a lot less rename (sometimes with none) without it hurting performance too much.

      I’ve written about SVE and the RISC-V vector extension before. SVE is designed as a target for autovectorisation. This is a key point for RISC ISAs: they are designed around known-possible compiler optimisations. The crucial things for SVE are that you should be able to represent any loop as a vector (good for the compiler), every memory op is explicit (alias analysis is vital for both the compiler and the core) and there is no state anywhere other than explicit registers (so interrupts let the OS preserve everything without re-executing any of the code). I think the V extension is better now but the last draft I read did badly on several of these points.
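
      A minimal example of the ‘any loop as a vector’ property (a sketch: with a recent GCC or Clang and something like -O3 -march=armv8-a+sve, the loop below becomes a whilelo-predicated vector loop with no scalar tail; the exact flags and output are illustrative):

      ```c
      #include <stddef.h>

      // Predicated, length-agnostic vectorisation handles any n, and the restrict
      // qualifiers give the compiler the alias information it needs.
      void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
          for (size_t i = 0; i < n; i++)
              y[i] = a * x[i] + y[i];
      }
      ```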

      1. 3

        Dang, thanks for the painfully complete response!

        In the analysis that Arm did, there was no difference between 16 and 32 for most code…

        Oooh, this sounds like a good thing for me to read. Have any links or references?

        3-address encoding is also interesting because around 2/3 of values are consumed by a single instruction…

        …and indeed, the compressed instruction sets for RISC-V and Thumb both use 2-address codes almost exclusively. Should I mention that more prominently maybe? Though again, if you have any references that have hard and fast numbers I would love to hear about them; for me most of this comes from compiler textbooks rather than inspection of real programs.

        …but the later ones kept the core idea that there should be one way of doing a thing.

        …except for all the acceleration instructions, the multiple addressing modes, the majillion bit-shuffling variations, the push/pop instructions, etc? I can’t call you wrong but I’m not yet convinced. Once you get out of the core instruction sets there’s always tons of special-case things, sometimes bringing their own instruction encodings when they’re important enough. Is the difference from older CISC instruction sets the separation between “this is a small, consistent and non-redundant core” and “these are all the extra things that you can do for extra performance”?

        RISC-V is the only ISA that has a link register and doesn’t special case it and this was a huge mistake.

        Heck, I thought that it wasn’t a special case on ARM32 but now that I double check, bl always uses the link register instead of taking it as an argument. You’re right, thank you for the correction.

        Treating the stack pointer as special is a good idea because the set of common operations that apply to the stack pointer is different from other things.

        Are they? I’d think most of them are just register-relative load/stores, same as any other pointer?

        I’ve written about SVE and the RISC-V vector extension before.

        Can I bother you for a link? This is an area I’d love to learn more about in practice. I tried googling it but can’t find anything like that connected to your name. :-/

        1. 1

          Oooh, this sounds like a good thing for me to read. Have any links or references?

          I don’t think they published it (they did a lot of analysis internally in designing AArch64, this is just one of the anecdotes I’ve heard). You can reproduce this fairly easily though: modify the LLVM back end to use only half of the registers (a 5-line patch) and see what it does. Don’t make the mistake Andrew did in his dissertation and just trust the static numbers though; to do this properly you want to dynamically instrument the program to record the CFG and then count the total number of spills and reloads across the CFG (or, ideally, execute the result, but applying the static counts of the new version to a CFG from the old one is approximately equivalent).
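
          The counting step at the end is just a weighted sum over basic blocks; a minimal sketch (the types and field names are invented for illustration):

          ```c
          #include <stddef.h>

          /* Per-basic-block data: static spill/reload counts from the modified back end,
           * execution counts from instrumenting the original program. */
          typedef struct {
              unsigned spills;
              unsigned reloads;
              unsigned long long exec_count;
          } block_profile;

          /* Dynamic cost estimate: static counts of the new code weighted by how often
           * each block actually ran. */
          unsigned long long dynamic_spill_cost(const block_profile *blocks, size_t nblocks) {
              unsigned long long total = 0;
              for (size_t i = 0; i < nblocks; i++)
                  total += (unsigned long long)(blocks[i].spills + blocks[i].reloads)
                           * blocks[i].exec_count;
              return total;
          }
          ```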

          Though again, if you have any references that have hard and fast numbers I would love to hear about them

          This is also fairly easy to measure yourself. I did some of this analysis working on an internal (cancelled) clean-slate ISA design project. We explored EDGE architectures (if you can, look at Doug Berger’s keynote at ISCA about E2 from a few years back), which are promising in theory because they avoid register rename by making targets explicit. It turns out that they’re fantastic for about 2/3 of instructions but the distribution is as important as the count and you end up having to build all of that painful rename logic for values propagated between basic blocks.

          …except for all the acceleration instructions, the multiple addressing modes, the majillion bit-shuffling variations, the push/pop instructions, etc? I can’t call you wrong but I’m not yet convinced

          You’re right, that was an oversimplification. The key goal is that there should not be two instruction sequences of the same length, with no shorter version, that achieve the same thing. To give a trivial example, any instruction with an immediate operand can always be replaced with one that takes a register plus another instruction or two that materialise the constant, but now you have a longer sequence. Things like the bitfield extract and insert instructions replace a longer shift-and-mask sequence. In older CISC instruction sets, there were many equivalent-length instruction sequences for doing the same thing, which made it hard for compiler writers to know which they should pick.
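
          The bitfield case in concrete form (the assembly in the comments is a sketch of typical output, not a guarantee):

          ```c
          #include <stdint.h>

          // Extract a 6-bit field starting at bit 5.
          uint32_t extract_field(uint32_t x) {
              return (x >> 5) & 0x3F;
              // AArch64 has a single-instruction form:   ubfx w0, w0, #5, #6
              // The shift-and-mask equivalent (lsr + and) is strictly longer, so there is
              // still only one shortest way to write it.
          }
          ```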

          It’s worth noting that Arm no longer uses the term RISC; they use ‘load-store architecture’, which is far less ambiguous (though still not quite true now that AArch64 has atomic read-modify-write instructions).

          Are they? I’d think most of them are just register-relative load/stores, same as any other pointer?

          Take a look at the disassembly for any program (for any ISA) and count the instructions that use an addressing mode with two registers to calculate the address for a load or store. Now count how many of those use the stack pointer as one of the operands. If the number that use the stack pointer is more than zero, your program has some huge stack frames (not including on-stack arrays, which are not addressed by the stack pointer directly; instead the base of the array is moved to another register with a stack-pointer-relative add or subtract and then used from there). Now, in the other direction, look at the loads and stores that use immediate addressing. These will be quite similar, but the stack-pointer-relative ones will almost always be able to take advantage of a scaled (shifted) immediate because stack spill slots are register-width aligned.

          In addition, about the only non-memory operations that use the stack pointer are add and subtract (for creating / destroying stack frames) and occasional masking (to realign the stack).

          If your ISA has push- and pop-like operations (including the pre- and post-increment stores and loads that both Arm ISAs have), you can cheat in the front end by tracking the stack pointer displacement in decode: instead of allocating a rename register for each intermediate value of the stack pointer, you fold those adjustments into an accumulated immediate displacement. Then you issue a stack-pointer writeback before any branch where you’ve collected a displacement. This reduces rename-register pressure quite a lot.
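
          A rough sketch of that decode-time trick (the structures and hook points are hypothetical, just to show the shape of it):

          ```c
          #include <stdint.h>

          typedef struct {
              int64_t pending_sp_delta;   // SP displacement accumulated since the last real writeback
          } decode_state;

          // Called for each decoded instruction that only adjusts SP by an immediate
          // (push/pop, pre/post-increment forms): absorb it, allocate no rename register.
          static void absorb_sp_adjust(decode_state *s, int64_t imm) {
              s->pending_sp_delta += imm;
          }

          // Called before a branch (or anything that reads SP as an ordinary operand):
          // emit one writeback covering everything accumulated so far.
          static void flush_sp_delta(decode_state *s, void (*emit_sp_writeback)(int64_t delta)) {
              if (s->pending_sp_delta != 0) {
                  emit_sp_writeback(s->pending_sp_delta);
                  s->pending_sp_delta = 0;
              }
          }
          ```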

          Can I bother you for a link? This is an area I’d love to learn more about in practice. I tried googling it but can’t find anything like that connected to your name. :-/

          Try adding site:lobste.rs to your query.

          1. 1

            Belated thanks again, now that I have the chance to come back to this and do some more research. Not sure I have the patience to dig dynamic execution benchmark stats out of useful test programs, but for anyone who looks at this in the future, I found a couple papers on the 16-vs-32 register question that are pretty interesting. They’re a bit long in the tooth now and I’m curious if an extra 20 years of compiler tech has made much difference, but from the sound of it, probably not a huge one.

            if you can, look at Doug Berger’s keynote at ISCA about E2 from a few years back…

            Found the abstract but no video anywhere, apparently. Very sad. :C I’ll keep digging for MS research on EDGE/E2 stuff though, there must be more out there.

            Try adding site:lobste.rs to your query.

            Still couldn’t find much, annoyingly, until I stopped trying to filter by user and just searched lobsters for articles mentioning RISC-V. Apparently you have some Opinions about it. ;-) Pretty interesting ones though.

    2. 13

      This statement bothers me a bit in a treatise on RISC generally: “Things I don’t know much about: PPC/POWER, SPARC, PA-RISC”

      PA-RISC hasn’t had market relevance in decades even though I have a great fondness for it, but SPARCs are still made, and although my pro-Power bigotry is well-known there are still lots of Power cores out there in the embedded space alongside many big systems from IBM and others. Those two architectures should really be a bigger part of this article if for no other purpose than to add more points of comparison.

      1. 3

        I only have so much time. ;-) You obviously know a lot more about them than I do, what interesting features do they have that significantly weigh for or against modern trends?

        1. 6

          SPARC had register windows. The first 8 were global (always visible) and the next 24 would shift by (if I recall correctly) 8 for each call/return. You have 8 for output parameters, 8 for input, and 8 for locals.

          1. 10

            SPARC’s register windows were a missed opportunity. All except the very last generation spilled synchronously. This meant that you hit sudden jitter when you ran out of windows. The ISA permitted the spill to be asynchronous, which meant that the core could write them out in the background when the store unit wasn’t busy. Other microarchitectures do the same thing with out-of-order execution and long store queues, at the cost of a lot more power and area.

            1. 2

              I mentioned SPARC in the section on register windows, but I wasn’t aware of the potential for it to do the register spilling in the background. On the flip side… if you do have out-of-order execution and long store queues, then it helps with data latencies outside of function calls as well, so that’s more useful than asynchronous register window spills, and having a bit of both would cost more die space? It feels like a tradeoff either way; I don’t know how it would compare in reality.

              1. 1

                Not sure in general, but on Morello we’ve seen that function prologues seem to place the most stress on store queues. The logic for asynchronous spilling of register windows is a lot simpler than arbitrary store queueing and so I would expect a total area/power win, but I’ve never done serious benchmarks to validate this, so it’s intuition and not science.

        2. 4

          Others have commented on SPARC register windows, so I’ll be the Power shill (as usual): use of SPRs for things like the link register and branching instead of GPRs, FPU required until much later, r0 as 0 but only sometimes (led to a joke instruction called “mscdfr0”), the condition register (very handy for multiple comparisons), and VMX/AltiVec as a fairly powerful SIMD but with alignment restrictions (much improved with VSX).

          1. 2

            Thanks, the SPR vs GPR question gets more complicated with that info and I’m no longer sure if there’s really a trend one way or another. The multiple condition registers are new to me and sound like a pretty interesting variation on the “dedicated flags register or not?” question, I’ll have to dig into that more!

            1. 2

              There’s another wrinkle with the link register, which means that it ends up being special on RISC-V, even though they try to pretend it isn’t: return-address prediction. The vast majority of computed branches (branch to a register value, rather than to a fixed location) are return instructions. Except in very special circumstances, the address of a return will be immediately after the call[1]. It makes sense to special case this in the branch predictor. This was really apparent to me when I tried out a new exercise for the Advanced Computer Design course in the Cambridge MPhil programme, which required you to get >70% hit rate for branches with some small example workloads: I added a 3-deep return-address predictor for indirect branches (this was on MIPS, so you could do the prediction after decode) and got over 99%.
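
              For anyone curious, the predictor itself is tiny; a minimal sketch of a small return-address stack (the depth and structure are illustrative, not the exact exercise code):

              ```c
              #include <stdbool.h>
              #include <stdint.h>

              #define RAS_DEPTH 3

              typedef struct {
                  uint64_t stack[RAS_DEPTH];
                  unsigned top;   // total pushes; entries older than RAS_DEPTH get overwritten
              } ret_addr_stack;

              // On a call: remember the address of the instruction after the call.
              static void ras_on_call(ret_addr_stack *r, uint64_t call_pc, unsigned insn_bytes) {
                  r->stack[r->top % RAS_DEPTH] = call_pc + insn_bytes;
                  r->top++;
              }

              // On a return: pop the most recent entry, or report no prediction if empty.
              static uint64_t ras_predict_return(ret_addr_stack *r, bool *valid) {
                  if (r->top == 0) {
                      *valid = false;
                      return 0;
                  }
                  r->top--;
                  *valid = true;
                  return r->stack[r->top % RAS_DEPTH];
              }
              ```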

              All of this requires that you can spot call and return instructions. On something like x86, they’re explicit. On most RISC architectures, call is some variant of jump-and-link[2] but return is just a jump-register. If you have a dedicated link register then it’s always a branch to the link register. If you use the link register for a jump that isn’t a return then you will confuse the branch predictor and performance will suffer. If you use a different register for your return, the same is true.

              [1] Except on SPARC, where any function that returns an on-stack structure is expected to add 4 to the link register value and skip over an illegal instruction, to catch the most dangerous forms of calling convention mismatch.

              [2] I believe the only time I have ever disagreed with Dijkstra was when he declared that a special instruction for subroutine calls that handled saving the return address was premature optimisation that an ISA should omit.

            2. 1

              Well, it’s one register internally, but it has eight independent fields which can be set separately (or not at all), and there are intraregister logicals like and, or, not and xor for compound conditions. It’s peculiar to Power but very useful.

              1. 1

                Itanium also had multiple condition-code registers. I strongly suspect that modern POWER systems don’t use a single register internally because the rename logic is far more complex for updating fields of a sub-register independently than it is for updating different registers.

                1. 1

                  Interesting about Itanium. The more I hear about it, the more I get the feeling its chief sin wasn’t entirely engineering.

                  1. 2

                    The chief sin was not talking to a compiler team before finalising the ISA (a decision also made by RISC-V, though the fact that they copied MIPS so heavily made the consequences less bad). There were a lot of interesting ideas in Itanium, but they assumed that they could make the compiler responsible for identifying all instruction-level parallelism. Unfortunately, there’s a fairly low upper bound here: the compiler can’t parallelise between basic blocks (at least, not without massive code duplication). Within a basic block, you average 7 instructions (this heuristic has remained true since the RISC-I project measured it), and they often have some data dependency, and so Itanium ended up needing to add all of the machinery that more RISC-like systems needed for dynamically extracting ILP.

                    The most interesting VLIW architecture that I’ve seen is nVidia’s Project Denver, where the parallel pipelines are each offset by one cycle and so can consume results of the previous instruction without them going via register rename. This makes it more EDGE-like and lets the compiler compile most basic blocks to a single VLIW bundle.

                    1. 3

                      Itanium’s greatest achievement was in killing PA-RISC, Alpha and possibly MIPS on workstations and servers without being successful itself.

                      1. 1

                        Yup, though from Intel’s perspective that wasn’t a total win because it was also meant to kill x86. If AMD hadn’t introduced x86-64, it’s not clear that the alternative would have remained dead. I’m quite curious how fast an Itanium emulator on a later-generation Alpha could have run with more modern emulation techniques.

                        1. 1

                          Whether intentional or not, I think it was a viciously brilliant move to set up a situation where Itanium’s success or failure would both benefit Intel. I’m personally convinced it was mostly intentional.

                      2. 1

                        One of SGI’s CEOs canceled a generation of hardware; they were forced to switch to Itanium because it wasn’t possible to recover their product line.

    3. 5

      32 bit instructions, but also small code size

      […] This makes your instruction decoder very simple: you load 4 bytes from a word boundary and that’s one instruction, no logic involved. If you have out-of-order execution then you can load 8 or 16 bytes or whatever at a time and still need to do exactly zero work to figure out where instructions are before starting to decode them. […]

      I’m not a processor designer in any capacity but this seems to me like it doesn’t get anywhere near enough credit.[*] Intel and AMD have poured hundreds of millions into R&D to somehow widen their decoders from 3, to 4 (Zen 3/Ryzen 5000), to 6 (Golden Cove/Core i 12000) instructions wide. x86 with its 1 to 15 byte wide instructions is an absolute pain to decode. Over a year prior to that, Apple came along and knocked out a core design (M1 Firestorm) which decodes 8 instructions at a time and absolutely romps through just about any scalar IPC benchmark.

      I’ve often wondered why AMD didn’t restrict the possible instruction widths to make them easier to decode when they effectively redesigned x86 for 64 bits; I assume having near enough the same instruction encoding saved them a little die space by just adding a few tweaks in the new decoder mode.

      [*] I do irregularly hack on Qemu, and I’ve dug through its x86 instruction decoder source code…

      1. 3

        I’ve often wondered why AMD didn’t restrict the possible instruction widths to make them easier to decode when they effectively redesigned x86 for 64 bits; I assume having near enough the same instruction encoding saved them a little die space by just adding a few tweaks in the new decoder mode.

        At a guess, if the instruction decoding had been too different from x86 at the time, perhaps that would have caused the chips to be slower at running existing x86 code, which would have made amd64 dead in the water commercially?

        1. 2

          I’m not saying the instruction set should have been radically different - Intel tried that with Itanium. I’m just wondering if they could have simplified decoding in long mode by retiring some of the less frequently used encoding quirks (replacing their use with explicit instructions), and perhaps by standardising instruction lengths a bit more. Maybe even make all possible instruction encodings a multiple of 2 bytes and require 2-byte alignment, or something along those lines.

          This would be in preference to further complicating matters with the optional REX prefixes which in practice end up being required for a large number of instructions anyway. So basically turn all 1-byte instructions into 2-byte by effectively always mandating REX, except by doing so you free up the high 4 bits of REX to be used productively and shorten/simplify other instructions. Or perhaps they could have made the first instruction byte encode the length of the instruction as well as the 4 usable REX bits.
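
          For reference (and as a sketch of the existing encoding only, not the hypothetical redesign), the REX prefix is a single byte 0100WRXB, so the information being fought over here is exactly four bits:

          ```c
          #include <stdbool.h>
          #include <stdint.h>

          typedef struct { bool w, r, x, b; } rex_bits;

          // Returns true if byte is a REX prefix (0x40-0x4F in 64-bit mode) and fills in its fields.
          static bool decode_rex(uint8_t byte, rex_bits *out) {
              if ((byte & 0xF0) != 0x40)
                  return false;
              out->w = (byte >> 3) & 1;   // W: 64-bit operand size
              out->r = (byte >> 2) & 1;   // R: extends ModRM.reg
              out->x = (byte >> 1) & 1;   // X: extends SIB.index
              out->b = byte & 1;          // B: extends ModRM.rm / SIB.base / opcode register
              return true;
          }
          ```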

          But I suppose perhaps they did consider such a design internally and it worked out to produce larger code when applied to a test corpus of code, and I’m underestimating how frequently REX can be omitted and the fancy memory indexing modes are used. Or it ended up diverging too much from the 16- and 32-bit decoder modes to be able to reuse any silicon area.

      2. 2

        …it doesn’t get anywhere near enough credit.

        My impression is it gets more attention from hardware designers than from software people. ;-) I have a good friend who’s a computer engineer by training and she will rant about x86 instruction encoding if given the opportunity. If you’re using a high-level language though, or even most of the time in assembly code, it’s basically invisible.

        However, Intel tried reasonably hard to push x86 Atom chips downward into the smartphone market in the early 2010s, and never really got any traction because they could never match ARM32’s power efficiency for a given performance. I don’t have any real evidence but I always thought that the instruction set complexity had a significant amount to do with that; having to spend more die space on the instruction decoder hurts more and more the smaller your chip is. I’d love to know if anyone has more information about this.

        1. 5

          My impression is it gets more attention from hardware designers than from software people. ;-) I have a good friend who’s a computer engineer by training and she will rant about x86 instruction encoding if given the opportunity

          x86 is a pretty extreme case. One colleague of mine suggested that it’s inaccurate to call it an instruction decoder and it’s more accurately an instruction parser. There’s a lot of truth to that. Most other variable-length instruction sets have a simple mechanism for finding boundaries (either a small header in the first chunk of an instruction, or a stop-bit encoding where the first 16-bit chunk of every instruction has a 1 or a 0 in a position where the middle chunks of a longer instruction have the other value). In x86, a lot of the long instructions are composed of a chain of prefixes and you have to parse them in order.
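
          As a sketch of the ‘simple header’ case: in Thumb-2 the length of an instruction is a pure function of its first halfword, so every decode lane can compute candidate boundaries independently and a cheap pass picks the real ones (the rule below is the Thumb-2 one, used here as a representative example):

          ```c
          #include <stddef.h>
          #include <stdint.h>

          // A Thumb-2 instruction is 32 bits when the top five bits of its first halfword
          // are 0b11101, 0b11110 or 0b11111, and 16 bits otherwise.
          static size_t thumb2_insn_length(uint16_t first_halfword) {
              uint16_t top5 = first_halfword >> 11;
              return (top5 >= 0x1D) ? 4 : 2;
          }
          ```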

          AArch64 is actually a variable-length instruction set now: SVE uses prefix instructions. These are quite a neat idea: they remain 32-bit instructions, but each one modifies the behaviour of the following one. You can still find instruction boundaries easily and you can do a first decode, and if you encounter one of the special prefix values then you discard the decoding of the second instruction and do it again. At worst, this means decode is two cycles, and decoding an 8-instruction bundle can happen in parallel (in the worst case, becoming a 4-instruction bundle on the way out, but when they’re SVE instructions the decode overhead is trivial in comparison to the very wide vector operations).