1. 62
  1.  

  2. 12

    People still call ARM a RISC architecture because RISC is about timing and load/store, not about instruction count; the idea that it’s about instruction count is a common misconception.

    1. 3

      Timing isn’t that much of an issue anymore I think, now that there are out-of-order superscalar implementations of x86, ARM, PPC, and even m68k (remember m68k?).

      RISCs only became successful when those chips started to have pipelining (see: MIPS, ARM2); otherwise there would be too much decoding overhead compared to architectures that have e.g. specific instructions for loops (rep and loop(cc) in x86, djnz in Z80, bne with pre-decrement addressing mode in 68k, …).

      1. 2

        Perhaps also instruction format (predictable forms, fixed-length) though this isn’t guaranteed; ARM has Thumb, PPC has VLE, and ROMP was dual-length.

      2. 18

        People still call ARM a “RISC” architecture despite ARMv8.3-A adding a FJCVTZS instruction, which is “Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero”. Reduced instruction set, my ass.

        This seems a bit like a “no true Scotsman” to me. Just because ARM has a bunch of instructions that solve specific corner cases doesn’t mean it’s no longer “reduced”. It still has far fewer instructions than the dominant CISC architecture (by an order of magnitude). Reduced is not the same as “absolute minimum”.

        “Pure ideas” like “only use the bare minimum instruction set” rarely work well in practice, hence FJCVTZS and other exceptions, which is perfectly reasonable, and doesn’t make ARM “not RISC”.

        1. 16

          Just because ARM has a bunch of instructions that solve specific corner cases doesn’t mean it’s no longer “reduced”. It still has far fewer instructions than the dominant CISC architecture (by an order of magnitude). Reduced is not the same as “absolute minimum”.

          I kinda agree in a theoretical sense, but I think that redefines what RISC means really heavily. When RISC was first invented, it really meant three things: load/store architecture with lots of registers, all instructions the same length, and…well, it did effectively mean a minimalist, if not minimal, instruction set. Some CPUs carried that a lot further than others (e.g. SPARC originally didn’t even have multiply and divide instructions, many didn’t allow unaligned loads, etc.), but all had markedly smaller instruction sets than their contemporary CISC counterparts, frequently from a mixture of fewer addressing modes and genuinely cutting out stuff.

          Modern ARM might have fewer instructions in practice than AMD64, which is the comparison I assume you’re making with “the dominant CISC architecture”. I’m not sure, to be honest, because things like NEON and FJCVTZS add tons to the ARM side, and AMD64 cuts out whole subsystems when running in 64-bit mode. But modern ARM definitely has more instructions than the 68k and x86 that were around when these terms were hatched. Hell, I can definitely spit out the entire 68020 instruction set from memory, and probably get close on the 80486 if we exclude the FPU, but there’s no way I can regurgitate the entire current ARM instruction set. It’s freaking huge. The only things I see RISC-like in the classical definition about a modern ARM CPU is the fact that all the instructions are the same length and there are plenty of registers. You can convince me that merely having something like NEON can still be a RISC CPU, but FJCVTZS definitely doesn’t fit the classical definition of RISC.

          And while I wouldn’t phrase things quite the way the article did, I think it’s true: what RISC ultimately brought to the table was a reset of a bunch of classical CISC instruction sets that allowed it to run a lot faster. Most modern RISC CPUs (including e.g. PowerPC) have tons and tons and tons of instructions. And that’s fine. The right answer ended up being more in the middle than initially anticipated.

          1. 6

            The only things I see RISC-like in the classical definition about a modern ARM CPU is the fact that all the instructions are the same length and there are plenty of registers

            uhh, how about the fact that it’s a load-store architecture, instead of having other instructions directly reference memory? That feels like the defining “RISCiness” trait these days.

            1. 2

              Yeah, that’s fair. Although I actually thought ARM had added some stuff in the newer multimedia instructions that were basically fused operation/store instructions. But that’s not my wheelhouse, so I may have misunderstood or it might’ve been a different CPU.

            2. 2

              The only things I see RISC-like in the classical definition about a modern ARM CPU is the fact that all the instructions are the same length…

              Aren’t Thumb instructions variable length?

              1. 1

                Thumb is halfword length.

                1. 1

                  Thumb instructions can be 16-bit or 32-bit in length. That sounds like variable length to me.

                  1. 7

                    There is a significant difference between instructions that vary from 1 to 15 bytes (x86) and instructions that are 2 or 4 bytes (Thumb2). The former is much more complex than the latter.

                    1. 4

                      Thumb2 is a variable-length instruction set. x86 is an absolutely-bonkers-length instruction set. :-P

                      1. 2

                        I guess that fits the idea that ARM doesn’t live up to the ideal of RISC but is still more RISC-like than CISC-like?

              2. 10

                It still has far fewer instructions than the dominant CISC architecture (by an order of magnitude).

                It probably has fewer of them, but I don’t think the difference is an order of magnitude. The base ISA of x86 is of course much larger, but both x86 and ARM have multiple ways to do floating-point operations and SIMD (x87, MMX, 3DNow!, SSE, AVX vs FP, VFP, iwMMXt, NEON, SVE, not counting all the different versions). ARM also has multiple instruction encodings: ‘normal’ ARM (32-bit), old and obsoleted Thumb (16-bit), ThumbEE/T32 (variable-length), A64 (64-bit).

                From this, I can only conclude the words “CISC” and “RISC” don’t mean that much these days. If you want to learn about CPUs, it’s better to look at how they actually work instead of reiterating the debates on their design from decades ago. Secondly, the newer x86 additions look relatively “RISC-like” (SSE and friends don’t have funky loop instructions, x86_64 registers are general-purpose, …) and some RISCs (or companies that used to make RISC chips) have adopted more CISC-like approaches in some of their designs (PPC and IBM, anyone?).

                1. 6

                  A64 (64-bit)

                  A64 instructions operate on 64-bit values, but themselves are 32 bits long.

                  the words “CISC” and “RISC” don’t mean that much these days

                  Indeed. I think one meaningful distinction is “load-store vs x86 style operations-on-memory-addresses”, and “RISC vs CISC” seems to be sort of becoming a shorthand for that??

                2. 2

                  It looks complex (pdf) compared to early RISCs. That was 2008. It’s grown since then.

                  1. -2

                    “Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero”

                    Maybe people just need to build better languages …

                  2. 10

                    The traditional difference between RISC and CISC is memory operands for every instruction instead of just loads and stores (as you will notice, x86 doesn’t have a load instruction; it uses the move-register instruction with a memory operand for that). Another defining feature of CISC architectures is the extensive use of microcode for complicated instructions, such as the x86 sine and cosine instructions (FJCVTZS is not a complicated instruction; what it does is quite simple, though it does look like a cat walked over the keyboard). Use of instructions encoded in microcode is strongly discouraged, because it is usually slower than the code written by hand. I don’t see how the rest of the article (cache access times etc.) is in any way relevant to the discussion of how well RISC scales.
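
                    A hedged sketch of that load/store difference (my own example, not from the article); the assembly in the comments is roughly what a compiler emits for this function:

                    ```c
                    /* The same C statement compiled for x86-64 vs AArch64 shows the
                     * memory-operand difference.
                     *
                     * x86-64 (SysV: total in edi, p in rsi) folds the load into the ALU op:
                     *     mov  eax, edi
                     *     add  eax, DWORD PTR [rsi]   ; memory operand, no separate load
                     *
                     * AArch64 (total in w0, p in x1), being load/store, loads first:
                     *     ldr  w8, [x1]
                     *     add  w0, w0, w8
                     */
                    int add_through_pointer(int total, const int *p) {
                        return total + *p;
                    }
                    ```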

                    1. 2

                      What’s the point of having complex instructions (like sin, cos) encoded in microcode, if it’s slower?

                      I always thought the only reason for many of the heavyweight x86 instructions was that they were faster than the fastest hand-optimized assembler code doing the same thing. It certainly is true of the AES instruction set.

                      1. 4

                        What’s the point of having complex instructions (like sin, cos) encoded in microcode, if it’s slower?

                        Because it was once faster.

                        1. 3

                          The only one of the AES instructions encoded in microcode is AESKEYGENASSIST, all the other ones aren’t (split into 2 uops at most) on Broadwell processors. On AMD Zens, all of them are in hardware.

                          1. 1

                            What’s the point of having complex instructions (like sin, cos) encoded in microcode, if it’s slower?

                            Otherwise they need to be in hardware, taking up silicon.

                            1. 1

                              Sorry, I meant: why have these additional instructions at all, if they’re slower?

                              1. 3

                                Otherwise they need to be in hardware, taking up silicon.

                                Because they are part of the ISA, and it needs to remain backwards compatible. They didn’t add the instructions for SSE and AVX, though.

                                1. 1

                                  It’s easy and safe to break backwards compatibility for x86 extensions, because every program is supposed to check whether the CPU it’s running on supports the instruction before executing it. The CPUID instruction tells you which ISA extensions are supported.

                                  For example, if you run CPUID with EAX=1, it sets bit 30 of ECX to indicate RDRAND support and bit 0 of EDX to show x87 support. The floating-point trigonometric functions are part of x87. Intel and AMD could easily drop support for x87 by setting the CPUID flag for x87 to 0. The only programs that would break would be badly-behaving programs that don’t check before using x86 extensions’ instructions.
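
                                  A minimal sketch of that check, assuming GCC or Clang on x86 (the <cpuid.h> helper exists there; the bit positions are the ones described above):

                                  ```c
                                  /* Query CPUID leaf 1 and test two of its feature bits:
                                   * ECX bit 30 = RDRAND, EDX bit 0 = x87 FPU. */
                                  #include <cpuid.h>
                                  #include <stdio.h>

                                  int main(void) {
                                      unsigned int eax, ebx, ecx, edx;
                                      if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                                          return 1;                             /* leaf 1 not available */
                                      printf("RDRAND: %s\n", (ecx & (1u << 30)) ? "yes" : "no");
                                      printf("x87:    %s\n", (edx & (1u << 0)) ? "yes" : "no");
                                      return 0;
                                  }
                                  ```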

                                  1. 7

                                    The only programs that would break would be badly-behaving programs

                                    So, pretty much all of them? Programs don’t check for features that 99% of their userbase has. For example, on macOS you unconditionally use everything up to SSE3.1, because there are no usable Macs without them.

                                    When programs stop working, or become much slower due to traps and emulation, users aren’t any happier knowing whose fault was it.

                                    1. 1

                                      Are you a LLVM developer?

                                      1. 1

                                        No.

                            2. 2

                              Use of instructions encoded in microcode is strongly discouraged, because it is usually slower than the code written by hand.

                              I don’t think this is correct. Microcode is a very limited resource on the CPU, so why would anyone waste space encoding instructions that could be implemented elsewhere if they were slower?

                              1. 4

                                Because the instruction set has to be backward compatible. They decided that floating-point sine and cosine were a great idea when introducing the x87, but they soon found out they weren’t. There’s a suspicious lack of these instructions in SSE and AVX. Still, these instructions waste a lot of space in the microcode ROM.

                            3. 14

                              Hot take: RISC/CISC as design philosophies are both fairly obsolete in 2019. RISC was revolutionary at the time because it designed the instruction set for compilers to write more than for humans, at a time when lots of assembly was still written by humans. Designers had to make fewer compromises and could make faster hardware as a result.

                              Now essentially all software comes out of a compiler with a sophisticated optimizer, and the tradeoffs hardware designers are interested in are vastly different from those of the 1980s. RISC and CISC principles only matter as far as they affect those tradeoffs: memory bandwidth/latency, cache coherence, prediction/speculative execution, etc. (I am not a hardware guy, but I do like making compilers.)

                              Edit: actually, despite the slightly ranty start and finish, this article is great. It talks about basically all this and more.

                              1. 10

                                One interesting problem is that software comes out of a compiler whose model of the machine is essentially a PDP-11..

                                1. 6

                                  That RISC vs CISC distinction is about the compiler backend rather than the language itself. On the PDP-11, *p++ was translated to just one instruction, because the PDP-11 designers anticipated that people would do it often and included instructions for that. For the same reason, in CISC architectures you always have most if not all instructions supporting all operand combinations: register-register, register-memory, and memory-memory. People hate doing the load/store cycle by hand.

                                  RISC is explicitly about not caring for people who write it by hand. For a compiler backend author, it’s trivial to add load and store—once. It’s perfectly RISC’y to break down a thing a human would want to be a single instruction like *p++ into individual parts and let compiler writers decide which one to use. It was never about raw instruction count.
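
                                  Here is a hedged sketch (mine, not from the thread) of the *p++ case in code; the PDP-11 and RISC-V mnemonics in the comments illustrate the one-instruction vs load-plus-add split:

                                  ```c
                                  /* The *p++ idiom discussed above.
                                   *
                                   * PDP-11: one instruction, thanks to autoincrement addressing:
                                   *     MOV (R1)+, R0
                                   *
                                   * A strict load/store RISC (RISC-V shown) splits it in two:
                                   *     lw   a0, 0(a1)     # load *p
                                   *     addi a1, a1, 4     # p++
                                   */
                                  int sum(const int *p, int n) {
                                      int total = 0;
                                      while (n-- > 0)
                                          total += *p++;   /* dereference, then advance the pointer */
                                      return total;
                                  }
                                  ```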

                                  1. 3

                                    Funny thing: *p++ = reg is a single instruction on ARM; I don’t think there is a way to do that in one instruction on x86.

                                    1. 1

                                      Interesting. I haven’t looked deep into ARM, somehow. I need to get to finally reading its docs, though if you know of a good introduction, please point me to it.

                                      1. 1

                                        Sorry, I don’t know a good entry point, I started reading the ISA some time ago but never finished it. It is reasonably readable, though (personally, I think the most readable is by far ppc64, followed by ARM with x86 being the worst)

                                        I just looked up the instruction though: it’s str xT, [xN], #imm for the store (ldr for the load). (ARM instructions are sooo unreadable :( )

                                        1. 2

                                          x86 has the benefit of third-party tutorials, even though they tend to be a) outdated b) worse than outdated c) plain wrong d) all of the above. My personal favorite is the GFDL-licensed “Programming From the Ground Up” (https://savannah.nongnu.org/projects/pgubook/), which is in the first category—ironic because it was completed just before x86_64 became widely available.

                                          An ARM64 fork would be a perfect raspberrypi companion.

                                          1. 3

                                            I’d use the book with caution; for example, all the floating point stuff is completely out of date (no one uses x87 anymore, everyone just uses SSE with half-initialized registers).

                                        2. 1

                                          The ARM System Developer’s Guide is dated (ARM v6) but I found it helpful. If you know another assembly language then I would just skim an instruction set manual.

                                          1. 1

                                            The official A64 ISA guide is quite nice

                                          2. 1

                                            x86 string store instructions (STOSB, STOSW, and STOSD) let you do this for the DI or EDI register.
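
                                            For anyone unfamiliar with the string instructions, here is a hedged sketch (mine) of what that amounts to: STOSD stores EAX at [EDI] and advances EDI by 4, and REP repeats it ECX times, i.e. the *p++ = reg pattern:

                                            ```c
                                            /* What rep stosd does, written as plain C: store a 32-bit value,
                                             * advance the destination pointer, repeat count times. */
                                            #include <stddef.h>
                                            #include <stdint.h>

                                            void fill_words(uint32_t *dst, uint32_t value, size_t count) {
                                                for (size_t i = 0; i < count; i++)
                                                    dst[i] = value;   /* the store-and-advance step of stosd */
                                            }
                                            ```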

                                            1. 6

                                              True. Use of these is discouraged, though. On Zen, they are implemented in microcode.

                                              Also, I found this gem in their documentation:

                                              At the assembly-code level, two forms of the instruction are allowed: the “explicit-operands” form and the “no-operands” form. The explicit-operands form (specified with the STOS mnemonic) allows the destination operand to be specified explicitly. Here, the destination operand should be a symbol that indicates the size and location of the destination value. The source operand is then automatically selected to match the size of the destination operand (the AL register for byte operands, AX for word operands, EAX for doubleword operands). The explicit-operands form is provided to allow documentation; however, note that the documentation provided by this form can be misleading. That is, the destination operand symbol must specify the correct type (size) of the operand (byte, word, or doubleword), but it does not have to specify the correct location. The location is always specified by the ES:(E)DI register. These must be loaded correctly before the store string instruction is executed.

                                              x86 really is the pinnacle of human insanity

                                          3. 1

                                            Load/store by hand is a lot easier when you have a decent number of registers and do not have weird rules about which registers can be accessed from which instructions.

                                            1. 1

                                              I dunno… I wrote a lot of VAX assembly, which was nice but I always had to have the reference nearby because I couldn’t remember all the instructions. I also wrote a lot of ARM4 assembly, and it was small enough that I could remember the whole thing without looking stuff up. So I really preferred the ARM instruction set — of course nowadays it seems to have bloated up to VAX size, but back then the “reduced” nature of it seemed like a good thing even for humans.

                                              (Edit) As an aside, I really wish I could find a copy of the bitblt routine I wrote in VAX assembly using those crazy bitfield instructions. I worked on it for months. Then DEC came out with a faster one and I had to disassemble it to figure out how on earth they did it — turned out they were generating specialized inner loop code on the stack and jumping to it, which blew my mind.

                                            2. 1

                                              I know that is a common story, but it’s not correct. Compilers now can optimize for out-of-order execution, large numbers of registers (a key RISC insight, although maybe one people should revisit), big caches and cache line size, giant address spaces, SIMD, … - all sorts of things that were not supported by or key to PDP-11s. C itself was working on multiple architectures very early in the history of the language.

                                          4. 5

                                            With VLIW, you would very likely have to recompile your program for every single unique CPU

                                            Fun fact: GPUs were VLIW (e.g. TeraScale); this wasn’t a problem for them at all (you recompile shaders on load anyway), and yet the current GPUs (GCN) are all more “RISC”-like (not VLIW). VLIW is kind of a failure.

                                            1. 1

                                              My question is… why is VLIW a failure? Yes, the compiler technology wasn’t up to the challenge in 2000, but in my opinion, it got there by 2010. Is the technical advantage not real, or is it just another case of path dependency?

                                              1. 4

                                                The biggest problem with VLIW is that, to use it effectively, you need large numbers of independent instructions inside a basic block. As it turns out, though, most code has a lot of branches and interdependent code, which forces the compiler to emit large quantities of NOPs. Also, for tight loops you usually don’t need VLIW; vectorization is enough.
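
                                                A hedged illustration of that point for a hypothetical 3-slot VLIW (the bundle layout in the comments is made up for illustration): a dependent chain leaves most slots as NOPs.

                                                ```c
                                                /* Every step below depends on the previous result, so on a hypothetical
                                                 * 3-slot VLIW each bundle can fill only one slot; the rest become NOPs:
                                                 *
                                                 *   bundle 1: { mul t0, a, b  | nop | nop }
                                                 *   bundle 2: { add t1, t0, c | nop | nop }   (needs t0)
                                                 *   bundle 3: { shl t2, t1, 1 | nop | nop }   (needs t1)
                                                 */
                                                int dependent_chain(int a, int b, int c) {
                                                    return ((a * b) + c) << 1;   /* a serial dependency chain */
                                                }
                                                ```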

                                                1. 1

                                                  And that is the one place where the people who complain about Algol-like languages may have a point. I have never seen any work on e.g. executing APL-like languages on a VLIW machine, but it seems like it should be interesting. My suspicion, however, is that this just follows from something fundamental about how digital computers do arithmetic.

                                              2. 1

                                                Oh? Where can I read about why they went RISC-like?

                                              3. 7

                                                I find RISC misguided. The RISC design was created because C compilers were stupid and couldn’t take advantage of complex instructions, and so a stupid machine was created. The canonical RISC, MIPS, is ugly and wasteful, with its branch-delay slots and large instructions that do little.

                                                RISC, no different than UNIX, claims simplicity and small size, but accomplishes neither and is worse than some ostensibly more complex designs.

                                                This isn’t to write all RISC designs are bad; SuperH is nice from what I’ve seen, having many neat addressing modes; the NEC V850 is also interesting with its wider variety of instruction types and bitstring instructions; RISC-V is certainly a better RISC than many, but that still doesn’t save it from the rather fundamental failings of RISC, such as its arithmetic instructions designed for C.

                                                I think, rather than make machines designed for executing C, the ’‘future of computing’’ is going to be in specialized machines that have more complex instructions, so CISC. I’ve read of IBM mainframes that have components dedicated to accepting and parsing XML to pass the machine representation to a more general component; if you had garbage collection, bounds checking, and type checking in hardware, you’d have fewer and smaller instructions that achieved just as much.

                                                The Mill architecture is neat, from what I’ve seen, but I’ve not seen much and don’t want to watch a Youtube video just to find out. So, I’m rather staunchly committed to CISC and extremely focused CISC as the future of things, but we’ll see. Maybe the future really is making faster PDP-11s forever, as awful as that seems.

                                                1. 6

                                                  “The RISC design was created because C compilers were stupid and couldn’t take advantage of complex instructions”

                                                  No. Try Hennessy/Patterson, Computer Architecture, for a detailed explanation of the design approach.

                                                  1. [Comment from banned user removed]

                                                    1. 3

                                                      Patterson:

                                                      we chose C and Pascal since there is a large user community and considerable local expertise.

                                                      You:

                                                      The RISC design was created as a compensation of sorts for the fundamentally poor C language.

                                                      Ok. Because Pascal == C ?

                                                      Patterson

                                                      Another reason why RISC is outperforming the VAX is that the existing C compilers for the VAX are not able to exploit the existing architecture effectively.

                                                      You

                                                      I want to specifically mock this sentence; this was true for C compilers, but most languages are well-designed and didn’t have this issue:

                                                      Oh that’s great to hear. Can you point me to either 1980s studies of “well designed” languages that were able to exploit VAX brilliantly? Or even current studies on how “well designed languages” exploit CISC instructions. Love to know. Patterson’s argument at the time was that compilers, in general, could not exploit the complex instructions of e.g. VAX. VAX had a single instruction to evaluate polynomials: love to know which “well designed languages” exploited POLY and what performance improvement they showed.

                                                      Maybe you could even go as far as naming those anonymous “well designed languages” that you have in mind? In 1980 maybe you mean COBOL, Ada, Bliss, PL/1, Snobol?
                                                      Currently, maybe you want to point to the blazing CISC performance of ??? Java? Python? Haskell?

                                                      The Itanium designers had similar ideas to what you seem to have. On the other hand, all modern CPUs are RISC-like (x86 translates CISC into RISC in the instruction decode step).

                                                      1. 3

                                                        Ok. Because Pascal == C ?

                                                        To be honest, pretty much. Some details differ, but they’re both procedural languages with similarly unsafe memory management. C may have more weird stuff with pointers (pointer arithmetic and function pointers), but they have a lot in common.

                                                        Now Modula and Oberon (which by Wirth’s own words should have been named “Pascal 2” and “Pascal 3”) may be different.

                                                        On the other hand, all modern CPUs are RISC like (x86 translates CISC into RISC in the instruction decode step )

                                                        Reading the Reddit thread disabused me of that notion. Microcode and Micro ops are nothing like RISC, or CISC, or any other ISA for that matter. That internal stuff tends to be highly regular (even more regular than RISC instructions), and they often are about doing stuff that’s internal to the CPU, without necessarily a direct correspondence to the programming model of the CPU. Register renaming for instance is about mapping user-visible aliases to actual registers. I guess that micro operations are about routing data to an actual register, not its user-visible alias.

                                                        1. 1

                                                          OK, so if you say RISC was designed with procedural Algol/FORTRAN-like languages in mind, I’ll agree.

                                                          In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes, pipelining is considerably more challenging. Recent implementations of the x86 architecture actually translate x86 instructions into simple operations that look like MIPS instructions and then pipeline the simple operations rather than the native x86 instructions! - Hennessy-Patterson, Computer Organization and Design.

                                                          Remember that the whole philosophy behind all modern x86 CPUs, since the P6, is to decode x86 instructions into RISC-y micro-ops which are then fed to a fast RISC backend; the backend then schedules, issues, executes and retires the instructions in a smooth RISC way. - https://www.anandtech.com/show/1998/3

                                                          There is an example in https://www.agner.org/optimize/microarchitecture.pdf

                                                          1. 1

                                                            I wouldn’t trust the journalism piece too much, but that manual looks mighty interesting, thanks.

                                                            And of course instructions are broken down into simpler parts. x86 is crazy no matter how you look at it; it’s not surprising steps are taken to make it less crazy. I’d even go as far as to guess that it makes more sense to cache decoded x86 instructions than it does caching decoded RISC-V instructions: the latter are just so much easier to decode that it may cost less to decode them several times than to keep them in expanded form in the instruction cache.

                                                            Just saying that, from what I’ve heard, there’s a difference between RISC-like instructions and how an instruction (of any kind) is broken down for pipelining and such. To be honest, though, I just don’t know. I have yet to seriously check it.

                                                2. 2

                                                  Give me a machine-readable specification that tells the semantics and binary syntax for every instruction in your system, and I would care significantly less about whether it’s RISC or CISC, as long as you keep it a correct and secure system (as in the real security sense, not DRM “security”).

                                                  Also, likely SPMD architectures have won the race. It could be that such systems can be both simple and efficient. Also usable as soon as we get rid of procedural programming and move on.

                                                  1. 4

                                                    Note that ARM is the only widely supported architecture with machine readable specification. See ARM Releases Machine Readable Architecture Specification.

                                                  2. 2

                                                    Was there a time that Intel / CISC was implemented on top of RISC with a translation layer in microcode or something? I swear I was told this once but maybe it was an urban legend

                                                    1. 3

                                                      Modern x86 is translated into microcode that is very similar to what you’d get if RISC and VLIW got frisky with each other; IIRC ARM does that for some complicated instructions. Microcode lets you optimize the CPU usage more.

                                                      VLIW would be essentially taking out that translation layer, the compiler has to make use of what the hardware offers.

                                                    2. 1

                                                      Is there much of a case for VLIW over SIMT/SIMD? (SIMT is the model most/all modern GPUs use, which is basically SIMD, but with conditionals that mask parts of the register/instruction, rather than the entire instruction)

                                                      My basic thinking is that if you have SIMD, conditional masking, and swizzling, you’re going to be able to express the same things VLIW can in a lot less instruction width. And SIMT is data-dependent, which is going to be more useful than the index-dependent instructions of VLIW.

                                                      Basically, I don’t see the case for having ~32 different instructions executing in lockstep, rather than 32 copies of one (conditional) instruction. It seems like it’s optimizing for something rare. But maybe my creativity has been damaged by putting problems into a shape that GPUs enjoy
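
                                                      A hedged sketch (mine, not the parent’s) of the SIMT masking idea in plain C, with a made-up 4-lane width: every lane runs the same conditional instruction, and the mask decides which lanes commit a result.

                                                      ```c
                                                      /* SIMT-style masked execution with 4 "lanes": the mask decides which
                                                       * lanes actually write their result; the rest keep their old value. */
                                                      enum { LANES = 4 };

                                                      void masked_add(int out[LANES], const int a[LANES], const int b[LANES],
                                                                      const int mask[LANES]) {
                                                          for (int lane = 0; lane < LANES; lane++) {
                                                              if (mask[lane])                      /* this lane's condition holds */
                                                                  out[lane] = a[lane] + b[lane];
                                                          }
                                                      }
                                                      ```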

                                                      1. 2

                                                        It is more a question between VLIW and superscalar out-of-order architectures (and not between SIMT and VLIW), and there the latter ones clearly win. On a fundamental level, they are faster because they have more information at runtime than the compiler has at compile time.