1. 23

  2. 23

    A few observations:

    But if SIMD is so awesome, why is RISC-V ditching it and going for vector processing instead?

    RISC-V is not. There is an official SIMD extension to RISC-V. That said, RISC-V began life as the control-plane processor for Hwacha, a Cray-style vector processor, so it’s not surprising that it would also gain a Hwacha-style extension.

    Thus SIMD as it has developed is untenable. There are new instructions every few years

    The new instructions are not added for fun, they are added to introduce more complex operations that provide a real-world speedup for at least one workload. Whether you use SIMD or Cray-style vectors, you’ll still have this problem. Vector add as the example hides this. There are a lot of complex vector operations that exist in x86 and Arm because profiling on existing code showed that they would be useful.

    There’s a lot in this article that’s confused or misleading. For example, it conflates the vector register width and the ALU width in SIMD systems. Intel’s Atom processors are a counterexample: they have 128-bit SSE registers that feed 64-bit ALUs. They still provide a speedup, because each vector operation dispatches over two cycles and doesn’t need any extra decode or register renaming.

    The digression about ML is also weird. ML processors are custom because they use custom data types. This is one of the big reasons for bloat in SIMD instruction sets: even just for add, you need variants to add two vectors containing 8-, 16-, 32-, and 64-bit integers, 16-, 32-, and 64-bit IEEE floating point values. ML adds a variety of 1-, 2-, 4-, 8-, and 16-bit data types, so requires a load more instructions (at least 5 more instructions for each of your existing instructions, so even with the basic set of C operators you’re looking at a pretty large number of instructions).

    The article also completely ignores Arm’s Scalable Vector Extension (SVE). It’s a very interesting half-way step that provides a lot of the benefits of both. With a Cray-style vector, the CPU is responsible for defining the loop on each instruction. That makes pipelining quite difficult. Imagine a simple vector ISA with no fused multiply-add. You’re computing r3 = r2 + (r1 * r0). With a Cray-style vector, this will look roughly like:

    vload r0, ...
    vload r1, ...
    vload r2, ...
    vmul r3, r1, r0
    vadd r3, r3, r2

    On a SIMD system, it will look more like:

    vload r0, {r0 base}, {loop induction register}
    vload r1, {r1 base}, {loop induction register}
    vload r2, {r2 base}, {loop induction register}
    vmul r3, r1, r0
    vadd r3, r3, r2
    add {loop induction register}, {vector size}
    compare-and-branch {if we've finished}

    Now assume you do this on a 2048-element vector on a processor with a single 64-bit vector add unit and a single 64-bit multiply unit. You’d like to do this, per clock cycle:

    1. Multiply the first 64 bits.
    2. Add the first 64 bits to the result of 1, multiply the second 64 bits.
    3. Add the second 64 bits to the result of the multiply in cycle 2, multiply the third 64 bits.

    And so on. With a Cray-style vector unit, you now have two many-cycle instructions that are partially completed. Now what happens if you take an interrupt? The processor either needs to save all of the state associated with those partial instructions or it needs to discard a potentially large amount of work and redo it. This gets even more fun if the sequence includes a vstore that may alias with the vload.

    To make this even more fun, in the case of the RISC-V V extension there’s a limit to the maximum vector size, so if your data size is not fixed at compile time (e.g. if you’re multiplying two arbitrary-sized matrices) then you need to handle the case where your vector is larger than the maximum vector width supported by the processor.

    This is much easier in the SIMD version. Every operation is on an architectural register, and a trap just needs to preserve the architectural register state. SIMD units typically have a lot of vector registers, so they don’t need much hidden state for good performance (the number of rename registers is only slightly larger than the number of architectural registers). Because each loop iteration has a single multiply and add, it’s trivial for the CPU to pipeline these and forward the result of the multiply into the add. The fact that you’ve only multiplied and added half of a 2048-element vector is all architectural state.

    Amusingly, the Arm alternative is far more in keeping with the RISC philosophy. With SVE, the CPU has a load of vector registers that are 128-2048 bits wide (implementation defined) and the size can be queried. The compiler then generates a SIMD-style loop that can operate on any of these vector sizes, querying an MSR to find out what the size is (in the simplest cases this is used as the stripe size for the loop induction variable). As with traditional SIMD, all state for part of a source-language vector is architectural, so the compiler can reason about it and ensure aliasing is not a problem, and the OS can easily store it in trap frames.
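    To make that loop shape concrete, here is a minimal Python emulation of an SVE-style vector-length-agnostic loop. The names (whilelt, the vl parameter) are illustrative, not the real ACLE intrinsics; on hardware vl would come from the queried register, and the predicate would mask lanes in silicon rather than in a Python loop.

```python
def whilelt(i, n, vl):
    """Predicate: lane j is active while i + j < n (masks the tail)."""
    return [i + j < n for j in range(vl)]

def sve_style_add(a, b, vl):
    """One binary, any vector length: the loop strides by the
    hardware-reported vl and predicates away the leftover lanes."""
    n = len(a)
    c = [0] * n
    i = 0
    while i < n:
        p = whilelt(i, n, vl)
        for j in range(vl):            # each *active* lane does the add
            if p[j]:
                c[i + j] = a[i + j] + b[i + j]
        i += vl                        # stripe size = vector length
    return c
```

    The same function produces the same result whether vl is 4 or 64, which is exactly the property SVE provides architecturally: the binary does not encode the vector width.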

    The article also ignores the elephant in the room: vector units on CPUs are not designed for hand-written vector code anymore. They’re designed for auto-vectorisation. This is why scatter-gather and predication are so important: they allow the compiler to use vector instructions for any loop that has no loop-carried dependencies (or which can be rewritten to eliminate loop-carried dependencies), irrespective of whether it operates on regular data or has a power-of-two number of iterations. Again, these instructions contribute a lot to the SIMD instruction bloat that the author complains about and are there because they make a huge difference to the amount of code that is amenable to autovectorisation.

    Looking at the RISC-V V extension, the vsetvl instruction controls the vector length for all vector registers and so I expect code that handles mixed-length vectors (i.e. most autovectorised code) will need a lot of those. It’s really unclear to me that this has any benefits relative to SVE and it has several disadvantages.

    1. 5

      I don’t know where you got your information about Cray 1 and RISC-V but it’s obviously some kind of misconception.

      The major difference between SVE and the others is that SVE processes data that is shorter than the vector register using predication, while RISC-V and Cray have an explicit vector length register.

      On Cray 1 you write a loop to do C = A + B for arbitrary length vectors like this:

      while (n > 0) {
          int len = n > 64 ? 64 : n;
          set_vl(len);   /* set the explicit vector length register */
          vec A = vload(a_ptr);
          vec B = vload(b_ptr);
          vec C = vadd(A, B);
          vstore(c_ptr, C);
          a_ptr += len;
          b_ptr += len;
          c_ptr += len;
          n -= len;
      }
      If the vector is shorter than 64 elements then the last elements of the register will not be processed. If the vector is larger than the vector registers then the loop will execute multiple times. If the vector length is not a multiple of 64 then the final shorter vector will automatically be processed the same as a short vector would.
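      To make the strip-mining concrete, here is that loop emulated in Python (the vector loads, add, and store collapsed into list slices; 64 is the Cray-1 vector register length):

```python
VLEN = 64  # Cray-1 vector registers hold 64 elements

def cray_add(a, b):
    """Strip-mined C = A + B: each trip processes at most VLEN elements."""
    n = len(a)
    c = []
    lens = []                  # record each iteration's vector length
    i = 0
    while n > 0:
        l = min(n, VLEN)       # int len = n > 64 ? 64 : n;
        c += [x + y for x, y in zip(a[i:i + l], b[i:i + l])]
        i += l
        n -= l
        lens.append(l)
    return c, lens
```

      For 150 elements the loop runs three times with vector lengths 64, 64, and 22; the short final stripe goes through exactly the same code as a full one.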

      You don’t need the ugly tail-cleanup code (and often the initial code to align the vectors) that SIMD needs.
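      For contrast, here is a sketch of the tail cleanup a fixed-width SIMD loop needs, again emulated in Python (WIDTH stands in for the SIMD register width in elements; the inner loop stands in for one vector instruction):

```python
WIDTH = 4  # fixed SIMD width, e.g. four 32-bit lanes in a 128-bit register

def simd_add(a, b):
    """Fixed-width SIMD C = A + B: a vector body plus a scalar tail."""
    n = len(a)
    c = [0] * n
    main = n - n % WIDTH           # vector body handles whole registers only
    for i in range(0, main, WIDTH):
        for j in range(WIDTH):     # one vadd over a full register
            c[i + j] = a[i + j] + b[i + j]
    for i in range(main, n):       # scalar tail cleanup, element at a time
        c[i] = a[i] + b[i]
    return c
```

      The tail loop is exactly the code the vector-length-register approach lets you delete.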

      The major difference between Cray and RISC-V is that the Cray 1 came with only one size of vector register (64 floating-point values) and the programmer had to know the length (as shown above). On RISC-V the vector register length can be anything from 1 element (of the maximum element size supported) to 2^31 elements (or bits – I don’t remember right now, and I think maybe it hasn’t been definitively decided yet). Certainly much more than SVE’s 128-bit to 2048-bit architectural limit.

      A RISC-V vector add looks like this:

      while (n > 0) {
          int len = vsetvli(n, vec32i_t);
          vec A = vload(a_ptr);
          vec B = vload(b_ptr);
          vec C = vadd_32i(A, B);
          vstore(c_ptr, C);
          a_ptr += len;
          b_ptr += len;
          c_ptr += len;
          n -= len;
      }

      The RISC-V program doesn’t know or care how long the vector registers are. The hardware tells you, on each loop iteration.

      It seems you know that the Cray 1 processed vectors one element at a time, taking 64 cycles to process the entire vector register. You also know that due to “chaining” the hardware could load the first element of A and B in cycle 1, add those elements in cycle 2, and store them in cycle 3.

      RISC-V imposes no such limitation on the hardware implementation. Some implementations might process one element at a time like the Cray 1, but it’s much more likely that they would process 2 or 4 or 8 elements at a time, with or without chaining. The most common choice would probably be a quarter as many ALU lanes as vector elements, so that it takes 4 clock cycles to process the entire vector. Other implementations might process the entire vector register in parallel in one clock cycle.

      The chip designer can size the vector registers based on the expected workload, how many ALUs they want to build, and the latency and bandwidth of the memory system (whether some cache level or RAM).

      The programmer doesn’t have to know anything about this. The code is identical for every machine, and runs optimally given the choices the CPU designer made.
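      That portability claim can be demonstrated directly. In the Python emulation below, vsetvli is reduced to its essential behaviour (return min(n, VLMAX), where VLMAX is whatever the implementation built); the same loop gives bit-identical results whatever VLMAX the "hardware" chose:

```python
def rvv_add(a, b, vlmax):
    """The vsetvli-style loop: the hardware reports how many elements
    it will process this iteration; the code never hardcodes a width."""
    n = len(a)
    c = []
    i = 0
    while n > 0:
        vl = min(n, vlmax)     # int len = vsetvli(n, vec32i_t);
        c += [x + y for x, y in zip(a[i:i + vl], b[i:i + vl])]
        i += vl
        n -= vl
    return c
```

      Running it with vlmax of 2, 64, or larger than the whole array produces the same answer, which is the sense in which one binary runs optimally on every implementation.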

      If the implementation executes RISC-V vector instructions in a single cycle then there is no implication for trap handling. Even if it is 2 or 4 cycles that may not be a problem – just complete the instruction. But the RISC-V Vector extension has a vstart CSR that can be used by the hardware to save the point in the instruction that it was at when an interrupt occurred, and when the interrupt returns the hardware can re-run the instruction starting from that point.

      It’s true that an implementation using chaining might have a bit of fun trying to save the machine state on an interrupt. If you want to make such an implementation then you could choose to take that pain, or you could say that the cores with the vector unit (often fairly simple minion cores) either don’t take interrupts at all or have potentially quite long interrupt response, and direct most of the interrupts in the system to some other core.

      You say “The new instructions are not added for fun, they are added to introduce more complex operations that provide a real-world speedup for at least one workload.”

      The author of the article (who I admit was somewhat confused – it’s best to read the original Patterson and Waterman article, or the draft reference manual and code examples) was not talking about adding useful new instructions. He was talking about making a complete set of instructions for MMX and then a few years later throwing those away and making a complete set of SSE instructions. And then a few years later making a complete duplicate set of AVX instructions. And then AVX-512.

      It’s little different in the ARM world with DSP instructions, SIMD extensions for Multimedia, NEON, SVE, and MVE.

      The RISC-V Vector extension has a single set of instructions which work the same, running the same binary, on machines with vector registers from 4 bytes up to gigabytes (if someone ever wants to build such a thing).

      The initial version of the RISC-V V extension has all the useful instructions that previous vector or SIMD machines had. The Working Group has members from many different companies, with experience ranging from the CDC 6600 and the Cray to more modern supercomputers and DSPs and everything in between.

      That doesn’t mean that more won’t be added in future, but version 1.0 draws on a long history. What is sure is that new instructions won’t be needed in future simply because the register length got doubled (again).

      1. 5

        With a Cray-style vector unit, you now have two many-cycle instructions that are partially completed. Now what happens if you take an interrupt?

        This is rhetorical, right? You know what happens. It is specified in section 18.1 of the RISC-V V specification, “precise vector traps”. The only extra state is the vector index; there is no redo. It’s not a big deal.

        1. 4

          Sure, which means that if you’re doing any forwarding you have to roll back everything after the first instruction in the sequence that you’re forwarding values between. That’s a lot of microarchitectural state to keep during speculative execution (a lot more than in an equivalent SIMD execution) in order to roll back. That doesn’t matter in a DSP or HPC accelerator, but it matters a lot in a CPU core.

          There’s a reason that the extension to AArch64 co-developed by the company responsible for most of the mobile market and the company that routinely designs the world’s top supercomputer was not a pure Cray vector architecture. You’ll note that it is, in fact, at the top of the TOP500 list now, yet also scales nicely down to mobile phone cores. Looking at the top 10 in the TOP500 list, I don’t actually see any Cray vector processors. It’s possible that everyone is missing a trick, but given that Cray used to dominate that list 20-30 years ago, I’m somewhat skeptical.

      2. 7

        This would interest me, but the article is behind the Medium account wall. Workaround:


        1. 11

          Notice the “cached” button on every story in lobsters.

          1. 4

            Doh, how could I have missed that. Thanks!

            1. 2

              This doesn’t seem to work. It links to archive.md, which just times out. I’ve never heard of archive.md before. What is it?

              1. 3

                A bit like an on-demand Wayback Machine. It archives links.

                Good chance you’re using Cloudflare for DNS resolution. Apparently there’s some disagreement between Cloudflare and archive.md, leading to failed resolution.

            2. 3

              If you set up a cookie-autodelete plugin you’ll never see the Medium login wall. I forget it exists until somebody like yourself (rightly) complains about it.

            3. 5

              I really like this in concept, it basically moves the API barrier of SIMD hardware so that the software using it can be a little more abstract. I think this is desirable because it means the SIMD hardware can be a lot more flexible in how it works; the size of register, size of operation, etc. are all things that the software doesn’t care about. As the article points out, this means beaucoup instruction density and more implementation flexibility.

              However, I am wary because this also means the CPU is in charge of keeping track of more state, and that means more data dependencies for it to be aware of. I don’t know much about CPU design, but I’m imagining it trying to do things like keep data caches in sync, reorder instructions or do a context switch and suddenly there’s just this big chonk of vector code dealing with variable-sized input that it can’t reason much about.

              So all in all I like this, but by the end it gets a little religious in tone. It would be nice if it talked about some of the downsides of the approach and how well it scales up to more complicated workloads, instead of just saying “you can make real good, low-power vector-crunching processors out of this”. We know you can make real good, low-power vector-crunching processors by taking out a bunch of the fancy fluff that makes modern CPUs fast; that’s what GPUs and embedded DSPs are for. We’ve been doing that for decades. If you’re going to contrast this with desktop SIMD, you should talk about how well it does in situations where you’d normally use desktop SIMD.