1. 7
  1.  

  2. 3

    Note that while the article is broadly correct, the RISC-V Vector Extension is still in development and the article is based on an older version. SETVL, for example, now takes three arguments (not two) and has been renamed VSETVL.

    1. 3

      Yes. In particular, VSETDCFG has been gone so long (rolled into VSETVL{I}) that it’s not even found in the RVV specification GitHub repo. In October 2018 I did raise an issue about the similar VREGCFG instruction that existed at the time. That all became moot in December 2018 when something close to the current LMUL scheme was proposed.

      VLD/VST have also changed a couple of times since his code example. And 4-address VFMADD has been replaced by two 3-address instructions where the destination register must be the same as either the addend or one of the multiplicands (which is true in his example code).

      So the DAXPY RISC-V example code can be converted to the current draft spec by deleting the first two lines and making minor adjustments to each of the five remaining vector instructions; a sketch of the result is below.

      https://github.com/riscv/riscv-v-spec/blob/master/example/saxpy.s

      vsetvli a4, a0, e32, m8, ta,ma

      “e32, m8, ta,ma” together specify the 8-bit VTYPE literal in the instruction.

      - e32: 32-bit elements
      - m8: gang 8 vector registers together to make longer registers (so you can only use v0, v8, v16 and v24 in the code)
      - ta (Tail Agnostic): you don’t care whether elements past VL are calculated/altered or not
      - ma (Mask Agnostic): you don’t care whether masked-off elements are left unchanged or replaced by all 1s (especially because we’re not using masking here)
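      Putting those pieces together, here is a rough sketch of the article’s DAXPY loop in the current draft syntax. It’s my own conversion (modelled on the linked saxpy.s, but with 64-bit elements), not code from the spec repo, and it assumes the usual argument registers: n in a0, a in fa0, and the x and y pointers in a1 and a2. The label name is just for illustration.

        daxpy_loop:
          vsetvli   t0, a0, e64, m8, ta, ma   # vl = t0 = min(VLMAX, n), 64-bit elements
          vle64.v   v0, (a1)                  # load a chunk of x
          vle64.v   v8, (a2)                  # load a chunk of y
          slli      t1, t0, 3                 # bytes processed this iteration (vl * 8)
          vfmacc.vf v8, fa0, v0               # v8 += fa0 * v0, i.e. y = a*x + y (3-address FMA, dest = addend)
          vse64.v   v8, (a2)                  # store the updated chunk of y
          add       a1, a1, t1                # advance x pointer
          add       a2, a2, t1                # advance y pointer
          sub       a0, a0, t0                # n -= vl
          bnez      a0, daxpy_loop            # repeat until n == 0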

      1. 1

        As I understand it, the current version (which is indeed newer than the article) is likely to be ratified as-is this summer.

      2. 3

        Some of this feels like it was written in the ‘80s. There’s a lot going on in each SVE instruction and so there’s a high cognitive load on assembly programmers. Who cares? SVE is explicitly designed as a target for compiler autovectorisation. Your compiler doesn’t care that there are a lot of instructions with many variants as long as they’re orthogonal. It’s doing a huge tree-based pattern match over your code. It just wants an instruction to exist that matches the DAG nodes that represent your program. The predication that seems to confuse the author is part of this. Between predication and scatter/gather instructions, it’s possible to autovectorise loops that have divergent flow control and non-uniform memory access patterns. That’s a huge win. Anyone designing a vector ISA for assembly programmers would be completely missing the point.

        There’s a very different philosophy here. SVE, especially SVE2, is really aggressively designed around vectorising loops and avoiding having any hidden microarchitectural state. If you’re writing a sequence of instructions where data flows from one to the next, it does so via registers and if you take an interrupt in the middle then you have a tiny amount of state to discard, even if you’re only part way along processing a huge vector. You can chain together a load of operations on a part of a vector and have these complete in parallel with the next loop iteration if there are no loop-carried dependencies.

        The question that the RISC-V community should be asking is why a team that builds some of the world’s fastest supercomputers and includes a number of ex-Cray folks (and so is intimately familiar with Cray-style vectors) designed SVE and not something more Cray-like.

        1. 2

          The RISC-V Vector ISA working group also has experienced supercomputer people on it.

          You might remember Steve Wallach from such classics as “The Soul of a New Machine”, but he’s also done quite a lot since then, such as founding Convex Computer and winning the 2008 Seymour Cray Computer Engineering Award for his “contribution to high-performance computing through design of innovative vector and parallel computing systems”.

          The RISC-V Vector extension was originally derived from the Hwacha experimental vector processor. An explicit design goal (taking up a large part of Hwacha designer Yunsup Lee’s PhD thesis) is to enable efficient compilation of SIMT code such as CUDA and OpenCL to a vector machine, using predication with support for managing divergent and convergent control flow.
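          To make the predication point concrete, here’s a small hand-written sketch (mine, not from the thesis or the spec examples) of how a divergent branch in SIMT-style code maps onto RVV masking in the current draft, for something like if (x[i] < 0) x[i] = -x[i]; over 32-bit ints, assuming n in a0 and the pointer to x in a1:

            abs_loop:
              vsetvli  t0, a0, e32, m8, ta, mu   # mu: masked-off lanes keep their old values
              vle32.v  v8, (a1)                  # load a chunk of x
              vmslt.vx v0, v8, x0                # mask v0[i] = (x[i] < 0): the “taken” lanes
              vrsub.vx v8, v8, x0, v0.t          # x[i] = 0 - x[i], only where the mask is set
              vse32.v  v8, (a1)                  # store the chunk back
              slli     t1, t0, 2                 # bytes processed (vl * 4)
              add      a1, a1, t1                # advance pointer
              sub      a0, a0, t0                # n -= vl
              bnez     a0, abs_loop              # repeat until n == 0

          Both arms of the branch become lanes of the same vector instructions; the mask register just selects which lanes are affected, which is the same basic mechanism SVE’s predicates provide.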