1. 23
    1. 4

      SIMD Instructions Considered Harmful is cited by the linked article; it presents a much better case for the author’s arguments and includes real example code.

      1. 3

        I actually find that article less convincing. The real benefit of vector SIMD, as far as I’m concerned, is variable operand size and the ease with which compilers can generate good code for it. I have no idea whether it’s faster to have a CPU that mechanistically consumes fixed-width instructions and does what they say, or one that essentially has a small special-purpose microcode engine for vector number crunching, but I do know that instruction density is only part of the question. Which approach is faster depends on the CPU design, and afaict which approach can be faster depends on a balance of engineering tradeoffs (memory bandwidth, the number of transistors you can devote to it, etc.) that is always changing. The long-term trends do appear to be moving in favor of the vector SIMD approach, but that doesn’t mean fixed-width SIMD is evil or crazy or stupid, just that maybe its time has passed.

    2. 1

      Finally, any software that wants to use the new instruction set needs to be rewritten (or at least recompiled)

      Since that reads like a complaint - what’s a realistic situation where you compile an app using 128b-wide SIMD and would like the same ops to transparently use 256b instead if available? How could that work, given that the code copying the values likely wouldn’t get the same treatment?

      1. 9

        This is what SVE accomplishes. When doing autovectorisation, the compiler first identifies sets of identical operations on independent data that can be executed in parallel. It then tries to split this into chunks of the available vector width.

        Consider everyone’s favourite toy example: matrix multiplication. You’re multiplying every element in a row in one matrix by every element in a column of another. The amount of available parallelism is the size of that row/column. If you’re multiplying 1000x1000 matrixes then you could (trivially, for the inner loop, excluding any loop-nest optimisations) multiply 1000 elements in parallel.

        If your elements are double-precision floating-point values then a 128-bit vector width means that you can multiply two at a time. You compile your code with SSE and you get a nice 2x speedup. Then AVX comes along and you could multiply four at a time, but you need to recompile and now the new version doesn’t run on older CPUs. Then AVX-512 comes out and you could get an 8x speedup over your baseline, but you can’t actually ship that because most customers don’t have CPUs with AVX-512 and you don’t want to ship a third version.

        With something like SVE, the compiler instead emits a loop that queries the vector width and issues loads, multiplies, and stores for as many elements as fit in a register. Because 1000 is not guaranteed to be a multiple of the vector length (and, if the trip count isn’t a compile-time constant, it certainly can’t be assumed to be), it will also include a tail portion that uses masked operations, so the last chunk uses less than the whole register.

        If you run this on a mobile device with a 128-bit vector unit, each loop iteration will multiply two elements (assuming no loop unrolling). If you run the same binary on a Fujitsu supercomputer then each iteration will multiply 8 elements, possibly 16 in the next generation, with no recompile.
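
        A minimal sketch of that vector-length-agnostic loop, written with ACLE SVE intrinsics (the function and array names are mine; the compiler derives something along these lines from a plain scalar loop):

        ```c
        #include <arm_sve.h>
        #include <stddef.h>
        #include <stdint.h>

        /* dst[i] = a[i] * b[i]: the same binary handles two doubles per
           iteration on a 128-bit implementation and eight on a 512-bit one. */
        void mul_arrays(double *dst, const double *a, const double *b, size_t n) {
            for (uint64_t i = 0; i < n; i += svcntd()) {         /* svcntd() = 64-bit lanes per vector */
                svbool_t pg = svwhilelt_b64_u64(i, (uint64_t)n); /* predicate masks the tail chunk */
                svfloat64_t va = svld1_f64(pg, &a[i]);           /* masked loads */
                svfloat64_t vb = svld1_f64(pg, &b[i]);
                svst1_f64(pg, &dst[i], svmul_f64_m(pg, va, vb)); /* masked multiply + store */
            }
        }
        ```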

      2. 6

        To complement other answers, there’s the generic concept of SWAR, “SIMD within a register”, also known as “broadword” techniques. In SWAR, any operation which can be done on a list of single-width data can be doubled up to operate on a list of double-width data, which corresponds to your example of 128 to 256 bits. In general, SWAR algorithms get faster as they are ported to ISAs with wider registers. It is a simplified (decategorified) analogue of programming a GPU, where different workloads can run with differing amounts of parallelism without altering the physical processor.
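
        To make the SWAR idea concrete, a minimal sketch in C (the function name and lane choice are illustrative): eight independent 8-bit additions done with one ordinary 64-bit integer add, by keeping carries from crossing lane boundaries. The same algorithm with halved constants works on a 32-bit word, which is the widening property described above.

        ```c
        #include <stdint.h>

        /* Add eight packed 8-bit lanes at once: clear each byte's top bit so carries
           cannot spill into the next lane, add, then XOR the correct top bits back in. */
        uint64_t swar_add_u8(uint64_t a, uint64_t b) {
            const uint64_t H = 0x8080808080808080ULL;     /* top bit of every byte */
            return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
        }
        ```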

      3. 5

        Since that reads like a complaint - what’s a realistic situation where you compile an app using 128b-wide SIMD and would like the same ops to transparently use 256b instead if available? How could that work, given that the code copying the values likely wouldn’t get the same treatment?

        LINPACK, simulations, weather prediction, many of the SPEC floating-point benchmarks, and other HPC applications can benefit substantially from wider vector registers. The author’s alternative is variable-length vector registers, such as ARM’s SVE: the vector width depends on the implementation, and no code changes are needed to take advantage of wider registers.

      4. 3

        Every time you apply the same (series of) operations to thousands or millions of items (e.g. computing the sum of two vectors). In that case it doesn’t matter how many items you process per instruction.

        A special case of this is autovectorization. You write a loop that processes one item at a time and pray that the compiler can figure out how to apply the widest instructions possible.
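
        A minimal sketch of that pattern (names are illustrative): compiled with e.g. gcc or clang at -O3 (or -O2 with vectorisation enabled), a loop like the one below is a textbook autovectorisation candidate, and the width of the generated SIMD code depends entirely on the target flags.

        ```c
        #include <stddef.h>

        /* One element per iteration, no cross-iteration dependency. "restrict"
           promises the arrays don't alias, which is often what stands between
           the compiler and vector code. */
        void vec_add(float *restrict dst, const float *restrict a,
                     const float *restrict b, size_t n) {
            for (size_t i = 0; i < n; i++)
                dst[i] = a[i] + b[i];   /* may become SSE, AVX, NEON or SVE */
        }
        ```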

        1. 1

          In that case it doesn’t matter how many items you process per instruction.

          But it does! If you know you don’t need a precise answer, then going for 4x low-precision ops often gives you a speedup over 2x high-precision.

          1. 1

            If you know you can use low precision and have 1000 elements to process, then you don’t care whether it’s 250 instructions of 4x low-precision on 4x-wide hardware, 125 instructions of 8x low-precision on 8x-wide hardware, 1 instruction on vector hardware, or 1000 instructions on non-SIMD hardware. It only matters that you get the fastest/widest instructions that each piece of hardware can do.

      5. 3

        Video games doing vector/matrix ops.
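
        For concreteness, a minimal sketch (the names and the column-major layout are my assumptions) of the kind of fixed 4-wide work games do constantly: a 4x4 matrix times a vec4 with SSE intrinsics.

        ```c
        #include <xmmintrin.h>   /* SSE */

        /* out = M * v for a column-major 4x4 matrix: scale each column of M by
           the matching component of v and accumulate the four partial results. */
        void mat4_mul_vec4(float out[4], const float m[16], const float v[4]) {
            __m128 r = _mm_mul_ps(_mm_loadu_ps(&m[0]), _mm_set1_ps(v[0]));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[4]),  _mm_set1_ps(v[1])));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[8]),  _mm_set1_ps(v[2])));
            r = _mm_add_ps(r, _mm_mul_ps(_mm_loadu_ps(&m[12]), _mm_set1_ps(v[3])));
            _mm_storeu_ps(out, r);
        }
        ```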

      6. 2

        I think UTF-8 validation with SIMD falls under this use case, since it’s “how many bytes can you check at once?” It has been a while since I read the paper though.
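
        (Presumably Lemire and Keiser’s “Validating UTF-8 In Less Than One Instruction Per Byte”.) Not the full validator, but a minimal sketch of the “how many bytes can you check at once” idea: the ASCII fast path such validators start with, checking 16 bytes per instruction with SSE2 (the function name is mine).

        ```c
        #include <emmintrin.h>   /* SSE2 */
        #include <stdbool.h>
        #include <stddef.h>

        /* _mm_movemask_epi8 collects the top bit of all 16 bytes; a non-zero mask
           means some byte is >= 0x80 and the multi-byte UTF-8 checks are needed. */
        bool all_ascii_sse2(const unsigned char *buf, size_t len) {
            size_t i = 0;
            for (; i + 16 <= len; i += 16) {
                __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
                if (_mm_movemask_epi8(chunk) != 0)
                    return false;
            }
            for (; i < len; i++)            /* scalar tail for the last < 16 bytes */
                if (buf[i] & 0x80)
                    return false;
            return true;
        }
        ```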