I like this distinction. On the one hand you’ve got the bog-standard data parallelism style (i.e. anything you’d write in FORTRAN or ISPC). Type 2 is what I’d call “creative” use of SIMD. Besides hyperscan and simdjson, another fun example is 55GB/s FizzBuzz. The third category is the “arcane magic”, like the gf2p8affineqb trick for index->bit conversions that was posted a while ago.
This taxonomy is also useful because it lets us discern between types of SIMD usage that compiler writers and PL designers target (mostly Type 1), and the types of usages that are best left to hand-rolled implementations (Type 2 & 3).
discern between types of SIMD usage that compiler writers and PL designers target (mostly Type 1), and the types of usages that are best left to hand-rolled implementations (Type 2 & 3)
Maybe this wouldn’t be worth the complexity, but is Type 2 that farfetched for compilers to do, e.g. for the usage the article summarizes as “using SIMD to simulate having an enormous general-purpose register”? When it would be worth it to try to use a SIMD register as a big general-purpose register seems like a quintessential thing for a compiler with an appropriate cost model for the target architecture to decide, especially if a language has built-in larger integer types that it has to compile somehow anyway, like Rust’s i128, or perhaps some future i256/i512. I don’t work on compiler backends though, so this might be completely naive.
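As a rough illustration of the “built-in larger integer types” point (my sketch, not from the article): gcc and clang already expose a 128-bit integer type as an extension, and the backend decides how to lower operations on it. Whether that lowering should ever land in SIMD registers is exactly the cost-model question raised above.

```c
#include <stdint.h>

/* gcc/clang extension: a built-in 128-bit integer type. The compiler
 * chooses how to lower operations on it (today typically GPR pairs,
 * but in principle SIMD registers), analogous to Rust's i128. */
typedef unsigned __int128 u128;

static u128 widening_mul(uint64_t a, uint64_t b) {
    /* A full 64x64 -> 128-bit multiply; no SIMD written by hand. */
    return (u128)a * (u128)b;
}
```

The point of the sketch is only that the frontend never says how the wide type is represented; the backend is free to pick whatever the cost model prefers.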
Compilers can do some Type 2 stuff. For example, gcc can turn multiple contiguous stores into inserts in a ymm register followed by a store.
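A minimal sketch of the kind of pattern meant here (hypothetical function, not from the thread): with optimization enabled, gcc’s store-merging and SLP passes can coalesce adjacent scalar stores like these into wider vector stores instead of four separate movs.

```c
#include <stdint.h>

/* Hypothetical example: four adjacent 64-bit stores. At -O2/-O3,
 * gcc can merge these into vector moves (e.g. ymm inserts followed
 * by a single 32-byte store) rather than four scalar stores. The C
 * semantics are unchanged either way. */
void init_header(uint64_t *p) {
    p[0] = 0x1122334455667788ULL;
    p[1] = 0;
    p[2] = 0;
    p[3] = 0xFFFFFFFFFFFFFFFFULL;
}
```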
That said, I have doubts about more complicated stuff. Hyperscan, which is the library mentioned, is basically a very sophisticated regex interpreter on its own.
Aren’t horizontal and vertical flipped in the taxonomy? (Like, horizontal sum ⟹ sum of all lanes of one register, vertical sum ⟹ per-lane sum of two or more SIMD registers?)
You’re not the only one to point this out, and I regret the overloading of that term. Since it’s just a term, there’s no good reason to overload it. I’m going to try to rework the taxonomy in the second draft to avoid this unfortunate collision, since the major point I want to make is to distinguish “do simple stuff many times” vs “do big stuff” for Type 1 and 2 respectively. However, I’ve also kind of bollixed up the different types of “Type 3” things, as frequently what makes the unique capabilities of SIMD exciting is “being able to do the same thing as normal programming, but without branches and memory accesses” (e.g. masking and table lookups with VPERM*/PSHUFB etc) NOT just “oh, I’ve got a 1-instruction special-purpose thing that would take me a 40 instruction loop otherwise”.
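A scalar model of the PSHUFB-style lookup mentioned above (my sketch, not the author’s): the byte shuffle acts as 16 parallel, branch-free table lookups, with the high bit of each index byte selecting zero. The loop below is just the specification; the real instruction does all 16 lanes at once with no branches.

```c
#include <stdint.h>

/* Scalar model of PSHUFB semantics: each output byte is table[low
 * nibble of the index byte], except that an index with its high bit
 * set yields 0. In the real instruction there is no per-byte branch;
 * this loop only describes the result. */
static void pshufb_model(uint8_t dst[16], const uint8_t table[16],
                         const uint8_t idx[16]) {
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}
```

This is why it’s a “same thing as normal programming, without branches and memory accesses” tool: any 16-entry byte-valued function becomes one shuffle.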
I think it’s horizontal because the lanes of the vector register are successive elements of the same array. The author does a lot of vectorized string algorithms so when you map characters to lanes of a vector register, the register is horizontal.
I think there’s a missing category: if you take the transpose, you can work on several related but independent calculations in parallel. E.g., I’m working on a thing that wants to compare a query string with several keys, and I think it might make sense to swizzle the layout so the keys are interleaved and can be compared in parallel.
You mean something like this? I believe in the author’s categorization it would be vertical (“scalar” input, one “needle” per lane).
Edit: I remembered where I got the horizontal/vertical definitions. From Intel’s guide to maddubs:
Vertically multiply each unsigned 8-bit integer from a with the corresponding signed 8-bit integer from b, producing intermediate signed 16-bit integers. Horizontally add adjacent pairs of intermediate signed 16-bit integers, and pack the saturated results in dst.
So in Intel’s manuals horizontal means that the operation is between lanes of the same input vector, vertical means that it is between lanes of two different inputs.
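In scalar form (my sketch of the documented semantics, not intrinsics), pmaddubsw does a vertical step between the two inputs and then a horizontal step between adjacent lanes of the intermediate products:

```c
#include <stdint.h>

static int16_t sat16(int32_t x) {
    /* Saturate to the signed 16-bit range, as the instruction does. */
    if (x > 32767)  return 32767;
    if (x < -32768) return -32768;
    return (int16_t)x;
}

/* Scalar model of pmaddubsw on one 128-bit register (16 byte lanes).
 * Vertical step: multiply lane i of a (unsigned) with lane i of b
 * (signed). Horizontal step: add each adjacent pair of intermediate
 * products, saturating the result. */
static void maddubs_model(int16_t dst[8], const uint8_t a[16],
                          const int8_t b[16]) {
    for (int i = 0; i < 8; i++) {
        int32_t lo = (int32_t)a[2*i]     * b[2*i];     /* vertical */
        int32_t hi = (int32_t)a[2*i + 1] * b[2*i + 1]; /* vertical */
        dst[i] = sat16(lo + hi);                       /* horizontal */
    }
}
```

So “vertical” pairs corresponding lanes across a and b, and “horizontal” pairs adjacent lanes within the intermediate result, matching the quote above.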
Something like that yes.
He says it’s a draft taxonomy, so I don’t expect it to cover everything or for its distinctions to be entirely clear.
I was mostly surprised because the author works at Intel. At the end of the day, it’s just a name, doesn’t matter.