This is to a large extent what’s always made x86 SSE programming so fun, I feel: it has a wildly non-orthogonal instruction set filled with instructions that feel way faster than they should be given their power. Examples include pmaddwd/pmaddubsw, psadbw, pshufb, pmovmskb, etc. Some of these have no fast equivalents on other architectures (like NEON), so porting code from SSE gets particularly hairy. Making fast code (particularly integer code) in SSE ends up being all about finding obnoxiously clever, unexpected uses for those instructions (and working around SSE’s endless flaws).
Of course, having to constantly find clever ways to work around things is a good argument that SSE is supremely poorly designed ;-)
(The bit in the article about -why- MMX, and accordingly integer SSE, were designed in such a weird way is particularly interesting; the goal was minimizing instruction count instead of creating something orthogonal and complete. A lot of the “bad design” of x86 comes from the fact that, like MMX, it was designed “well” – just for a different goal than the one people nowadays actually want.)
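To make the "unexpected uses" point concrete, here is a minimal sketch of two well-known tricks: psadbw (sum of absolute differences) against an all-zero register works as a horizontal byte sum, and pcmpeqb followed by pmovmskb turns a 16-byte compare into a plain integer bitmask. The function names `sum16_bytes` and `find_byte16` are made up for illustration, and the scalar `#else` branches are my own addition so the sketch still compiles on non-x86 targets; treat it as an illustration under those assumptions, not as anything taken from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Sum 16 unsigned bytes (hypothetical helper name).
 * psadbw computes sum(|a[i] - b[i]|) per 8-byte half; with b = 0
 * that is simply the sum of the bytes in each half. */
static uint32_t sum16_bytes(const uint8_t *p)
{
    assert(p != NULL);
#if defined(__SSE2__)
    __m128i v   = _mm_loadu_si128((const __m128i *)p);
    __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());   /* psadbw */
    /* One partial sum sits in the low bits of each 64-bit lane. */
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(sad);
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
    return lo + hi;
#else
    /* Scalar fallback for targets without SSE2. */
    uint32_t s = 0;
    for (int i = 0; i < 16; i++) s += p[i];
    return s;
#endif
}

/* Index of the first byte equal to `needle` in a 16-byte block,
 * or -1 if there is none (hypothetical helper name). */
static int find_byte16(const uint8_t *p, uint8_t needle)
{
    assert(p != NULL);
#if defined(__SSE2__)
    __m128i v  = _mm_loadu_si128((const __m128i *)p);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)needle)); /* pcmpeqb */
    unsigned mask = (unsigned)_mm_movemask_epi8(eq);              /* pmovmskb */
    if (mask == 0) return -1;
    /* Bit i of the mask corresponds to byte i; find the lowest set bit. */
    int i = 0;
    while (!(mask & 1u)) { mask >>= 1; i++; }
    return i;
#else
    for (int i = 0; i < 16; i++) if (p[i] == needle) return i;
    return -1;
#endif
}
```

Neither instruction was obviously "meant" for these jobs (psadbw targets motion estimation, for one), which is rather the point: the fast path comes from noticing what the odd instruction happens to compute.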
For the uninitiated:
https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions (SSE)
https://en.wikipedia.org/wiki/SIMD (Single Instruction, Multiple Data)