1. 43
  1.  

    1. 10

      Looks like an explanation has been found https://news.ycombinator.com/item?id=42580547

      1. 6

        I didn’t see one yet, so filed https://github.com/llvm/llvm-project/issues/121546 to get LLVM to try hard to avoid this.

        May also be able to ask some Intel folks there for a better understanding of the issue.

        1. 4

          Basically, when the shift amount is set using the wrong kind of instruction, it causes a 3x throughput hit.

          The 3x hit is in latency. Throughput gets a 2x hit, where the instruction can only run once per cycle instead of twice.
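
          E.g., something like this untested sketch (Intel syntax; register choices are illustrative) separates the two measurements:

                  ; Latency: each SHLX depends on the previous result, so a
                  ; dependency chain exposes the reported 3-cycle latency
                  SHLX RAX, RAX, RCX
                  SHLX RAX, RAX, RCX
                  SHLX RAX, RAX, RCX

                  ; Throughput: independent SHLXs with no chained dependency,
                  ; normally 2/cycle, reportedly 1/cycle in the slow case
                  SHLX RAX, RSI, RCX
                  SHLX RBX, RSI, RCX
                  SHLX RDX, RSI, RCX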

        2. 4

          Back in the days of the Pentium chips, the CPU would remember if it was certain that the high 16 bits of a 32-bit register had been explicitly cleared (e.g. by XORing it with itself), and if so then a subsequent operation on the low 16 bits only would still be fast. It sounds like something similar is happening here. An explicit zeroing of the high bits of the 64-bit register (e.g. via an XOR, or a MOV using the 32-bit version of the register) gets the fast shift; otherwise the CPU falls back on a slower shift process that looks at all bits of the register.

          1. 5

            Yeah I was thinking along those lines at first too. I believe there is some amount of renaming of subregisters (e.g. top/bottom half of RAX) that goes on. However, that isn’t the whole picture, because the top half doesn’t need to be 0 for it to be fast! E.g. this is still fast:

                    MOV RCX, -1
                    MOV RCX, RCX
            

            Regardless, it is very strange for SHLX to care at all about the top bits. Only the lowest 6 bits actually affect the instruction. There’s no reason for a dependency on the high bits at all.
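
            To illustrate (an untested sketch; per the architectural definition, SHLX in 64-bit mode masks the count to its low 6 bits):

                    MOV  RCX, 0xFFFFFFFFFFFFFFC3 ; low 6 bits = 3
                    SHLX RAX, RSI, RCX           ; result is RSI << 3;
                                                 ; the high 58 bits of RCX are
                                                 ; architecturally ignored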

            1. 3

              Good news is compilers are unlikely to emit a 64-bit immediate load for the shift count.

              1. 2

                True! But things like this still trigger it, which might be more likely:

                        INC RCX
                        SHL RAX, CL
                
                1. 4

                  Wait, so it isn’t a SHLX anomaly, but a general shift anomaly for certain ways of creating the shift amount???

                  In some ways, this makes sense… it’s not like there are two unrelated ports to execute on…

                  I actually wonder if part of the problem is having to use CL for the non-SHLX variant…

                  1. 6

                    I actually wonder if part of the problem is having to use CL for the non-SHLX variant…

                    That’s the most plausible explanation I’ve seen so far. If you designed for SHL then maybe you only put 16 wires in for the shift port in the barrel shifter.

                    The fast sequences seem to be ones where you can tell at the decode stage that the register is 16 bits, with a few missing cases. These are all annoying in register rename because you have to handle writes to AH as masked stores generating a new rename register; if you know that the AH bits are zero then you can do a little bit more reordering. Possibly the RAX case is ignored because you rarely write RAX and AH as the same rename register.

                    If you then add SHLX, reusing an existing SHL pipeline, the simplest thing to do is read the L subregister. If the rename engine knows that it has a consistent view of L, this gets forwarded faster. If not, perhaps there’s another interlock? I can’t imagine you’d actually want to move the whole 32 or 64 bits for an operation that uses, at most, the low 6 bits. This doesn’t seem very likely.

                    I wondered if there was some overflow behaviour, but both SHL and SHLX are specified to mask the shift operand. If SHL trapped on CL having more bits, I can imagine an SHLX implementation needing all of the bits in the register to check for this trap condition and then discarding the result if it’s the X variant.

                    My most likely guess is that Intel uses different-sized rename registers and has a pool of 16-bit ones. I’ve met a microarchitect who liked doing this: the area savings are usually small because the control logic is big for register rename and vector registers dwarf any other savings, but it might be that functional units that take 16-bit operands are next to a pool of rename registers for the H and L registers, and if you decode an E / R register instruction with the H or L variants live you insert a move from the small pool to the large pool.

                    If the 16-bit registers are close to the SHL pipeline and SHLX is reusing the pipeline, you have to get the register from further away, and simply getting a signal from one side of the core to the other is a multi-cycle operation at these speeds. If they’ve optimised layout for these functional units getting 16-bit values, that seems plausible.

                    Even without that, there may only be a single 64-bit path from the big rename file to the SHL unit, so you can fetch a 64-bit and 16-bit operand in parallel but need to fetch two 32-bit ones sequentially.

                    Three cycles is a bit odd, so my guess is that there’s actually one 16-bit path from 16-bit rename registers and one 32-bit path from the 32/64-bit file(s?). You can do the two halves of a shift in parallel so doing the two fetches of the first source in adjacent cycles still lets you dispatch one per cycle (and skip the second half of the fetch for the 32-bit version). If you need to fetch the shift operand from the 32/64-bit register file then it takes two cycles.

                    This is pure speculation and is the kind of thing I wouldn’t expect anyone to design, but if you’re favouring reuse (which you often do to reduce DV costs, which are easily 90% of CPU design costs) then I can kind of see how you’d end up there. Especially if the split rename file design is a speedup running mostly 32-bit code (and last time I ran Windows, it still shipped with a bunch of 32-bit system services).

                    1. 6

                      The idea of knowing that AH is zero was what I thought of until I saw that even this sequence (from up thread) was fast:

                              MOV RCX, -1
                              MOV RCX, RCX
                      

                      Which… clearly has non-zero AH bits…

                      And it’s not just the 32/64-bit register file read split, because non-immediate writes to 64 bits (like the second MOV above) are fast. It seems to be something to do with 64-bit immediate operand writes – maybe the sign extension to the high 32 bits causes an extra stall? But if that’s it, I’d expect to see it in many more places than just shifts…

                      Maybe an Intel person can chime in on the LLVM bug…

                      1. 3

                        I wouldn’t be surprised if all zeroes and all ones are both special cases, so writing -1 to RAX gives you a -1 AH and the canonical all-ones rename register (or even just the canonical all-ones rename register, which you don’t need to fetch because its ID is hard-coded). The other cases confuse me though.

                        1. 1

                          While MOV RCX, -1 gets handled at the renamer level, MOV RCX, RCX does not (moves with the same src/dst never get eliminated on Intel). The issue here seems definitely related to operand sources that come from the renamer instead of the output of an execution unit. Besides SHL(X)/SHR(X), BZHI also has the same behavior. Furthermore, the issue is not related to the shift amount: replacing SHLX RAX, RAX, RCX with SHLX RAX, RCX, RAX has the same behavior.

                          In particular, you can avoid immediates altogether and get the same behavior:

                                  XOR RCX, RCX
                                  INC RCX
                          

                          will get the performance hit because, on Alder Lake / Golden Cove, arithmetic with small operands gets handled at the renamer level. Yes, Golden Cove lets you run INC RCX; INC RCX; INC RCX; INC RCX; INC RCX; INC RCX; in a single cycle! But for some reason this optimization only works for 64-bit registers; if you use INC ECX it goes back to the old behavior.
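
                          For instance, something like this untested sketch should show the difference (cycle claims are as reported above, not verified here):

                                  ; Reportedly slow: 64-bit INC is handled at the renamer,
                                  ; so RCX reaches SHLX from the renamer rather than from
                                  ; the output of an execution unit
                                  XOR  RCX, RCX
                                  INC  RCX
                                  SHLX RAX, RAX, RCX

                                  ; Reportedly fast: 32-bit INC goes through an execution
                                  ; unit, avoiding the renamer-sourced operand
                                  XOR  ECX, ECX
                                  INC  ECX
                                  SHLX RAX, RAX, RCX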

                      2. [Comment removed by author]

                  2. 2

                    True. That argues that, whatever the root cause, it’s a mistake in the design.