1. 20

  2. 4

    A better algorithm is no longer enough to get top performance; your program needs to join in the dance of the million-transistor accelerators. Anybody who insists C is close to the machine is, at best, deluded.

    Great stuff. This “swap_if” idea is something I’m going to try applying to some of my code.

    1. 1

      Yes. C is close to assembly, but assembly is a pretty fat abstraction layer around what the CPU is doing.

      At the same time, I don’t think most people want to program pipeline bubbles and cache hierarchies directly.

    2. 2

      Back in 2000, AMD included cmov in its 64-bit x86 ISA extensions. Then, Intel had to adopt them when Itanium flopped.

      The first sentence is technically true, in the sense that AMD64 did include cmov, but the instruction was originally introduced in 1996 by Intel with the Pentium Pro, and became widespread with the Pentium II.

      1. 11

        The Alpha also had a conditional move. RISC-V doesn’t have one. It’s a very interesting microarchitectural trade-off.

        On a low-end core, a conditional move can require an extra port on the register file (this was why it was such a large overhead on the Alpha): it requires you to read three values: the condition (cheap if you have condition codes, a full register if you don’t), the value of the source register to be conditionally moved, and the value of the destination register that may need to be written back if the condition is false.

        On a high-end core, you can fold a lot of the behaviour into register rename and you already have a lot of read ports on rename registers so conditional move isn’t much overhead.

        In both cases, conditional move has a huge effect on the total amount of branch predictor state that you need. You can get away with significantly less branch predictor state with a conditional move than without and get the same performance - a modern compiler can transform a phenomenal number of branches into conditional moves. The total amount varies between pipeline designs (it’s been years since I measured this, but I think it was about 25% less on a simple in-order pipeline, more on an out-of-order one).

        Once you have sufficient speculative execution that branch predictor performance becomes important, conditional move becomes incredibly important for whole-system performance. For x86 chips, it would probably start to make sense around the Pentium, given a modern compiler. Compilers in the ’90s were much less good at if conversion than they are now, so cmov may not have made as much difference when it was introduced in the Pentium Pro as it would on an equivalent pipeline today.

        Arm went all-in on predication from the first processor and so managed without any branch prediction for much longer than other processors. It’s much easier for a compiler to do if conversion if every instruction is conditional than if you just have a conditional move. AArch64 dialled this back because compilers are now much better at taking advantage of conditional move and microarchitects really hate predicated loads (compiler writers, in contrast, really love them, especially for dynamic languages).

        1. 2

          Or the fact AMD64 came out a few years later than that…

        2. 1

          What’s old is new again, I guess.

          I have a copy of Michael Abrash’s Zen of Code Optimization, filled with anecdotes and optimization techniques similar to this, but down to the assembly level for the 386, 486, and (then new) original Pentium. A lot has changed, but the basic idea still applies: more speed can be squeezed out by thinking outside the box and writing clever code when necessary.

          Always interesting to read about, even if I rarely need to do it.

          1. 1

            Compilers can do what even the best programmers can’t when it comes to optimization, and in most cases you have other priorities than spending time on minute optimizations.