1. 7

  2. 1

    Not mentioned is the utilization of the cache.

    not a very fast matrix multiplication, then :P

    1. 1

      Well, it’s nearly identical reasoning to register saturation (modulo the often inscrutable but largely benign eviction policies of varying hardware), so I felt it was fair to leave it as an exercise for the reader :)

      1. 1

        On a single core, a good matrix product is 99% cache management, 1% parallelism. The post does its readers a disservice by minimising this point.

        1. 1

          Citation?

          1. 1

            Personal experience… :P

            Here’s a simple example. I get the following timings:

            • naive: 5.742471
            • simple vectorised: 0.150027
            • simple local: 0.316796
            • blas singlethreaded: 0.038781

            ‘Simple local’ just does the same loop transposition as the simple vectorised approach.
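
            For context, here’s roughly what the ‘naive’ and ‘simple local’ variants look like (a sketch, not the exact code I timed; function names are just for the sketch, and it assumes row-major doubles with c zeroed for the second one):

            ```c
            #include <stddef.h>

            /* naive: the inner loop strides down a column of b, so every
             * step of k touches a fresh cache line */
            void mm_naive(size_t n, const double *a, const double *b, double *c) {
                for (size_t i = 0; i < n; i++)
                    for (size_t j = 0; j < n; j++) {
                        double s = 0.0;
                        for (size_t k = 0; k < n; k++)
                            s += a[i*n + k] * b[k*n + j];
                        c[i*n + j] = s;
                    }
            }

            /* ‘simple local’: same arithmetic, but with the j and k loops
             * swapped so b and c are both walked contiguously; this is the
             * loop transposition the vectorised version uses too
             * (c must be zeroed beforehand) */
            void mm_local(size_t n, const double *a, const double *b, double *c) {
                for (size_t i = 0; i < n; i++)
                    for (size_t k = 0; k < n; k++) {
                        double aik = a[i*n + k];
                        for (size_t j = 0; j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
            }
            ```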

            1. 1

              Results for me (writeup below):

              • naive: 2.381266
              • simple vectorized: 0.346979
              • 12 registers used: 0.180736
              • 15 registers used: 0.168493
              • numactl -C2 blas: 0.158050

              You might want to double check that blas is running single threaded (maybe use numactl or taskset) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s, or ~15gflops. If you have an avx512 machine then it makes sense, but you should be using the 512 intrinsics for comparison :). Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.
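
              Concretely, the fmadd change at the intrinsics level is tiny (a sketch; compile with -mfma):

              ```c
              #include <immintrin.h>

              /* mul + add: two instructions, two dependent uops, two roundings */
              static inline __m256d update_muladd(__m256d acc, __m256d a, __m256d b) {
                  return _mm256_add_pd(acc, _mm256_mul_pd(a, b));
              }

              /* fused multiply-add: one instruction, one rounding, and roughly
               * twice the throughput on typical avx2 cores */
              static inline __m256d update_fma(__m256d acc, __m256d a, __m256d b) {
                  return _mm256_fmadd_pd(a, b, acc);
              }
              ```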

              Disregarding the FMA instruction, the simple vectorized code doesn’t quite capture everything in the animations.

              You’ll want to use more registers in the inner loop. Right now your code uses a single broadcast register (the red matrix in the animation) and then immediately loads and stores partial sums of the blue matrix. We’ve got 16 registers to work with, and we’re using ~2-3, so there’s some speed to be had. That’s what the last animation is all about.

              Instead, we should load some red registers and some blue registers, compute their outer product (in yellow registers) and then write that out to memory. This can be done by loading 4 times from the red matrix (broadcasting to 4 registers), 2 times from the blue matrix (contiguous loads) and then storing the outer product in 8 registers. This adds up to 14 registers, which is just shy of the 16 register limit that we have for avx2. That code can be found below in the function called “mm_12_registers.” It runs in about 0.180736s, or 12gflops per core.
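
              In case the paste dies, the inner kernel looks roughly like this (a reconstructed sketch, so treat the details as approximate; assumes row-major doubles, n divisible by 4 and 8, and -mfma):

              ```c
              #include <immintrin.h>
              #include <stddef.h>

              /* 4x8 outer-product micro-kernel: 4 broadcasts from the red matrix,
               * 2 contiguous loads from the blue one, 8 running sums = 14 named
               * registers (gcc can fold the contiguous loads into the fma memory
               * operand, which is where “12” comes from) */
              void mm_12_registers(size_t n, const double *a, const double *b, double *c) {
                  for (size_t i = 0; i < n; i += 4)
                      for (size_t j = 0; j < n; j += 8) {
                          __m256d c00 = _mm256_setzero_pd(), c01 = _mm256_setzero_pd();
                          __m256d c10 = _mm256_setzero_pd(), c11 = _mm256_setzero_pd();
                          __m256d c20 = _mm256_setzero_pd(), c21 = _mm256_setzero_pd();
                          __m256d c30 = _mm256_setzero_pd(), c31 = _mm256_setzero_pd();
                          for (size_t k = 0; k < n; k++) {
                              __m256d b0 = _mm256_loadu_pd(&b[k*n + j]);        /* contiguous */
                              __m256d b1 = _mm256_loadu_pd(&b[k*n + j + 4]);
                              __m256d a0 = _mm256_broadcast_sd(&a[(i+0)*n + k]); /* broadcasts */
                              __m256d a1 = _mm256_broadcast_sd(&a[(i+1)*n + k]);
                              __m256d a2 = _mm256_broadcast_sd(&a[(i+2)*n + k]);
                              __m256d a3 = _mm256_broadcast_sd(&a[(i+3)*n + k]);
                              c00 = _mm256_fmadd_pd(a0, b0, c00); c01 = _mm256_fmadd_pd(a0, b1, c01);
                              c10 = _mm256_fmadd_pd(a1, b0, c10); c11 = _mm256_fmadd_pd(a1, b1, c11);
                              c20 = _mm256_fmadd_pd(a2, b0, c20); c21 = _mm256_fmadd_pd(a2, b1, c21);
                              c30 = _mm256_fmadd_pd(a3, b0, c30); c31 = _mm256_fmadd_pd(a3, b1, c31);
                          }
                          _mm256_storeu_pd(&c[(i+0)*n + j],     c00);
                          _mm256_storeu_pd(&c[(i+0)*n + j + 4], c01);
                          _mm256_storeu_pd(&c[(i+1)*n + j],     c10);
                          _mm256_storeu_pd(&c[(i+1)*n + j + 4], c11);
                          _mm256_storeu_pd(&c[(i+2)*n + j],     c20);
                          _mm256_storeu_pd(&c[(i+2)*n + j + 4], c21);
                          _mm256_storeu_pd(&c[(i+3)*n + j],     c30);
                          _mm256_storeu_pd(&c[(i+3)*n + j + 4], c31);
                      }
              }
              ```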

              But why is it called 12 registers when I clearly use 14? Well, avx2 fma instructions actually have the ability to use a different port on the CPU and pull one operand directly from memory. So gcc should emit instructions that only use 12 registers (if we massage it right… I prefer using asm directly for this kinda stuff).

              That means we have 4 extra registers, which we can almost use up if we load 5 broadcast red matrix values and keep 10 running outer products. That can be found in mm_15_registers, which runs in 0.168493s for me (12.7gflops).

              The remaining ~15% of performance is certainly cache management. The trick is to write intermediate results into a local buffer to do the exact same “reuse logic” we just did with registers, but with real memory.
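
              A sketch of that idea (the tile size is a placeholder you’d tune to your L1/L2; assumes n divisible by the block sizes and c zeroed):

              ```c
              #include <stddef.h>

              /* placeholder block sizes; tune so a KB x JB tile of b fits in L1/L2
               * (64*64 doubles = 32KB here) */
              enum { KB = 64, JB = 64 };

              /* copy a KB x JB tile of b into a small local buffer, then sweep every
               * row of a against it: the tile is pulled from main memory once and
               * reused n times from cache - the same reuse trick as the registers */
              void mm_blocked(size_t n, const double *a, const double *b, double *c) {
                  double tile[KB * JB];
                  for (size_t k0 = 0; k0 < n; k0 += KB)
                      for (size_t j0 = 0; j0 < n; j0 += JB) {
                          for (size_t k = 0; k < KB; k++)      /* pack the tile */
                              for (size_t j = 0; j < JB; j++)
                                  tile[k*JB + j] = b[(k0+k)*n + j0 + j];
                          for (size_t i = 0; i < n; i++)       /* reuse it for every row of a */
                              for (size_t k = 0; k < KB; k++) {
                                  double aik = a[i*n + k0 + k];
                                  for (size_t j = 0; j < JB; j++)
                                      c[i*n + j0 + j] += aik * tile[k*JB + j];
                              }
                      }
              }
              ```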

              Code: http://ix.io/3YjF

              1. 1

                use more registers in the inner loop

                Yes … definitely more performance to be had. I just wanted to show that having good locality is what makes the biggest difference, vs naive traversal order.

                Also, you should use fmadd, which is around 2x faster than mul+add as separate instructions.

                2x better throughput, comparable latency. Which sort of proves my point: the simple vectorised code I showed is mainly limited by latency, and runs at the same speed if I use fma.
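
                To make the latency point concrete, here’s the usual single-chain vs multi-chain picture (a generic sketch, not my matmul loop; assumes n divisible by 16):

                ```c
                #include <immintrin.h>
                #include <stddef.h>

                /* one accumulator: each fma depends on the previous one, so the
                 * loop runs at fma *latency*, not throughput */
                double dot_1acc(size_t n, const double *x, const double *y) {
                    __m256d acc = _mm256_setzero_pd();
                    for (size_t i = 0; i < n; i += 4)
                        acc = _mm256_fmadd_pd(_mm256_loadu_pd(x+i), _mm256_loadu_pd(y+i), acc);
                    double out[4]; _mm256_storeu_pd(out, acc);
                    return out[0] + out[1] + out[2] + out[3];
                }

                /* four independent accumulators: enough chains to cover the
                 * latency, so both fma ports actually stay busy */
                double dot_4acc(size_t n, const double *x, const double *y) {
                    __m256d a0 = _mm256_setzero_pd(), a1 = _mm256_setzero_pd();
                    __m256d a2 = _mm256_setzero_pd(), a3 = _mm256_setzero_pd();
                    for (size_t i = 0; i < n; i += 16) {
                        a0 = _mm256_fmadd_pd(_mm256_loadu_pd(x+i),    _mm256_loadu_pd(y+i),    a0);
                        a1 = _mm256_fmadd_pd(_mm256_loadu_pd(x+i+4),  _mm256_loadu_pd(y+i+4),  a1);
                        a2 = _mm256_fmadd_pd(_mm256_loadu_pd(x+i+8),  _mm256_loadu_pd(y+i+8),  a2);
                        a3 = _mm256_fmadd_pd(_mm256_loadu_pd(x+i+12), _mm256_loadu_pd(y+i+12), a3);
                    }
                    __m256d s = _mm256_add_pd(_mm256_add_pd(a0, a1), _mm256_add_pd(a2, a3));
                    double out[4]; _mm256_storeu_pd(out, s);
                    return out[0] + out[1] + out[2] + out[3];
                }
                ```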

                double check that blas is running single threaded (maybe use numactl or taskset) - 55gflops with double precision is extremely fast for avx2 code. My machine’s blas runs in 0.157807s, or ~15gflops. If you have an avx512 machine then it makes sense, but you should be using the 512 intrinsics for comparison :)

                I have avx2 (no avx512, sadly) and I’m definitely running on one thread. I get the same result with numactl -Ck as with the envvar, and a much faster result if I leave off any thread limiting. blas on one thread is 2.3x faster than your mm_15_registers. Is your blas specialised to your uarch?
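
                (For anyone following along: you can also pin the thread count from inside the program rather than via the environment. A sketch, assuming OpenBLAS specifically; other blases have their own knobs.)

                ```c
                #include <cblas.h>

                /* OpenBLAS-specific; declared in its cblas.h on most installs */
                extern void openblas_set_num_threads(int num_threads);

                void dgemm_one_thread(int n, const double *a, const double *b, double *c) {
                    openblas_set_num_threads(1);  /* force a single thread for the timing */
                    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
                }
                ```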

                Anyway, call it ~28 fma per ns. That is 4 elements per vector×2 fma per cycle×3.5 cycles per ns, and at 2 flops per fma that comes to the ~55gflops you mention; seems about right. I think my CPU boosts to 4.2ghz or so, so that leaves space for housekeeping.

                write intermediate results into a local buffer to do the exact same “reuse logic” we just did with registers, but with real memory

                A big part of the advantage from multi-level blocking comes from reducing cache pollution.

                massage it right… I prefer using asm directly for this kinda stuff

                Yeah, C is annoying. But slightly less hassle for a quick hack.