1. 22

  2. 6

    The MSRC article mentioned in the comments is also extremely interesting: Building Faster AMD64 memset Routines. In particular, the post does a great job of explaining the performance results despite how opaque modern CPU features (cache lines, branch prediction, speculative execution, etc.) can be.

    Sidebar: I love the idea of optimizing at the scale of single instructions and yet having an effect on the total performance of the system.

    1. 1

      The automemcpy paper at ISMM this year was also interesting (and the code from it is now being merged into LLVM’s libc). The most surprising thing to me from both their work and Joe’s experiments on Windows was that 0 is one of the most common sizes for memcpy.

    2. 1

      If memset is a hot instruction (that is, it’s being called frequently), why didn’t that get moved to hardware? Something like an instruction telling memory to zero out a big, aligned, power-of-two chunk of memory?

      1. 3

        If memset is a hot instruction (that is, it’s being called frequently), why didn’t that get moved to hardware?

        Block copy and clear instructions have a long history in hardware; here’s a comp.arch thread that discusses some of the difficulties involved with implementing them.

        Also, the ARM instruction set was recently modified to add memcpy (CPYPT, CPYMT, CPYET) and memset (SETGP, SETGM, SETGE) instructions.

        1. 1

          Neat. I didn’t know about these new ARM instructions.

          That blog post mentions that they are also making NMIs a standard feature again.


        2. 2

          why didn’t that get moved to hardware?

          It did. See x86’s ‘rep’ family of instructions, ‘clzero’, etc. Rep has a high startup cost; it has gotten better but it is still not free. (Clzero is very specialized and lacks granularity.) The technique implemented by the linked post aims to improve performance of small-to-medium-sized memsets, where you can easily beat the hardware. The calculus is complicated by second-order effects on i$/btb (e.g. see here, sec. 4.4, and note that ‘cmpsb’ is never fast). My own implementation is slower at very small sizes, but ~half the size/branches of the linked version.

          Such small sizes are empirically very rare; but application-side specialization can nevertheless clean up the chaff. Dispense with ‘memset’ entirely and statically branch to memset_small, memset_medium, clear_page, etc. Overgenerality is the bane of performance. Compare the performance of malloc vs. your own purpose-built allocator, or dumb virtual dispatch vs. closed-world matching or inline-cached JIT (which amounts to the same thing).
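
          The static-dispatch idea could look something like this (all helper names are invented for the sketch; each trivially delegates to memset here, but the point is that each could be tuned for its size class, and that a compile-time-constant n folds the branches away entirely):

          ```c
          #include <stddef.h>
          #include <string.h>

          /* Placeholder specializations; in a real codebase each would be
           * a purpose-built routine rather than a wrapper around memset. */
          static void memset_small(void *p, size_t n)  { memset(p, 0, n); }
          static void memset_medium(void *p, size_t n) { memset(p, 0, n); }
          static void clear_page(void *p, size_t n)    { memset(p, 0, n); }

          /* With constant n, the compiler deletes two of the three paths. */
          static inline void clear_fixed(void *p, size_t n) {
              if (n <= 32)        memset_small(p, n);   /* tiny: overlapping stores */
              else if (n <= 4096) memset_medium(p, n);  /* medium: vector loop */
              else                clear_page(p, n);     /* large: rep stosb / NT stores */
          }
          ```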

          1. 1

            you can easily beat the hardware

            Fun fact: at one point, software popcnt could go faster than hardware popcnt.
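
            For reference, the classic SWAR kernel such software popcounts were built on (this is the well-known textbook version, not any particular library’s routine):

            ```c
            #include <stdint.h>

            /* Branch-free SWAR popcount: count bits within 2-, 4-, then 8-bit
             * fields in parallel, then sum all eight bytes with one multiply. */
            static unsigned popcount64(uint64_t x) {
                x = x - ((x >> 1) & 0x5555555555555555ULL);
                x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
                x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
                return (unsigned)((x * 0x0101010101010101ULL) >> 56);
            }
            ```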

            Here is another demonstration of the way specialization can improve performance vs generic solutions.

            1. 1

              rep is an interesting instruction, but I think I was not clear in my question. I was wondering why the option to clear out chunks of memory didn’t move to memory itself? Repeating something from the CPU still takes a lot of round trips on the memory bus, and latencies add up. If something is performance-critical, why not do it at the very edge, which in this case is the memory chip/board itself.

              1. 2

                Heh, everybody wants their problems to run directly on memory!

                1. ‘Shows up on a profiler’ ≠ ‘performance critical’. The complexity is just not worth it, especially as it is far-reaching (what are the implications for your cache-coherency protocol?).

                2. Again, the things that were showing up on the profiler were not bandwidth-limited; they fit comfortably in L1. Touching main memory at all would be extremely wasteful.

                3. There are some bandwidth-limited problems. The most obvious example is kernels needing to zero memory before handing it out to applications. But the performance advantage is not there; memory is written to many more times than it is mapped. DragonFly BSD reverted its idle-time zeroing.

                1. 2

                  DDR-whatever sticks are simple memory banks with no logic in them; you can’t move anything into them.

                  The memory controller is in your SoC (it was in the northbridge in the old times). Moving the memset operation just into the controller, I guess, doesn’t win much.

                  Now, this might make some sense if you move the memory controller to be remote again, talking over a higher-latency serial link (hello, IBM), I guess.

                  1. 1

                    You often don’t want this to move to the memory directly because you’re setting the contents of memory that you’re about to use or have just used. In either of those cases it either wants to be, or already is, in the cache. At a minimum, you’d need CPU instructions that invalidated the cache lines that were present and then told memory to set the pattern in a range.

                2. 1

                  I think rep stosq will work on x86. But that doesn’t mean it’s fast.
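
                  For the curious, a sketch of what that looks like with GNU inline assembly (x86-64 only; the portable fallback is my own addition so the sketch compiles everywhere):

                  ```c
                  #include <stdint.h>
                  #include <stddef.h>
                  #include <string.h>

                  /* Zero n 8-byte words. On x86-64 this is a single rep stosq:
                   * rdi = destination, rcx = count, rax = value to store. */
                  static void zero_words(uint64_t *dst, size_t n) {
                  #if defined(__x86_64__)
                      __asm__ volatile("rep stosq"
                                       : "+D"(dst), "+c"(n)
                                       : "a"(0ULL)
                                       : "memory");
                  #else
                      memset(dst, 0, n * sizeof *dst);   /* portable fallback */
                  #endif
                  }
                  ```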

                  1. 1

                    It is guaranteed to be quite fast if your cpuid has the ERMS flag (Enhanced REP MOVSB). That would be >=IvyBridge on the Intel side, and only >=Zen3 on AMD.
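
                    Checking for it at runtime is straightforward: ERMS is reported in CPUID leaf 7, subleaf 0, EBX bit 9 (a sketch; the non-x86 stub is just so it compiles everywhere):

                    ```c
                    #if defined(__x86_64__) || defined(__i386__)
                    #include <cpuid.h>

                    /* ERMS = CPUID.(EAX=7, ECX=0):EBX[bit 9] */
                    static int has_erms(void) {
                        unsigned a, b, c, d;
                        if (!__get_cpuid_count(7, 0, &a, &b, &c, &d))
                            return 0;
                        return (b >> 9) & 1;
                    }
                    #else
                    static int has_erms(void) { return 0; }   /* stub for non-x86 targets */
                    #endif
                    ```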