1. 51

  2. 15

    Modifying the microbenchmark by simply commenting out the 2nd memcpy shows that it’s not the 2 memcpys that make the “slow” path slow (compiled with gcc -O3):

    $ ./a.out 
    slow: 0.003500 microseconds per write
    fast: 0.000456 microseconds per write
    $
    

    Somewhat fantastically, that means that this line is the one responsible for the massive difference in time:

            const size_t part1 = min(MESSAGE_SIZE, BUFFER_SIZE - offset);
    
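
    For context, here’s roughly where that line sits in a “slow” two-part write. This is a minimal sketch with names and constant values of my own, not necessarily the article’s exact code:

    #include <stddef.h>
    #include <string.h>

    #define BUFFER_SIZE  (1u << 16)   /* assumed value, just for illustration    */
    #define MESSAGE_SIZE 32u          /* the fixed message size used in the post */

    /* "slow" wrap-around write: split the copy at the end of the buffer */
    static size_t slow_write(unsigned char *buffer, size_t offset,
                             const unsigned char *message)
    {
        const size_t remaining = BUFFER_SIZE - offset;
        const size_t part1 = MESSAGE_SIZE < remaining ? MESSAGE_SIZE : remaining;
        const size_t part2 = MESSAGE_SIZE - part1;      /* bytes that wrap to the start */
        memcpy(buffer + offset, message, part1);
        memcpy(buffer, message + part1, part2);         /* the 2nd memcpy; often copies 0 bytes */
        return (offset + MESSAGE_SIZE) % BUFFER_SIZE;   /* new write offset */
    }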

    I then tried compiling with clang (again with -O3), and:

    $ ./a.out 
    slow: 0.000000 microseconds per write
    fast: 0.000001 microseconds per write
    

    I suspect the loop gets optimised away entirely in this case (the buffer isn’t used afterwards, so why bother copying anything into it at all…).

    Playing around a little more shows some other interesting results: if I (with gcc) use the correct -march=haswell for my desktop and make the 2nd memcpy call conditional on whether it has anything to copy (if (part2)), the difference between fast and slow is significantly reduced:

    $ ./a.out 
    slow: 0.000809 microseconds per write
    fast: 0.000465 microseconds per write
    
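
    In other words, the only change there is guarding the second copy, something like:

    memcpy(buffer + offset, message, part1);
    if (part2)                                   /* skip the call entirely when nothing wraps */
        memcpy(buffer, message + part1, part2);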

    I think this goes to show the benchmark is bad; the fixed 32-byte message size is one part of the problem (compilers can convert memcpy into an unrolled operation using MMX or SSE instructions, for example). E.g. Godbolt shows the memcpy as simply:

            movaps  %xmm1, 64(%rsp,%rax)
            movaps  %xmm0, 80(%rsp,%rax)
    

    … but only if it doesn’t have to do the part1/part2 check. That explains why that case is so much faster. If the message were variable-sized, this wouldn’t happen.
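
    The “fast” path, by contrast, is just one unconditional fixed-size copy (again a sketch, not the article’s exact code), which is what those movaps stores correspond to:

    memcpy(buffer + offset, message, MESSAGE_SIZE);   /* always contiguous, no part1/part2 split */
    offset = (offset + MESSAGE_SIZE) % BUFFER_SIZE;   /* only the index wraps, not the copy */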

    1. 4

      The magic is mostly here:

      q->buffer = mmap(NULL, 2*size, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);    /* reserve 2*size of address space */
      ...
      mmap(q->buffer, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, q->fd, 0);    /* map the file over the first half */
      ...
      mmap(q->buffer+size, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, q->fd, 0);  /* map it again over the second half */
      

      Where 2*size of address space is reserved for the buffer, then both halves are mmap’d to the same file. This allows semantics like buffer[start + index] = 'x', without the need for bounds checking.
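
      Concretely, the queue operations can then look something like this (a sketch; field names partly follow the snippets quoted in this thread, the rest is my guess at the shape, not the article’s exact code):

      #include <stddef.h>
      #include <string.h>

      typedef struct {
          unsigned char *buffer;   /* 2*size bytes of address space, both halves backed by fd */
          size_t size, head, tail;
          int fd;
      } queue;

      /* Writes and reads never split: the copy may run past `size` into the mirrored half. */
      void queue_write(queue *q, const void *data, size_t len) {
          memcpy(q->buffer + q->tail, data, len);
          q->tail = (q->tail + len) % q->size;   /* only the index wraps */
      }

      void queue_read(queue *q, void *data, size_t len) {
          memcpy(data, q->buffer + q->head, len);
          q->head = (q->head + len) % q->size;
      }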

      The article claims 3x speedup on write.

      In some sense, this is basically saying that the operating system is faster at duplicating writes than a program doing modular arithmetic and bounds checking?

      1. 6

        this is basically saying that the operating system is faster at duplicating writes than

        I don’t think this is duplicating writes - the mmap calls are just aliasing, right? The fundamental magic is what q->fd points to, that’s the only real memory allocated.

        Eg. this:

        q->fd = memfd_create("queue_buffer", 0);   /* anonymous in-memory file */
        ftruncate(q->fd, size);                    /* give it `size` bytes of backing */
        

        Is the only real, actual allocation that happens. The mmap calls then set up some messed-up virtual memory mappings, where both the address at q->buffer and the address at q->buffer + size actually point to the same real memory.

        Once you get to the end of the first mapping, at q->buffer + size - 1, and you write at q->buffer + size, what actually happens is that you just get teleported back to the beginning of q->fd, magically handling the wrap-around.
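
        You can see the aliasing directly with a tiny standalone test (a sketch; error handling omitted):

        #define _GNU_SOURCE              /* for memfd_create */
        #include <assert.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void) {
            size_t size = (size_t)sysconf(_SC_PAGESIZE);   /* mappings must be page-aligned */
            int fd = memfd_create("demo", 0);
            ftruncate(fd, (off_t)size);

            char *buf = mmap(NULL, 2 * size, PROT_NONE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            mmap(buf,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
            mmap(buf + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);

            buf[size] = 'x';           /* write just past the "end" of the first half... */
            assert(buf[0] == 'x');     /* ...and it shows up at the beginning too */
            return 0;
        }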

        I would guess the reason this is faster is that it’s hardware-optimized, thanks to the https://en.wikipedia.org/wiki/Translation_lookaside_buffer

        Which raises the question: is this actually faster when a real program has lots of other things contending for the TLB?

        Either way, super cool. Seems like a sure-fire way to find kernel bugs!

        Edit: Re-reading your comment, realizing you’re saying the same thing I’m saying, and I think I’m just misunderstanding what you mean by “duplicating” writes.

        1. 4

          The magic is mostly here

          Yes

          In some sense, this is basically saying that the operating system is faster at duplicating writes than a program doing modular arithmetic and bounds checking?

          The idea is that once the relevant pages are faulted in, the operating system isn’t involved and the MMU’s address translation hardware does it all.

          The fact that the microbenchmark at the end of the post doesn’t actually use the queue is a liiiiitle suspicious. But queues are inherently very difficult to benchmark.

          Edit: no wait, it’s multi-threaded queues that are difficult to benchmark. (Because tiny changes in the average balance between producer and consumer threads can wildly change perf.) No reason why single-threaded queues shouldn’t be benchmarked.

          1. 1

            I imagine most of the speedup comes from being able to use memcpy() instead of a for loop for copying the data in and out of the queue.

            1. 3

              The “slow” implementation that is ultimately compared does use memcpy instead of a for loop. They calculate the split point with a single modular-arithmetic operation, then do two memcpys.

              Here’s the relevant code snippet:

              memcpy(data,         q->buffer + q->head, part1);   /* up to the end of the buffer */
              memcpy(data + part1, q->buffer,           part2);   /* the wrapped remainder (often 0 bytes) */
              
          2. 3

            Mod is always slow. How does this compare with using power-of-two-sized buffers and masking indices with a bitwise AND?
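
            For reference, that looks something like this (assuming the size is a power of two):

            #include <stddef.h>

            enum { SIZE = 4096, MASK = SIZE - 1 };   /* SIZE must be a power of two */

            /* i & MASK is equivalent to i % SIZE, but compiles to a single AND instead of a division */
            static size_t wrap(size_t i) { return i & MASK; }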

            1. 2

              I imagine much faster, if you’re writing more than a byte at a time. Not only do you not have to switch to a slower mode when writing near the “end” of the buffer, but you also get to skip the logic to determine when to switch to slow mode.

            2. 3

              Less magic, but some might also like Bip Buffers.

              Supports differently sized chunks without splitting (splitting can be costly), and allows reading and writing at the same time with simple atomics.
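
              For anyone unfamiliar, here’s a single-threaded sketch of a bip buffer’s write side; a concurrent version would make the indices atomic, and the read side (consuming region A and then promoting B to be the new A) is omitted for brevity:

              #include <stdbool.h>
              #include <stddef.h>

              typedef struct {
                  unsigned char buf[4096];
                  size_t a_start, a_end;   /* primary region   [a_start, a_end)              */
                  size_t b_end;            /* secondary region [0, b_end), used after wrap   */
                  bool   writing_b;        /* have writes wrapped into the secondary region? */
              } bip;

              /* Reserve `len` contiguous bytes for writing; returns NULL if they don't fit.
                 The chunk is never split across the wrap point. */
              static unsigned char *bip_reserve(bip *q, size_t len) {
                  if (q->writing_b)                        /* already wrapped: grow B toward A */
                      return (q->a_start - q->b_end >= len) ? q->buf + q->b_end : NULL;
                  if (sizeof q->buf - q->a_end >= len)     /* room left after region A? */
                      return q->buf + q->a_end;
                  if (q->a_start >= len) {                 /* otherwise wrap and start region B */
                      q->writing_b = true;
                      return q->buf;
                  }
                  return NULL;
              }

              /* Commit `len` bytes previously written into the reserved area. */
              static void bip_commit(bip *q, size_t len) {
                  if (q->writing_b) q->b_end += len;
                  else              q->a_end += len;
              }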