1. 32

  2. 2

    As memory (hah) serves, this is useful when doing low-latency work like signal and audio processing.

    1. 1

      Circular buffers are definitely used all the time in audio, but actually not for the performance – they’re used because they’re lock-free as long as there’s only one writer and one reader. Any sort of blocking operation in the audio/realtime thread is verboten, so circular buffers provide safe ways of message passing or buffering audio to/from other threads.
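      Sketching that single-writer/single-reader pattern with C11 atomics (names and buffer size here are illustrative, not from any particular audio library):

      ```c
      #include <assert.h>
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      #define RB_SIZE 1024 /* power of two so wrap-around is a mask */

      typedef struct {
          uint8_t data[RB_SIZE];
          _Atomic size_t head; /* written only by the producer */
          _Atomic size_t tail; /* written only by the consumer */
      } ringbuf;

      /* Producer side: returns false if the buffer is full. Never blocks. */
      static bool rb_push(ringbuf *rb, uint8_t byte) {
          size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
          size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
          if (head - tail == RB_SIZE)
              return false; /* full */
          rb->data[head & (RB_SIZE - 1)] = byte;
          atomic_store_explicit(&rb->head, head + 1, memory_order_release);
          return true;
      }

      /* Consumer side: returns false if the buffer is empty. Never blocks. */
      static bool rb_pop(ringbuf *rb, uint8_t *out) {
          size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
          size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
          if (head == tail)
              return false; /* empty */
          *out = rb->data[tail & (RB_SIZE - 1)];
          atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
          return true;
      }
      ```

      Neither side ever takes a lock or blocks, which is what makes this safe to call from a realtime audio callback.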

    2. 2

      Instead of memfd_create() you can use the POSIX standard shm_open(), so

      memfd_create("queue_region", 0)

      becomes

      shm_open("queue_region", O_RDWR|O_CREAT, 0600)

      Add ‘-lrt’ to your LDFLAGS and remember to shm_unlink() it when you’re done. Everything else stays the same, including the performance.
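      A sketch of what the shm_open() setup might look like (the helper name is mine; note that POSIX additionally wants shared-memory names to begin with a slash for portability):

      ```c
      #include <assert.h>
      #include <err.h>
      #include <fcntl.h>
      #include <stddef.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Create (or open) a named shared-memory region and map it. */
      static void *map_queue_region(const char *name, size_t size) {
          int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
          if (fd < 0) err(1, "shm_open");
          if (ftruncate(fd, (off_t)size) < 0) err(1, "ftruncate");
          void *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (map == MAP_FAILED) err(1, "mmap");
          close(fd); /* the mapping stays valid after the fd is closed */
          return map;
      }
      ```

      When you’re finished, munmap() the region and shm_unlink() the name. (On newer glibc shm_open lives in libc itself, so ‘-lrt’ is only needed on older systems.)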

      1. 2

        I vaguely recall it being less effort to simply open /dev/zero and use a private mmap()ing of that.

        Of course if you are using this as an IPC between two processes you’ll have to use a regular file.

        1. 1

          I don’t think a private map would work here. From the mmap(2) man page on MAP_PRIVATE:

          “Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.”

          1. 1

            Meant to say “use a regular file with MAP_SHARED”, good catch. :)

          2. 1

            It doesn’t seem like you can mmap /dev/zero. I get ENODEV “Operation not supported by device” when I try. (macOS)

            Edit, showing my work:

            #include <err.h>
            #include <fcntl.h>
            #include <stdlib.h>
            #include <sys/mman.h>
            int main() {
                int fd = open("/dev/null", O_RDWR);
                if (fd < 0) err(1, "open");
                void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
                if (map == MAP_FAILED) err(1, "mmap");
                return 0;
            }
            1. 1

              You’re confusing /dev/null with /dev/zero.

              1. 1

                Oops, I had used /dev/zero when I first tried it then accidentally swapped it for /dev/null when I came back to give some code. Either way, the result is the same: ENODEV.

                1. 1

                  Must be some macOS-specific breakage, because it works on Linux.

        2. 1

          I wish I understood how all this worked. Why exactly is a mod operation slow? Why exactly is it faster to do this via page tables? Is it because the kernel is already doing this and it effectively requires zero additional work? Is it because the CPU can handle this in hardware?

          I guess I’ve got some research to do.

          1. 4

            Mod isn’t super slow, but you can avoid mod entirely without the fancy page tricks by defining your buffer to be a power of 2. For example, a 4KiB buffer is 4096 = 2^12, so you can calculate the wrap-around with ( cur + len ) & 4095 without using mod.

            You would still need two separate memcpy()s, and a branch for the wrap-around / non-wrap-around cases (which is normally not a big deal, except when you’re racing against the highly optimized caching hardware in your MMU…)
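            The mask trick above, sketched (4 KiB is just the example size; any power of two works):

            ```c
            #include <assert.h>
            #include <stddef.h>

            #define BUF_SIZE 4096 /* must be a power of two for the mask to work */

            /* Advance an index by len, wrapping with AND instead of mod. */
            static size_t wrap(size_t cur, size_t len) {
                return (cur + len) & (BUF_SIZE - 1);
            }
            ```

            For example, wrap(4000, 200) and (4000 + 200) % 4096 both give 104, but the mask compiles to a single AND instruction.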

            1. 3

              Branches (conditionals such as if/switch statements) can cause performance problems, so if you can structure things to avoid them you can get a considerable bump in speed.

              A lot of people look to software tricks to pull off speedups but this particular data structure can benefit directly from calling upon hardware baked into the CPU (virtual memory mapping).

              Most of the time you have a 1:1 mapping of a 4 KiB contiguous physical memory block to a single virtual 4 KiB page. This is not the only configuration though: you can have multiple virtual memory pages mapping back to the same physical memory block, most commonly seen as a way to save RAM when using shared libraries.

              This 1:N mapping technique can also be used for a circular buffer.

              So you get your software to ask the kernel to configure the MMU to duplicate the mapping of your buffer (page-aligned and page-sized!) immediately after the end of the initial allocation.

              Now when you are at 100 bytes short of the end of your 4kB circular buffer and you need to write 200 bytes you can just memcpy()-like-a-boss and ignore the problem of having to split your writes into two parts. Meanwhile your offset incrementer remains simply:

              offset = (offset + writelen) % 4096

              So the speedup comes from:

              • removing the conditionals necessary to handle writes that exceed the end of the buffer
              • doing a single longer write, rather than two smaller ones

              So it is not really that the CPU is handling this in hardware and is therefore faster; the hardware is doing no more work than it was before. The performance comes more from a duck-lining-up exercise.
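              A sketch of that double-mapping setup, using memfd_create() on Linux as mentioned upthread (the helper name is mine and error handling is minimal):

              ```c
              #define _GNU_SOURCE
              #include <assert.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              /* Map `size` bytes of fd twice, back to back, so a write that
               * runs past the end of the first copy lands at the start of
               * the buffer. `size` must be a multiple of the page size. */
              static char *mirror_map(int fd, size_t size) {
                  /* Reserve 2*size of contiguous address space... */
                  char *base = mmap(NULL, 2 * size, PROT_NONE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (base == MAP_FAILED)
                      return NULL;
                  /* ...then map the same pages over both halves. */
                  if (mmap(base, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED ||
                      mmap(base + size, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_FIXED, fd, 0) == MAP_FAILED)
                      return NULL;
                  return base;
              }
              ```

              With this mapping, the 200-byte write from the example is one plain memcpy() starting 100 bytes before the end; the second virtual copy of the first page catches the overflow.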

              1. 2

                Modulo and division (usually one operation) are much slower than the other usual integer operations like addition and subtraction (which are the same thing), though I’m not sure I can explain why in detail. Fortunately, for division or modulo by powers of two, right shift (>>) and AND (&) can be used instead.

                For why doing this with paging is so efficient, it is because the MMU (part of the CPU) does the translation between virtual and physical addresses directly in hardware. The kernel just has to set up the page tables to tell the MMU how it should do so.
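                A tiny sketch of the substitution (unsigned values only; signed division rounds toward zero and needs more care):

                ```c
                #include <assert.h>
                #include <stdint.h>

                /* For unsigned x and a power of two (1u << k):
                 *   x / (1u << k) == x >> k
                 *   x % (1u << k) == x & ((1u << k) - 1)
                 */
                static uint32_t div_pow2(uint32_t x, unsigned k) { return x >> k; }
                static uint32_t mod_pow2(uint32_t x, unsigned k) { return x & ((1u << k) - 1); }
                ```

                Compilers already do this automatically when the divisor is a compile-time constant power of two, so the explicit form mostly matters when the divisor is chosen at runtime.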

              2. 1

                The mmap voodoo seems a bit weird, though I don’t know the exact semantics we’re aiming for. Some of those calls would appear redundant or gratuitous.

                1. 1

                  Take note of the difference in offset being used.