1. 29
  1.  

  2. 8

    That is, until you need to read many files and could therefore reuse buffers: https://blog.burntsushi.net/ripgrep/

    1. 7

      Yeah, basically what I’ve noticed is that if you’re searching a lot of tiny files in parallel, then the overhead of memory mapping becomes significant. I also found that this differs across platforms, but it’s been a while since I tested that.

      Also, I’ve noticed more recently that, at least when comparing “ripgrep with memory maps” and ag (which uses memory maps by default and has no efficient non-mmap implementation), ripgrep can be stunningly slower:

      $ time rg PM_RESUME --no-mmap | wc -l
      19
      
      real    0.148
      user    0.865
      sys     0.715
      maxmem  11 MB
      faults  0
      
      $ time rg PM_RESUME --mmap | wc -l
      19
      
      real    1.568
      user    1.975
      sys     4.229
      maxmem  38 MB
      faults  0
      
      $ time ag PM_RESUME | wc -l
      19
      
      real    0.697
      user    1.010
      sys     2.401
      maxmem  31 MB
      faults  0
      

      So with memory maps enabled, ripgrep is an order of magnitude slower than with them disabled, and twice as slow as ag. Looking at the syscalls shows some really interesting differences, but I don’t really grok it yet and haven’t dug into it:

      $ strace -c -f rg PM_RESUME --no-mmap | wc -l
      19
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ------------------
       50.72   17.066187          58    291604           read
       15.38    5.174976        1285      4025      1108 futex
       13.22    4.449014          59     75103         1 openat
       12.82    4.313064          57     75102           close
        4.35    1.463780          61     23667     23332 statx
        2.41    0.811730          86      9427           getdents64
        0.82    0.276151          58      4719           fstat
        0.20    0.068239        1240        55           clock_nanosleep
        0.05    0.015889          47       332           mprotect
        0.00    0.001034          26        39           sigaltstack
        0.00    0.000699          58        12           clone
        0.00    0.000694          53        13           getrandom
        0.00    0.000680          56        12           write
        0.00    0.000616           9        64           mmap
        0.00    0.000573          15        37           rt_sigprocmask
        0.00    0.000350          12        29        28 ioctl
        0.00    0.000257           8        29           sched_getaffinity
        0.00    0.000244          10        24           munmap
        0.00    0.000199          19        10           brk
        0.00    0.000150          11        13           set_robust_list
        0.00    0.000118           9        13           madvise
        0.00    0.000107         107         1           mremap
        0.00    0.000045           6         7           rt_sigaction
        0.00    0.000014           7         2           prlimit64
        0.00    0.000008           8         1           set_tid_address
        0.00    0.000007           3         2         1 arch_prctl
        0.00    0.000005           5         1           getcwd
        0.00    0.000000           0         5           pread64
        0.00    0.000000           0         1         1 access
        0.00    0.000000           0         1           execve
      ------ ----------- ----------- --------- --------- ------------------
      100.00   33.644830          69    484350     24471 total
      
      $ strace -c -f rg PM_RESUME --mmap | wc -l
      19
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ------------------
       18.21    7.336984          78     93780     23332 statx
       18.00    7.251364         103     70106           munmap
       15.39    6.200287          88     70146           mmap
       15.20    6.121962        1570      3897       984 futex
       14.93    6.013308          80     75103         1 openat
       14.48    5.834280          77     75102           close
        2.57    1.034630         109      9427           getdents64
        0.90    0.362139          76      4719           fstat
        0.13    0.051224         868        59           clock_nanosleep
        0.12    0.046572          76       609           read
        0.06    0.026085          73       354           mprotect
        0.00    0.001083          83        13           madvise
        0.00    0.000950          79        12           write
        0.00    0.000817          20        39           sigaltstack
        0.00    0.000507          42        12           clone
        0.00    0.000479          16        29           sched_getaffinity
        0.00    0.000383          13        29        28 ioctl
        0.00    0.000373          10        37           rt_sigprocmask
        0.00    0.000169          13        13           set_robust_list
        0.00    0.000077          77         1           mremap
        0.00    0.000062           6        10           brk
        0.00    0.000026          26         1           getcwd
        0.00    0.000014           1        13           getrandom
        0.00    0.000000           0         7           rt_sigaction
        0.00    0.000000           0         5           pread64
        0.00    0.000000           0         1         1 access
        0.00    0.000000           0         1           execve
        0.00    0.000000           0         2         1 arch_prctl
        0.00    0.000000           0         1           set_tid_address
        0.00    0.000000           0         2           prlimit64
      ------ ----------- ----------- --------- --------- ------------------
      100.00   40.283775          99    403530     24347 total
      
      $ strace -c -f ag PM_RESUME | wc -l
      19
      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ------------------
       16.01    2.362632          25     93446     18590 openat
       14.78    2.180154          31     69836           munmap
       12.90    1.903322          25     74625        33 stat
       12.49    1.843211          24     74858           close
       12.30    1.814280          24     74859           fstat
       12.18    1.796569          25     69910           mmap
       11.90    1.755856          25     69828           madvise
        4.93    0.726686         752       966        85 futex
        2.37    0.349085          37      9427           getdents64
        0.08    0.012168          21       565           read
        0.05    0.007254        2418         3         1 wait4
        0.00    0.000411          12        33           brk
        0.00    0.000379          31        12           write
        0.00    0.000196          65         3           execve
        0.00    0.000181           1       149           rt_sigaction
        0.00    0.000146           3        48           rt_sigprocmask
        0.00    0.000139           3        37           mprotect
        0.00    0.000090          10         9           clone
        0.00    0.000075           3        21         7 access
        0.00    0.000057           6         9           geteuid
        0.00    0.000057           6         9           getegid
        0.00    0.000056           6         9           getuid
        0.00    0.000056           6         9           getgid
        0.00    0.000035           2        14           pread64
        0.00    0.000017           2         6         3 arch_prctl
        0.00    0.000014           1        11         3 ioctl
        0.00    0.000012           4         3           getpid
        0.00    0.000009           4         2           dup2
        0.00    0.000009           9         1           sysinfo
        0.00    0.000008           8         1           getppid
        0.00    0.000006           6         1           uname
        0.00    0.000006           6         1           getpgrp
        0.00    0.000006           1         5           prlimit64
        0.00    0.000000           0         4         3 lstat
        0.00    0.000000           0         1           rt_sigreturn
        0.00    0.000000           0         1           fcntl
        0.00    0.000000           0         3           getcwd
        0.00    0.000000           0         7           sched_setaffinity
        0.00    0.000000           0         2           set_tid_address
        0.00    0.000000           0         9           set_robust_list
        0.00    0.000000           0         1           pipe2
      ------ ----------- ----------- --------- --------- ------------------
      100.00   14.753182          27    538744     18725 total
      

      So from my view, there is still some mystery here, which makes me cast doubt on my own conclusions.

      EDIT: Oh man, this is awesome, disabling parallelism while keeping memory maps enabled actually makes ripgrep noticeably faster:

      $ time rg PM_RESUME --mmap -j1 | wc -l
      19
      
      real    0.979
      user    0.485
      sys     0.485
      maxmem  18 MB
      faults  0
      

      EDIT(2): Looking at the syscalls above, why are there so few mmap calls for both ripgrep and ag? They’re both memory mapping every file, so there should be a lot more…

      EDIT(3): Derp. Forgot I had to add the -f/--follow-forks flag to strace if I want accurate counts in a multi-threaded program.

      1. 4

        @burntsushi - the most likely slowdown culprit (IMO) is cache sharing dynamics amongst all the threads, most likely L3, but in a hyper-threaded deployment also L1/L2. A start (but by no means the end) in assessing that situation might be:

        perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations rg blah
        

        and its --no-mmap equivalent. I’d bet your slow runs have a lot more cache misses. I’d also bet you could make the no-mmap runs emulate the undesirable slowdown with larger IO buffers.

        1. 4

          Thanks! But I don’t think that explains why ag remains twice as fast when it’s using the same technique? You do appear to be right, though: there are an order of magnitude more cache misses when memory maps are used:

          $ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations rg PM_RESUME --no-mmap | wc -l
          
           Performance counter stats for 'rg PM_RESUME --no-mmap':
          
                  18,036,122      cache-references:u
                     242,749      cache-misses:u            #    1.346 % of all cache refs
               2,041,849,245      cycles:u
               2,340,869,083      instructions:u            #    1.15  insn per cycle
                 455,149,760      branches:u
                       1,544      faults:u
                           0      migrations:u
          
                 0.148562216 seconds time elapsed
          
                 0.823037000 seconds user
                 0.796361000 seconds sys
          
          
          19
          
          $ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations rg PM_RESUME --mmap | wc -l
          
           Performance counter stats for 'rg PM_RESUME --mmap':
          
                  61,882,279      cache-references:u
                  13,438,679      cache-misses:u            #   21.717 % of all cache refs
               2,155,033,904      cycles:u
               2,141,213,061      instructions:u            #    0.99  insn per cycle
                 415,617,039      branches:u
                      77,393      faults:u
                           0      migrations:u
          
                 2.207521305 seconds time elapsed
          
                 1.798868000 seconds user
                 4.793213000 seconds sys
          
          
          19
          
          $ perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ag PM_RESUME | wc -l
          
           Performance counter stats for 'ag PM_RESUME':
          
                  28,093,417      cache-references:u
                  15,622,025      cache-misses:u            #   55.607 % of all cache refs
               2,540,363,298      cycles:u
               4,390,021,379      instructions:u            #    1.73  insn per cycle
                 843,978,973      branches:u
                      77,101      faults:u
                           0      migrations:u
          
                 2.228702501 seconds time elapsed
          
                 1.933147000 seconds user
                 4.275776000 seconds sys
          
          
          19
          

          Interestingly, when running rg and ag via perf stat, they both wind up taking right around the same wall clock time. Without perf stat, ag is consistently twice as fast as rg --mmap. The plot thickens… But at least some part of it might be answered?

          1. 4

            13 million DIMM loads is roughly the missing time: 13e6*100e-9 =~ 1.3 seconds. Your latency could be better/worse. You can run a little Nim program (or write your own Rust/C/whatever if Nim makes your eyeballs bleed) to measure the latency of your DIMMs. Typical numbers are 60..150 ns. Most people are more aware of their DIMM bw than latency. Intel also distributes some program that does the MSR updates to temporarily disable pre-fetching which is a more direct approach than that little program.
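
            A rough C sketch of such a probe (the buffer size and hop count are arbitrary; Sattolo’s shuffle builds a single random cycle so the prefetcher can’t help, and the result is only a ballpark number):

            /* latprobe.c - rough DIMM latency probe; build with: cc -O2 latprobe.c */
            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            #define N    (64UL * 1024 * 1024 / sizeof(size_t))  /* 64 MiB of indices */
            #define HOPS 10000000UL

            int main(void) {
                size_t *next = malloc(N * sizeof *next);
                if (!next) return 1;
                for (size_t i = 0; i < N; i++) next[i] = i;
                /* Sattolo's algorithm: a random permutation that is one big cycle,
                 * so the walk below hops through memory in a cache-hostile order.
                 * (Assumes RAND_MAX >= N, which holds on glibc.) */
                for (size_t i = N - 1; i > 0; i--) {
                    size_t j = (size_t)rand() % i;
                    size_t t = next[i]; next[i] = next[j]; next[j] = t;
                }
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                size_t p = 0;
                for (unsigned long h = 0; h < HOPS; h++) p = next[p];   /* dependent loads */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
                printf("~%.1f ns per load (sink=%zu)\n", ns / HOPS, p);
                return 0;
            }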

            I’ve never used ag. Are you sure it’s using the same number of threads/same cache competition/going over the same small files in roughly the same order? That’s the next thing to check. I would also try to vary parallelism in both, particularly doing just one thread per physical core vs. one per hardware thread, to study the problem.

            1. 6

              I’ve never used ag. Are you sure it’s using the same number of threads/same cache competition/going over the same small files in roughly the same order?

              Yup, I think you hit the nail on the head here. If I use ag PM_RESUME --workers 12, then it slows down to about ripgrep’s speed. If I use rg PM_RESUME --mmap -j8, then it speeds up to about ag’s speed. Namely, ag caps its number of threads to 8 when auto-detecting the number of threads to use.

              Very interesting! Thank you for updating my mental model. :-)

              1. 3

                No problemo. You may be able to help the CPU along with (CPU-specific) cache control instructions. Basically, tell it that data earlier in the mmap, which the regex is already done with, can be preferentially evicted to “enlarge the available cache”. Not sure. Would probably require experimentation/digging into rg/regex/more careful thought.

                1. 3

                  Yeah, in practice I just settled on a heuristic: if ripgrep knows it’s searching a small number of files, it uses memory maps; otherwise it uses standard read syscalls. Even for very large files, memory maps only provide a somewhat small boost. e.g., this one is 13GB and in the I/O cache:

                  $ time rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c --mmap
                  7673
                  
                  real    1.754
                  user    1.235
                  sys     0.516
                  maxmem  12510 MB
                  faults  0
                  
                  $ time rg 'Sherlock Holmes' OpenSubtitles2018.raw.en -c --no-mmap
                  7673
                  
                  real    2.663
                  user    1.130
                  sys     1.530
                  maxmem  6 MB
                  faults  0
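
                  To make the comparison concrete, here’s a hedged C sketch of the two strategies, reduced to counting occurrences of one byte. This is not ripgrep’s code; the file-count threshold, buffer size and error handling are all simplified for illustration:

                  /* count.c - the same job done two ways: a reused read(2) buffer vs.
                   * one mmap per file. Not ripgrep's code; sizes/thresholds are invented. */
                  #include <fcntl.h>
                  #include <stdio.h>
                  #include <sys/mman.h>
                  #include <sys/stat.h>
                  #include <unistd.h>

                  static long count_read(const char *path, unsigned char needle) {
                      static unsigned char buf[64 * 1024];   /* reused across files, like --no-mmap */
                      int fd = open(path, O_RDONLY);
                      long n = 0;
                      ssize_t got;
                      if (fd < 0) return 0;
                      while ((got = read(fd, buf, sizeof buf)) > 0)   /* read ... until read() == 0 */
                          for (ssize_t i = 0; i < got; i++)
                              n += buf[i] == needle;
                      close(fd);
                      return n;
                  }

                  static long count_mmap(const char *path, unsigned char needle) {
                      int fd = open(path, O_RDONLY);
                      struct stat st;
                      unsigned char *p = MAP_FAILED;
                      long n = 0;
                      if (fd < 0) return 0;
                      if (fstat(fd, &st) == 0 && st.st_size > 0)      /* open+fstat+mmap+close+munmap */
                          p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                      close(fd);                                      /* the mapping survives the close */
                      if (p == MAP_FAILED) return n;
                      for (off_t i = 0; i < st.st_size; i++)
                          n += p[i] == needle;
                      munmap(p, st.st_size);
                      return n;
                  }

                  int main(int argc, char **argv) {
                      /* Invented heuristic: few (presumably large) files => mmap, many => read. */
                      int use_mmap = argc - 1 <= 10;
                      long total = 0;
                      for (int i = 1; i < argc; i++)
                          total += (use_mmap ? count_mmap : count_read)(argv[i], 'e');
                      printf("%ld\n", total);
                      return 0;
                  }

                  Running something like this over one big file vs. thousands of small ones (e.g. via find . -type f | xargs) roughly reproduces the trade-off above.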
                  
                  1. 3

                    Fair enough. In terms of the original article here, there has been a lot of evolution of the cost of system call overhead. It’s had an uptick in recent years due to Meltdown, Spectre and the like. I’m not sure why something like https://github.com/c-blake/batch has not become standard/popular/etc.

                    That kind of thing does not really help “buffered IO” since you still need a buffer size and should just pick the right one in the first place. However, you could easily have an open+fstat+mmap+close 4-tuple batch system call to streamline that procedure. You were measuring it a bit with strace. :-)

              2. 3

                (I just realized my text might confuse..not all cache misses are created equally. Some the CPU stalls on while others can occur “in the background”, masking the latency which is, of course, the whole point of pre-fetching. But the number in this case is suspiciously (probably) almost exactly the missing time.)

              3. 1

                IIRC, mmap/munmap are very expensive operations that take a global lock and suspend all threads in the same process. Many small files are an anti-pattern for this, since all your threads end up asleep waiting for said lock as soon as a file is processed/evicted.

                1. 2

                  Also perhaps worth mentioning: if a file is smaller than the IO buffer, then there is no system call saving for the mmap method. open+fstat+read+read(returning 0)+close vs. open+fstat+mmap+close+munmap is the same number of calls. Not to oversimplify that all system calls take the same time or anything, but to perhaps offer a possibly stronger heuristic for @burntsushi. (Of course, as the article mentions, the memory copy may still be faster with mmap. That may be Linux-specific, though.)

                  1. 1

                    Right. That was definitely in my mental model, and might very well contribute to the slowdown when comparing rg --mmap with rg --no-mmap. But the number of mmap/munmap calls are roughly equivalent between rg --mmap and ag, and that’s what was perplexing me above. See this comment though where it turned out that I could make them execute at roughly the same speed by making sure they both spin up the same number of threads.

                    1. 2

                      The more threads you have, the higher the probability that you get stalled; a self-inflicted DDoS. That’s all speculative; maybe I’m totally wrong and no locking is performed! This is the kind of hypothesis that could be verified with a visual GUI like Intel VTune.

                      1. 1

                        Yeah there is locking, but it’s the same between --mmap and --no-mmap.

                        1. 1

                          I believe @fsaintjacques means locking/suspension in the kernel, not in rg. I’m not sure exactly what happens, as it’s late and I’m sleepy, but because the program is one process but multi-threaded, sharing the same virtual memory being edited by the mmap (& munmap, too), it is not so implausible. Similar parallelism with a multi-process model (perhaps communicating the ripgrep’d output over a pipe to a parent controller) would be one re-design immune to this problem. (EDIT: it is also easy to test this out with two simple programs - a reduced ripgrep and a multi-process design that just adds up all the bytes in files.)
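
                          A rough sketch of the multi-process half of that test (the worker count and the modulo split of the file list are arbitrary, and error handling is minimal):

                          /* sumbytes.c - fork NWORKERS children; each mmaps its own slice of the
                           * argument list and sums the bytes, so mmap/munmap traffic never shares
                           * an address space. Worker count and error handling are simplified. */
                          #include <fcntl.h>
                          #include <stdio.h>
                          #include <sys/mman.h>
                          #include <sys/stat.h>
                          #include <sys/wait.h>
                          #include <unistd.h>

                          #define NWORKERS 8

                          static unsigned long sum_file(const char *path) {
                              int fd = open(path, O_RDONLY);
                              struct stat st;
                              unsigned long sum = 0;
                              if (fd < 0) return 0;
                              if (fstat(fd, &st) == 0 && st.st_size > 0) {
                                  unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                                  if (p != MAP_FAILED) {
                                      for (off_t i = 0; i < st.st_size; i++) sum += p[i];
                                      munmap(p, st.st_size);
                                  }
                              }
                              close(fd);
                              return sum;
                          }

                          int main(int argc, char **argv) {
                              for (int w = 0; w < NWORKERS; w++) {
                                  if (fork() == 0) {                  /* child: every NWORKERS-th file */
                                      unsigned long sum = 0;
                                      for (int i = 1 + w; i < argc; i += NWORKERS)
                                          sum += sum_file(argv[i]);
                                      printf("worker %d: %lu\n", w, sum);
                                      _exit(0);
                                  }
                              }
                              while (wait(NULL) > 0) {}               /* parent: reap all workers */
                              return 0;
                          }

                          Run it over the same file list (e.g. ./sumbytes $(find . -type f)) and compare against the same loop driven by threads in a single process; if kernel-side mmap serialization is the problem, the multi-process version should suffer less.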

                          1. 3

                            Correct, I mean kernel locking + all-threads suspension, regardless of whether they’re doing userland processing or waiting for another mmap/munmap.

                            1. 2

                              Oh right, yes, derp. I’m sleepy too. Yup, I remember speaking with folks about that a while back too.

            2. 7

              One of my more “fun” developer experiences was working on a database-like program that kept the main data in a huge mmap’d file (a couple GB), and then kept many different pointers into that file in separate files that were manipulated using system calls. You may see the problem here right away: if the process is killed for any reason (including SIGKILL), Linux goes to great lengths to sync the mmap’d file to disk as it is shutting down, which is pretty great. But that left all these random pointers to garbage data in the other files, causing angry phone calls to come to me. Anyway, my point is that the semantics of mmap versus system calls are substantially different, not just the performance.

              1. 2

                I guess it’s safe to say that mmap is a sort of DMA mechanism? Really an interesting model. I guess most computers spend a lot of time reading files, but I’m always surprised at the diversity and the tooling around something that often (in high-level languages) just gets mapped to “here’s a stream of bytes”

                1. 3

                  I’ve always thought about it as an inevitable logical consequence of demand paging implemented by a hardware MMU: e.g. “if we can do it for program memory and swap files, why not all memory and normal files; there’s hardware support”. There are even some neat tricks you can do with madvise(2) to control automatic (potentially parallel) readahead.
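
                  For example (a minimal sketch; whether these hints actually help is very workload- and kernel-dependent):

                  /* Sketch: hint the kernel about the access pattern on a mapping.
                   * MADV_SEQUENTIAL asks for aggressive readahead (and early reclaim of
                   * pages behind the scan); MADV_WILLNEED starts faulting pages in now. */
                  #include <fcntl.h>
                  #include <stddef.h>
                  #include <sys/mman.h>
                  #include <sys/stat.h>
                  #include <unistd.h>

                  void *map_for_scan(const char *path, size_t *len) {
                      int fd = open(path, O_RDONLY);
                      struct stat st;
                      void *p;
                      if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0) {
                          if (fd >= 0) close(fd);
                          return NULL;
                      }
                      p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                      close(fd);                                   /* mapping keeps the file alive */
                      if (p == MAP_FAILED) return NULL;
                      madvise(p, st.st_size, MADV_SEQUENTIAL);     /* read ahead aggressively */
                      madvise(p, st.st_size, MADV_WILLNEED);       /* prefetch in the background */
                      *len = (size_t)st.st_size;
                      return p;
                  }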

                  Would love to re-run the author’s experiment with the kernel’s memmove implementation patched in place of the compiler-generated AVX instructions.

                  1. 1

                    “DMA” usually refers to external devices like a NIC or GPU accessing RAM without the CPU’s involvement, I haven’t heard anyone refer to anything that happens on-CPU as “DMA”.

                    something that often (in high-level languages) just gets mapped to “here’s a stream of bytes”

                    Well, that’s the more “common” interface because stream processing is just a very common thing to do (think grep/sed, streaming parsers, sending contents of a file to a socket, writing a log file…) but like, both RAM and SSD are random-access devices, so mapping a file into an address space is the interface that matches the underlying storage more closely.

                    Streams are for rolls of magnetic tape :)

                    On a more serious note though, there is plenty of support in higher level languages for mmap.

                    1. 1

                      my understanding of DMA comes more from stuff like game consoles, where you’re writing to certain memory addresses to do stuff to the GPU (or like… reading memory to see the state of buttons, instead of making some sort of system call). This might not actually be the right use of the term though.

                      Seeing stuff like the Python thing, or the original post… is there any reason not to go with mmap for files? It seems that behind the scenes it does trickery to basically accomplish the same thing as syscalls?

                      1. 4

                        you’re writing to certain memory addresses to do stuff to the GPU

                        I think what you’re describing is memory-mapped I/O, or kernel-bypass I/O when you’re doing memory-mapped I/O from a process on an OS without going via the OS. DMA is when a device directly accesses memory. You can often trigger this via MMIO. For example, writing commands into a ring buffer and then poking a memory-mapped control register to instruct the device to read the next entries from memory (and, on a GPU, those commands may trigger additional DMAs to read data from other bits of memory).

                        Seeing stuff like the Python thing, or the original post…. is there any reason not to go with mmap for files?

                        There are a bunch of reasons:

                        • It consumes virtual address space. Less of a problem on 64-bit systems (though they’re typically actually 48-bit systems, in terms of virtual address size), but calling mmap on a lot of 16GiB files will use a nontrivial amount of your address space. Depending on how fragmented it is, you will eventually run out, though that might not matter in practice (for example, lld mmaps pretty much everything and even a big program doesn’t require it to map more than around 10GiB of stuff, which is trivial on a 64-bit system).
                        • The sharing semantics are a bit weird. When you read, you get a snapshot at the current time. With mmap and MAP_SHARED, you will see a view of the file that another process can modify while you’re reading it (i.e. in between reading two adjacent bytes). If you use MAP_PRIVATE, you have a private copy, but only after the first time a page has actually been mapped for you. If you touch page A in a region, another process modifies the data corresponding to page B, and then you read page B, it is entirely nondeterministic whether you will see the ‘before’ or ‘after’ view of the page. If the OS happens to have populated that page for you, you will see the ‘before’ view, if it has not then you will see the ‘after’ view. Any subsequent access will see whatever you saw the first time.
                        • There’s no good way of modifying the length of a file via mmap-like views. You can separately ftruncate the file, but if you make it smaller then you will have stale mappings (I think), if you make it larger then you need to map the additional bits. Linux has mremap, but it is similar to realloc and will break any pointers that you have into that region. With write, you can trivially append to a file. Doing the same thing with mmap is much harder.
                        • The mmap family of things only works for file-like objects. It doesn’t work for things like sockets or pipes, which are inherently stream-oriented. If you want your code to work on any kind of readable file descriptor, using read will work; using mmap will not. You may still choose to use mmap for performance reasons if the file descriptor is a mappable object, but now you have two independent code paths to test (a minimal sketch of such a fallback follows after this list). If performance doesn’t matter too much, just using read / write is probably fine.

                        The read / write and mmap mechanisms address different problems. Both enable some use cases that the other doesn’t.
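
                        To make the two-code-paths point concrete, here’s a hedged sketch of an mmap-first reader that falls back to read for anything that isn’t a regular file. The function name and buffer sizes are invented and error handling is minimal:

                        /* Sketch only: map regular files, stream everything else. The caller
                         * has to remember which path produced the buffer - two paths to test. */
                        #include <stdlib.h>
                        #include <sys/mman.h>
                        #include <sys/stat.h>
                        #include <unistd.h>

                        /* Returns the whole contents of fd; *mapped says whether to munmap or free. */
                        char *slurp_fd(int fd, size_t *len, int *mapped) {
                            struct stat st;
                            if (fstat(fd, &st) == 0 && S_ISREG(st.st_mode) && st.st_size > 0) {
                                void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                                if (p != MAP_FAILED) {
                                    *len = (size_t)st.st_size;
                                    *mapped = 1;
                                    return p;
                                }
                            }
                            /* Fallback: works for pipes, sockets, ttys... anything readable. */
                            size_t cap = 64 * 1024, n = 0;
                            char *buf = malloc(cap);
                            ssize_t got;
                            while (buf && (got = read(fd, buf + n, cap - n)) > 0) {
                                n += (size_t)got;
                                if (n == cap) buf = realloc(buf, cap *= 2);
                            }
                            *len = n;
                            *mapped = 0;
                            return buf;
                        }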