1. 33

  2. 8

    Pretty neat!

    Took me a while to understand how v5 worked. In case anyone else has a hard time understanding:

    When they say that each entry pointer is offset by 3 bytes from an 8-byte boundary (N * 8 + 3), this is due to the dirent64_t type:

    struct dirent64_t {
      ino64_t d_ino;            // 8 bytes
      off64_t d_off;            // 8 bytes
      unsigned short d_reclen;  // 2 bytes
      unsigned char d_type;     // 1 byte
      char d_name[];            // not aligned
    };

    Those 3 bytes (d_reclen + d_type) and the first 5 bytes of d_name compose a full 64-bit word (QWORD). After skipping the first 5 bytes of d_name, the resulting pointers (e.g., a + 5) are 8-byte aligned.

    1. 7

      That’s right. alignof(dirent64_t) is 8 and offsetof(dirent64_t, d_name) is 19, so d_name is aligned at an N * 8 + 3 mark. If we add 5 to such a pointer, it’ll be aligned to 8. memcmp really likes pointers aligned to 8, so we really want to give it our pointers plus 5. But this means we have to handle the first 5 characters manually, and we should do it faster than memcmp. If we just went over the first 5 bytes one by one, we might as well pass the pointer to memcmp unmodified because it can also go one by one.

      The insight is that we can compare the first 5 characters faster than memcmp if we mutate the strings as a preparation step. In fact, we’ll compare not 5 but 8 first characters. Before we even start sorting, we reverse the order of the first 8 characters in every string. If the string is shorter than 8, we’ll scoop some past-the-end garbage, but that’s fine – we’ve arranged things so that past-the-end memory is also within our own buffer.

      Then, to compare two strings, we read the first 8 bytes as uint64_t via an unaligned read (on Skylake such reads aren’t penalized) and compare these two 64-bit integers. If one is less than the other, the corresponding string is less than the other (we reversed the bytes specifically to achieve this property on Little Endian architectures). If the numbers are equal, we ask memcmp to compare everything after the 5th character, effectively comparing characters 6 through 8 twice.
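      Roughly, in code (a hedged sketch of the scheme just described, with my own names, not the article’s exact implementation):

      #include <algorithm>
      #include <cstdint>
      #include <cstring>

      // Unaligned 8-byte load; memcpy avoids UB and compiles to a single mov on x86-64.
      inline uint64_t Read64(const void* p) {
        uint64_t x;
        std::memcpy(&x, p, sizeof(x));
        return x;
      }

      // Preparation step, done once per string before sorting. Assumes reading and
      // writing 8 bytes from the start of every string stays within our own buffer.
      inline void ReversePrefix(char* s) { std::reverse(s, s + 8); }

      // After reversal, the original first byte is the most significant byte of the
      // little-endian load, so integer order matches lexicographic order.
      inline bool Less(const char* a, const char* b) {
        uint64_t x = Read64(a), y = Read64(b);
        if (x != y) return x < y;
        // First 8 bytes are equal; a + 5 and b + 5 are 8-byte aligned because
        // d_name sits at N * 8 + 3. The length 256 relies on the terminating
        // NUL (and zeroed padding) to stop the comparison.
        return std::memcmp(a + 5, b + 5, 256) < 0;
      }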

      Hope that helps.

    2. 4

      I was expecting to see the kernel being bypassed and directory structures being read directly from disk in bulk (obviously with limited fs support)

      1. 4

        Ha-ha :-) Maybe next time. The main bottleneck in gitstatusd is fstatat, which is why directory listing optimizations are in a separate document. If I could optimize that, it would yield serious gains in performance.

        Edit: By the way, the improvements in the article look pretty small between different versions because CPU time is dominated by system calls that are out of my control. If you look at userspace CPU only, the speedup between v1 and v5 is 15x. In other words, the final version uses 15 times less CPU time in userspace. It also uses less CPU in the kernel, but the improvement there isn’t as dramatic (just the openat vs open optimization).

      2. 3

        Crazy post. I hate it but I love it.

        To avoid heap allocations we can use a simple arena that will allow us to reuse memory between different ListDir() calls.

        Modern std::string stores short strings inline, so it will store all of your 16-character test filenames inline. I suspect the only benefit from v2 is reusing the entries vector. If you reuse entries in v1 but otherwise leave it unchanged, it should be as fast or faster than v2 for that test. Although you end up going in a totally different direction in v4, so it doesn’t matter.

        In v4 though, I don’t see any point in using std::string as your buffer instead of std::unique_ptr<char[]>. You don’t even need that buffer zeroed, since the null terminators will always end the comparison.

        In v5, you use 255 as the length for memcmp, which will force it to read the last 7 bytes one at a time. Might as well do 256, since you’re already depending on the null terminator to end the comparison anyway.

        Edit: you said in another comment that unaligned reads aren’t penalized on Skylake, TIL. You’re also doing an unaligned read of d_name directly. It would probably be even faster to read d_name - 3, mask it, and then compare. Let me know if you try it, because I’m not totally sure. You might be getting some speedup from short-circuiting before memcmp by examining the first 8 bytes of the name, and you might lose some of that advantage by going down to examining the first 5 bytes. I’m not sure how much that is helping you, therefore I’m not sure if aligning the read would be worth it.

        1. 2

          Modern std::string stores short strings inline, so it will store all of your 16-character test filenames inline.

          The limit is 15 characters in libstdc++ – not enough for 10% or so of files. Also, swapping 32-byte objects during sorting is noticeably slower than swapping pointers. libc++ has 24-byte strings and can embed up to 23 chars. The downside is that it has a branch in data() which slows everything down.
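          If you want your implementation’s numbers, a tiny probe (capacity() of a default-constructed string is the inline-storage limit):

          #include <iostream>
          #include <string>

          int main() {
            std::string s;
            // Prints 32 and 15 on libstdc++; libc++ is smaller with more inline storage.
            std::cout << "sizeof: " << sizeof s
                      << ", inline capacity: " << s.capacity() << '\n';
          }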

          But as you mentioned, going zero-copy beats both v1 and v2, so the point is moot.

          In v4 though, I don’t see any point in using std::string as your buffer instead of std::unique_ptr<char[]>.

          Simpler code for the example. The real code uses a more advanced Arena that supports variable-length allocations and custom alignment.
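          For reference, a bump allocator in the spirit of the “simple arena” quoted above might look like this (my sketch, not the real Arena):

          #include <cassert>
          #include <cstddef>
          #include <vector>

          class Arena {
           public:
            explicit Arena(size_t capacity) : buf_(capacity) {}

            // Hand out n bytes from a single fixed block; no per-allocation free.
            char* Allocate(size_t n) {
              assert(used_ + n <= buf_.size());
              char* p = buf_.data() + used_;
              used_ += n;
              return p;
            }

            // Reuse the same memory for the next ListDir() call.
            void Reset() { used_ = 0; }

           private:
            std::vector<char> buf_;
            size_t used_ = 0;
          };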

          You don’t even need that buffer zeroed, since the null terminators will always end the comparison.

          The code is reading up to 7 bytes past the terminating NUL, and reading uninitialized memory is Undefined Behavior (I’m not zeroing it in prod code but I really should).

          In v5, you use 255 as the length for memcmp…

          It’s 256 in the code. I never explain why I use 256 rather than 255 but the reasoning is close to what you said. There is some discussion in https://github.com/romkatv/gitstatus/pull/18.

          1. 1

            Sorry for the very delayed reply. It’s good to hear the real code addresses all my comments. :)

            The limit is 15 characters in libstdc++

            RIP. I’m used to libc++.

            it has a branch in data() which slows everything down

            I’m really skeptical of that without measurement. My intuition says it should branch predict to inline storage.

            reading uninitialized memory is Undefined Behavior

            I have such mixed feelings about this. On one hand, UB should be avoided. On the other, I can’t imagine any CPU running git that would have a problem with this. On the third hand, what if C++ code ultimately gets compiled to completely different targets that may not resemble real hardware? Things like emscripten exist, where might that go in the future? In this case the waste to avoid UB is negligible, so I guess it doesn’t matter.

            1. 2

              RIP. I’m used to libc++.

              At Google? I quit a year ago, and while there were plans to switch over to libc++ I didn’t expect them to come to fruition by now.

              I’m really skeptical of that without measurement. My intuition says it should branch predict to inline storage.

              Branch prediction is a simple device. If all your strings fit within inline storage or all don’t, branch prediction will work all the time. If strings alternate between inline and out-of-line, it’ll fail all the time. As far as branches go, this one is on the expensive side of things, for a typical application has a mix of short and not-so-short strings. On the other hand, the benefits of smaller sizeof(std::string) and larger inline storage are large.

              Synthetic benchmarks are, unfortunately, of no help here. It’s trivial to write one that will show libc++’s std::string implementation outperforming libstdc++’s by a huge margin (e.g., a benchmark that creates and destroys strings 16 characters in length). Likewise for the opposite (create a short and a long string and call data() on them in alternating fashion, with a DoNotOptimize() call in every loop iteration). Even a massive load test in production (e.g., go/sll) won’t be conclusive. If, for example, the load test shows lower performance with libc++, you might be able to change the code in a few places (e.g., cache the result of std::string => std::string_view conversion where it’s invoked frequently and is marked expensive by the profiler) to turn the tables. You need many iterations of this sort to determine which std::string implementation is a better default. It’s difficult and time-consuming but ultimately can be done. For example, I and a few of my colleagues did this before pushing SwissTable and movable protos.
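              For concreteness, the alternating-data() benchmark described above, sketched with Google Benchmark:

              #include <benchmark/benchmark.h>
              #include <string>

              static void BM_AlternatingData(benchmark::State& state) {
                std::string short_str = "short";  // inline storage
                std::string long_str(64, 'x');    // out-of-line storage
                for (auto _ : state) {
                  // Alternating between the two defeats the branch predictor in
                  // an SSO implementation that branches in data().
                  benchmark::DoNotOptimize(short_str.data());
                  benchmark::DoNotOptimize(long_str.data());
                }
              }
              BENCHMARK(BM_AlternatingData);
              BENCHMARK_MAIN();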

              1. 1

                At Google?

                No. I haven’t had a reason to dig into STL performance at Google, and my previous job rarely used STL types at all.

                for a typical application has a mix of short and not-so-short strings

                I was thinking about your application, which is dominated by < 22 char strings.

                go/sll

                I never expected to see a go link on lobsters. ;)

                I and a few of my colleagues did this before pushing SwissTable and movable protos.

                Well that explains why flat_hash_map is so good. I’m reading go/swisstable-performance now, y’all really put in the work to make the best hashtable you could.

                I can see how libstdc++ string vs libc++ string isn’t a totally cut and dry comparison for most applications. Lots of my intuition about performance comes from my previous company where using a shared allocator was often a nonstarter performance-wise, so the increased inline storage of libc++ string appeals to me for that reason.

        2. 3

          Do I understand correctly that the reinterpret_cast in v2 is used as a smoke screen to hide from the type system? Could a union be used there instead? Or would it have some performance/memory disadvantages?

          1. 2

            v2 is actually quite tricky. You are correct that it could use vector<U>& entries as a parameter, where U is a union of size_t and char*. This, however, would make its API more complex. Callers don’t want vector<U>, they want vector<char*>.

            ListDir could use vector<size_t> internally and then transform it to vector<const char*> but this would either require an extra allocation in every ListDir call, or an extra vector argument for scratch space. It’s also more memory bandwidth intensive.

            The solution in v2 keeps the API simple and avoids extra costs by reusing the vector for two purposes.
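            In sketch form (hypothetical names; the idea is to store byte offsets while the buffer may still reallocate, then patch them into pointers in place):

            #include <cstdint>
            #include <vector>

            // During listing: entries.push_back(reinterpret_cast<char*>(offset));
            // Once the buffer stops growing, fix everything up in place:
            void OffsetsToPointers(char* buf, std::vector<char*>& entries) {
              for (char*& p : entries) {
                // p currently smuggles a byte offset through the pointer type.
                p = buf + reinterpret_cast<uintptr_t>(p);
              }
            }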

            1. 1

              Ah, right, didn’t think about the API, thanks!

          2. 2

            Fast and respectably arcane.

            Love it!

            1. 2

              Cool article. The author obviously knows all of this, and now I’ll be able to write an implementation like this as well, but how would a beginner or intermediate C++ programmer find out all of these things? Where to look, what docs, which books, etc.? (To apply them to another arbitrary problem.)

              1. 7

                That’s a good question. Factoids like “openat() from the parent directory is faster than open()” aren’t common knowledge and I don’t know how they could be discovered. They aren’t mentioned in man pages or in books. What makes matters worse is that there is an almost infinite amount of them, because they are about specific implementations. They also get obsolete, and you need to take notice. For example, my implementation takes advantage of the fact that unaligned reads are not penalized on Skylake. That Read64 function in the article is as fast as *reinterpret_cast<const uint64_t*>(p). It uses memcpy to avoid Undefined Behavior, but it actually compiles to a single instruction. On previous x64 CPU architectures unaligned reads were expensive, so I would have to write this code differently.
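                To illustrate the openat() factoid (OpenSubdir is my name for the sketch, not a gitstatusd API):

                #include <fcntl.h>

                // With an fd for the parent directory in hand, the kernel resolves a
                // single path component instead of re-walking the full path the way
                // open("/repo/src/util", ...) would on every call.
                int OpenSubdir(int parent_fd, const char* name) {
                  return openat(parent_fd, name, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
                }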

                1. 5

                  Take your university operating system class? They probably won’t teach you about openat, but my course at least involved implementing a file system, at which point you realize that open("/here/there/every/where") is going to involve a bunch of operations.

                2. 1

                  Great article, I like it!

                  The intro states:

                  Moreover, the list must be sorted lexicographically to enable fast comparison with the index.

                  What kind of comparison is needed here? If it’s membership, wouldn’t turning the index into a hashmap/bloom filter yield better performance, as one could avoid sorting?

                  1. 1

                    I’m bound by the on-disk Git index format, which is essentially a lexicographically sorted list of file paths. Every time the Git index changes, I need to update my in-memory representation. I do use a different in-memory representation from what Git uses, but I’m limited in my choice by the requirement that the transformation must be cheap.

                    I could employ a trick that I use with tags, where on the first run after an index change I use one algorithm, and on the following runs I use another algorithm that requires a completely different data structure. To avoid the cost of transformation on the first run, I perform it after replying to the first request. By the time the second request for the same repository comes in, I’ve already reindexed the data and am ready to serve results much faster. This allows gitstatusd to resolve tags 4 times faster than git describe on the first run and 400 times faster on subsequent runs.

                  2. 1

                    On most of my systems, the slowest part of showing git/subversion etc. information in the shell prompt is the upward search for the .git directory in parent directories. If I’m not in a git repository, there’s a high chance it’ll reach a directory controlled by the automounter. It then triggers an attempt to NFS-mount .git, .svn etc. from the fileserver. If it could simply stop when the parent is on a different filesystem by checking statvfs, it’d be much faster.
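                    Something like this would do it (my sketch; comparing st_dev from stat(2) detects the mount point, and statvfs(3) could serve a similar purpose):

                    #include <sys/stat.h>

                    // True if walking up from dir to parent would cross a mount point,
                    // i.e. the two live on different filesystems.
                    bool CrossesMountPoint(const char* dir, const char* parent) {
                      struct stat a, b;
                      if (stat(dir, &a) != 0 || stat(parent, &b) != 0) return true;
                      return a.st_dev != b.st_dev;
                    }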

                    1. 1

                      It would be faster to not allocate any memory at all, and use a fixed buffer to iterate over the dirents directly.

                      1. 1

                        Indeed. No work is the fastest kind of work you can accomplish.

                        If I published this as a library, Arena would support allocating the first block from the stack. ListDir would stay allocation-agnostic just like it is right now.

                        1. 2

                          Well, I guess my point is that a ListDir()-style API for directory traversal is fundamentally inefficient. What happens when the directory has hundreds of thousands of entries? It would be better to wrap the OS-specific APIs in a C++ range-style object.
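                          A toy version of such a wrapper (my sketch, not a proposed library; it iterates lazily via readdir with no upfront allocation):

                          #include <dirent.h>
                          #include <string_view>

                          class DirRange {
                           public:
                            explicit DirRange(const char* path) : d_(opendir(path)) {}
                            ~DirRange() { if (d_) closedir(d_); }

                            struct iterator {
                              DIR* d;
                              dirent* e;
                              void operator++() { e = readdir(d); }
                              std::string_view operator*() const { return e->d_name; }
                              bool operator!=(const iterator&) const { return e != nullptr; }
                            };

                            iterator begin() { return {d_, d_ ? readdir(d_) : nullptr}; }
                            iterator end() { return {d_, nullptr}; }

                           private:
                            DIR* d_;
                          };

                          // for (std::string_view name : DirRange(".")) { /* one entry at a time */ }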

                          1. 1

                            I wonder how you could make it more efficient with a range-based API. Sorting is pretty challenging if you don’t read all entries.

                            1. 0

                              Yeah, that’s a good point; I forgot that you are sorting. This advice doesn’t generally apply then. On the other hand, most modern file systems return entries in sorted order, and if you can detect that you may be able to avoid sorting in user-space (the Ext* family uses HTrees: https://en.wikipedia.org/wiki/HTree).

                              1. 4

                                Are there really mainstream filesystems that give you ordered listings of directories? I’m pretty sure HTree isn’t ordered, and by extension no mainstream filesystem on Linux is ordered.

                                1. 1

                                  Hmm, yeah, HTrees are indexed by a hash of the filename, not the filename itself. If you wanted to really amp up gitstatusd re-indexing on Ext*, you could sort your internal database by that same hash…

                                  https://github.com/torvalds/linux/blob/034f891a844bba3665c2313bcbf61f335dd422e8/fs/ext4/hash.c#L199

                      2. 1

                        So is the idea to use a platform-specific function here and the general POSIX function for everybody else? I didn’t think C++ did that sort of conditional compilation.

                        1. 4

                          Yes, that’s how it’s done in gitstatusd. The Linux-specific implementation is under #ifdef __linux__. This is very common in C and C++.
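                          The shape of it (a hedged sketch, not gitstatusd’s actual code):

                          #include <cstring>
                          #include <string>
                          #include <vector>

                          #if defined(__linux__)
                          #include <sys/syscall.h>
                          #include <unistd.h>

                          // Fast path: raw getdents64, bypassing readdir(3).
                          std::vector<std::string> ListDirNames(int dir_fd) {
                            std::vector<std::string> res;
                            alignas(8) char buf[8192];
                            long n;
                            while ((n = syscall(SYS_getdents64, dir_fd, buf, sizeof(buf))) > 0) {
                              for (long off = 0; off < n;) {
                                // d_reclen at offset 16, d_name at offset 19 (see the struct
                                // earlier in the thread).
                                unsigned short reclen;
                                std::memcpy(&reclen, buf + off + 16, sizeof(reclen));
                                res.emplace_back(buf + off + 19);
                                off += reclen;
                              }
                            }
                            return res;
                          }
                          #else
                          #include <dirent.h>

                          // Portable POSIX fallback.
                          std::vector<std::string> ListDirNames(int dir_fd) {
                            std::vector<std::string> res;
                            if (DIR* d = fdopendir(dir_fd)) {
                              while (dirent* e = readdir(d)) res.emplace_back(e->d_name);
                              closedir(d);
                            }
                            return res;
                          }
                          #endif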

                          1. 2

                            There are many wrappers one can choose from if the goal is convenience: http://man7.org/linux/man-pages/man3/fts.3.html, http://man7.org/linux/man-pages/man3/ftw.3.html, http://man7.org/linux/man-pages/man3/scandir.3.html. Just like the code you linked to, they are all built on top of the glibc wrappers that I had to bypass to get maximum performance.
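                            For example, a minimal fts(3) walk (convenient, but it sits on top of the same glibc machinery):

                            #include <fts.h>
                            #include <stdio.h>

                            int main(int argc, char** argv) {
                              char* roots[] = {argc > 1 ? argv[1] : (char*)".", nullptr};
                              // FTS_PHYSICAL: report symlinks themselves, don't follow them.
                              FTS* fts = fts_open(roots, FTS_PHYSICAL, nullptr);
                              if (!fts) return 1;
                              while (FTSENT* e = fts_read(fts))
                                if (e->fts_info == FTS_F || e->fts_info == FTS_D)
                                  printf("%s\n", e->fts_path);
                              fts_close(fts);
                              return 0;
                            }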