1. 15
    1. 3

      I’ve seen a couple of articles on fast search with SIMD, but they’re never relevant to my interests because they assume the keys are numbers. When I have keys to look up, they’re strings/blobs of variable length. Contrariwise, I’ve seen SIMD code for scanning strings, such as Lemire’s work on JSON, but never anything for searching.

      I suppose one could map a string to a 64-bit int by taking the first 8 bytes and zero-padding* if necessary. Then once you did the SIMD lookup as per this article you’d have a second layer to disambiguate strings with identical 8-byte prefixes; perhaps that uses the same algorithm again but on bytes 8..15.

      * although that’s awkward if your keys can contain 00 bytes
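
      For concreteness, a minimal sketch of that mapping (my code, not the article’s; key_prefix is a made-up name, and the byte swap assumes a little-endian machine so that integer comparison agrees with memcmp order):

      #include <stdint.h>
      #include <string.h>

      static uint64_t key_prefix(const char* key, size_t len) {
          uint64_t buf = 0;
          memcpy(&buf, key, len < 8 ? len : 8); // zero-pad keys shorter than 8 bytes
          return __builtin_bswap64(buf);        // big-endian, so u64 order == memcmp order
      }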

      1. 2

        I suppose one could map a string to a 64-bit int by taking the first 8 bytes and zero-padding* if necessary […]

        I believe you can store the strings without padding and then pad on the fly using VPEXPANDB from AVX-512. AFAIU, your current data layout is something like this:

        // Encode ["Get", "Post", "Delete", "Put"]
        const uint8_t key_tree_layout[] = { 
            3, 'G', 'e', 't', 4, 'P', 'o', 's', 't', 6, 'D', 'e', 'l', 'e', 't', 'e', 
            3, 'P', 'u', 't'
        };

        You could instead encode the string lengths as a bitmap, with an overhead of 1 byte per string:

        uint64_t bitmap = 0b00000111'00111111'00001111'00000111; // bit 0 = first padded byte ('G')
        const char* compressed_layout = "GetPostDeletePut";

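        The per-slot masks can be derived mechanically from the lengths; this builder is my own sketch (bit 0 of the bitmap corresponds to the first padded output byte, which is the order VPEXPANDB consumes its mask in):

        #include <stddef.h>
        #include <stdint.h>

        // Sketch: each string occupies one 8-byte output slot, so its mask
        // byte is (1 << len) - 1, placed LSB-first.
        uint64_t encode_lengths(const uint8_t lens[], size_t n) {
            uint64_t bitmap = 0;
            for (size_t i = 0; i < n && i < 8; i++)
                bitmap |= ((1ull << lens[i]) - 1) << (8 * i);
            return bitmap;
        }
        // encode_lengths((uint8_t[]){3, 4, 6, 3}, 4) == the bitmap above
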
        Then you can use VPEXPANDB to pad on the fly while processing:

        #include <immintrin.h>
        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        bool process_avx512(const char* needle, size_t needle_len, const uint64_t mask[], size_t mask_len, const char* haystack) {
            __m512i zmm_zero = _mm512_setzero_si512();

            // Broadcast the zero-padded needle to all eight 64-bit lanes.
            uint64_t buf = 0;
            memcpy(&buf, needle, needle_len);
            __m512i zmm_needle = _mm512_set1_epi64(buf);

            // Each chunk consumes popcount(mask[i]) compressed bytes and
            // expands them into 64 bytes of zero-padded keys.
            size_t offset = 0;
            for (size_t i = 0; i < mask_len; i++) {
                __m512i chunk = _mm512_loadu_si512(haystack + offset);
                __m512i expanded = _mm512_mask_expand_epi8(zmm_zero, mask[i], chunk); // VPEXPANDB

                /* Find needle in haystack here */

                offset += (size_t)_mm_popcnt_u64(mask[i]);
            }
            return false;
        }

        This will let you keep the space efficiency of variable length strings but still use SIMD operations for range queries. It will need some thought on how to handle strings longer than 8 bytes though.
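
        For the elided lookup step, I’d guess a single vector compare per chunk does it; a sketch, assuming equality on whole zero-padded 8-byte lanes (find_slot is my name, not from the code above):

        #include <immintrin.h>

        // Compare all eight padded keys in an expanded chunk against the
        // broadcast needle at once; return the matching slot index, or -1.
        static int find_slot(__m512i expanded, __m512i zmm_needle) {
            __mmask8 hits = _mm512_cmpeq_epi64_mask(expanded, zmm_needle);
            return hits ? __builtin_ctz(hits) : -1;
        }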

        1. 1

          What does one do to have variable length strings as keys for lookup?

          1. 2

            That’s kind of a strange question. What doesn’t use variable-length strings as lookup keys? I mean, there are languages in which those are literally the only data structure.

            In my case, I’ve recently been implementing a Prolly Tree, and in the past I’ve built a persistent key/value database inspired by LMDB. I’ve implemented quite a few hash tables too.

          2. 1

            Yeah, most of the string SIMD stuff I have seen aims to speed up a scan along a string, which has rapidly diminishing returns for key lookups when the key is often much shorter than a SIMD register.

            Recently I’ve been thinking about speeding up search near the leaves of a tree, by bundling leaves together so that multiple keys can be compared in parallel. The basic idea should work well for search trees or radix trees, probably for interior btree nodes too. I should read “Is prefix of string in table?” (via ~xoranth).

            I like your idea. My qp-trie is fairly successful compared to other kinds of radix tree but it still has a lot of indirection. One idea in Phil Bagwell’s paper on HAMTs (which was one of the inspirations for the qp-trie) is to have a jumbo node at the root of the trie, so instead of normal up-to-64 wide nodes using 6 bits of key each, chomp off (say) 10 bits of key for an up-to-1024 wide root node. I wonder how a vectorized binary search would perform.

            1. 1

              I have a data structure I’m calling a KeyTree until I come up with a better name. It’s a flattened binary search tree where a node is a string followed by the left subtree followed by the right subtree. The left subtree is preceded by a byte count so it can be skipped quickly. I’m also using prefix compression at each node, so in a lot of use cases it’s extremely compact.
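
              A lookup over that layout might go something like this; it’s just my reading of the description (one-byte length and count fields are an assumption, and prefix compression is omitted):

              #include <stdbool.h>
              #include <stdint.h>
              #include <string.h>

              // Speculative encoding: node := key_len, key bytes, left_size,
              // left subtree bytes, right subtree bytes.
              static bool keytree_lookup(const uint8_t* node, const uint8_t* end,
                                         const char* key, size_t key_len) {
                  while (node < end) {
                      uint8_t klen = *node++;
                      int cmp = memcmp(key, node, key_len < klen ? key_len : klen);
                      if (cmp == 0 && key_len == klen)
                          return true;
                      node += klen;
                      uint8_t left_size = *node++;
                      if (cmp < 0 || (cmp == 0 && key_len < klen))
                          end = node + left_size;   // search only the left subtree
                      else
                          node += left_size;        // skip left, search the right
                  }
                  return false;
              }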

            2. 1

              why do you not hash the strings?

              1. 2

                Because I need ordered comparison.

                If I hashed the strings, I’d use a hash table instead of a search tree / binary search.

                1. 1

                  is there a reason they have to be strings (rather than structured objects)?

                  1. 2

                    Structured objects have to be serialized to be stored. And custom comparison functions are awkward for k/v stores, because you can’t use the db at all if the function isn’t registered, which makes CLI tools infeasible.

                    Fortunately you can usually serialize a complex value to a form that can be compared lexicographically, i.e., with strcmp/memcmp.
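
                    A classic example of such a serialization, with details I’m assuming rather than quoting: encode integer fields big-endian so memcmp sees them in numeric order, then concatenate fields.

                    #include <stddef.h>
                    #include <stdint.h>

                    // Illustrative only: a big-endian u32 sorts correctly under
                    // memcmp. (Variable-length string fields need a terminator
                    // or escaping so field boundaries don’t break the order.)
                    static size_t encode_u32_be(uint8_t* out, uint32_t v) {
                        out[0] = (uint8_t)(v >> 24);
                        out[1] = (uint8_t)(v >> 16);
                        out[2] = (uint8_t)(v >> 8);
                        out[3] = (uint8_t)v;
                        return 4;
                    }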

              2. 1

                do you know of any trees that are unicit with ordered keys? Unicit meaning it has the same tree structure given the same data.

                1. 1

                  Most tries / radix trees have this property. It isn’t guaranteed to be true if the trie has some flexibility in its node size or layout, but it is true of (e.g.) crit-bit trees and qp-tries.

                  1. 1

                    But are tries / radix trees ordered? I guess it’s not strictly required but not hard for them to be.

                    But my understanding is that they’re not used as indexes in DBs, because they’re not cache aware like b-trees?

                    1. 1

                      They are strictly bytewise lexicographically ordered, so if you need some more complicated collation then the keys need to be transformed before use with a radix tree. (The Unicode collation algorithm can be used this way.)

                      What’s important for cache-friendliness is that the tree owns the keys so that the keys can be stored inline with the nodes. Radix trees do this by construction (though crit-bit trees and qp-tries still need a copy of the key in each leaf node). A qp-trie overlaps child node calculation with child node fetch from memory, which is unusually cache-friendly for a tree structure.

                      I think b-trees are more popular because they have flexible node sizes so you can pack keys and pointers into a page efficiently. The node size of a radix tree is fixed by the shape of the data. You need to match the maximum fan-out (the radix) to the page size. Maybe you can pack several nodes into a page when they are small (near the leaves).

                      Traditionally radix trees have had poor memory usage, but more recent designs (Judy arrays, adaptive radix trees, qp-tries) are parsimonious. But b-trees are decades older.

                  2. 1

                    Yes: Prolly Trees. I’ve been working on implementing one recently.

                    1. 1

                      Are they also cache-friendly? It seems like the version where you build them from scratch would be. But for larger trees, you’d need to update them incrementally, and it’s not clear to me that they would be cache-friendly.

                      1. 1

                        Yeah, I don’t have a way to update it incrementally. I’m relying on insertion performance not being critical. In my benchmarking, everything is dominated by I/O, so making data structures small is important.