1. 9

TL;DR: I’m looking for a SIMD abstraction layer that will let me do some high-speed byte searching without having to worry about a dozen different instruction sets. Not having luck finding such.

I’m optimizing some JSON-parsing code, and by far the biggest bottleneck is string searching, specifically searching for the next occurrence of one of two or more characters. For example, when scanning a string I search for both double-quote and backslash. (Plus 00 if the JSON has not been validated.)

  • C has a function to do this, strcspn, but it’s slow.
  • A naive loop testing every char with several == comparisons is faster.
  • indexing each byte in a 256-item lookup table is still faster.
  • Unrolling the loop to look at 4 bytes in sequence is even faster (stole this from sa-json.)
  • “Homemade SIMD” by reading a uint32 or uint64 and doing bitwise tricks to detect any of those bytes is significantly slower, sadly.
  • [Goldilocks solution goes here]
  • The SIMDJSON library is super fast but has a whole bunch of complex code — it has to implement the core loops six or seven times for specific instruction sets, and then figure out at runtime which of those to call. I would rather avoid dragging all that in.

Yesterday I was experimenting with Apple’s “simd” library, which is based on Clang vector extensions, and it does pretty well, but I am not sure if this is portable to non-Apple systems or non-Clang compilers. (GCC has its own extensions, which Clang partially supports, but when I tried those Clang generated slow non-SIMD code.)


  2. 6

    Often this trick works:

    • instead of iterating element-by-element, iterate in chunks of length N (where N is a compile-time constant)
    • write scalar code for the inner loop
    • the compiler auto-vectorizes the inner loop for the instruction set at hand, because it does a constant number of iterations.
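    A hypothetical C++ rendering of those steps (invented names; remainder handling omitted, and assuming the compiler actually vectorizes the inner loop):

    ```cpp
    #include <cstddef>
    #include <cstdio>

    // Branchless inner loop over fixed-size chunks: the constant trip count N
    // is what lets the compiler auto-vectorize it for the target ISA.
    template <size_t N>
    static size_t findEither(const char* data, size_t len, char a, char b) {
        for (size_t base = 0; base + N <= len; base += N) {
            bool any = false;
            for (size_t i = 0; i < N; i++)           // no early exit here
                any |= (data[base + i] == a) | (data[base + i] == b);
            if (any)                                 // rare: rescan the chunk
                for (size_t i = 0; i < N; i++)
                    if (data[base + i] == a || data[base + i] == b)
                        return base + i;
        }
        return len;   // remainder handling omitted for brevity
    }

    int main() {
        const char* s = "0123456789abcdef0123456789a\"bcde";  // 32 bytes
        printf("%d\n", (int)findEither<16>(s, 32, '"', '\\'));
    }
    ```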

    This thread gives an example of this technique in Rust:


    This even works with runtime detection of available SIMD instructions; you just need to convince the compiler to codegen several copies of the function for different target features.

    1. 1

      That sounds like the idiom I got from sa-json, which in my code I abstracted as:

          /// Finds the first character matching a predicate. Leaves `*pp` pointing to it.
          template <typename PREDICATE>
          inline char firstCharMatching(const char* *pp, PREDICATE pred) {
              const char *p = *pp;
              char c;
              while (true) {
                  c = p[0]; if (pred(c)) {p += 0; break;}
                  c = p[1]; if (pred(c)) {p += 1; break;}
                  c = p[2]; if (pred(c)) {p += 2; break;}
                  c = p[3]; if (pred(c)) {p += 3; break;}
                  p += 4;
              }
              *pp = p;
              return c;
          }

      This is what I’m currently using, as I haven’t gotten anything else to run faster. I tried unrolling it further, to 8 lines, but that was slower. (On my computer, which is a MacBook Pro with an M1 Pro.)

      I’m able to parse+validate JSON at up to 2GB/sec depending on the input file, so it’s not too shabby.

      1. 6

        The right way to answer these kinds of questions is to look at godbolt, but, as this is a template and I can’t just paste it into the compiler explorer as is, I would hazard a guess instead!

        My gut feeling is that the compiler would have trouble with those early breaks. As written, the code has the property that it doesn’t load any bytes past the matched character, and an auto-vectorized version would do a bit of an over-read there. That should be equivalent, but I think as of today Clang has trouble seeing it (again, I haven’t looked at the assembly here, so this is wild speculation and probably wrong). So I’d try to help the compiler a little bit more by making the code branchless.

        Applying the trick in Rust seems to help, relative to a Rust version with manual unrolling:

            naive loop    = 145.51ms
            unrolled loop = 111.47ms
            chunked loop  = 26.41ms

            pub fn find2_fast(haystack: &[u8], needle: [u8; 2]) -> Option<usize> {
                find2_chunk::<32>(haystack, needle)
            }

            fn find2_chunk<const N: usize>(haystack: &[u8], needle: [u8; 2]) -> Option<usize> {
                let mut chunks = haystack.chunks_exact(N);
                let chunks_len = chunks.len();
                let chunk_index = chunks
                    .position(|chunk| {
                        // Manually write everything to make sure there's no short circuiting.
                        // This makes the code branchless and auto-vectorizable.
                        let mut has_first = false;
                        for i in 0..N {
                            has_first |= chunk[i] == needle[0];
                        }
                        let mut has_second = false;
                        for i in 0..N {
                            has_second |= chunk[i] == needle[1];
                        }
                        has_first || has_second
                    })
                    .unwrap_or(chunks_len);
                let offset = chunk_index * N;
                haystack[offset..]
                    .iter()
                    .position(|&b| b == needle[0] || b == needle[1])
                    .map(|it| it + offset)
            }

        Full benchmark here:


        And here’s a godbolt comparison of the manually unrolled loop and the branchless chunked one:


        You can see that the chunked version has a bunch of funky vector instructions like vpmovmskb, while the unrolled one is, well, just an unrolled loop.

        I am only 0.7 sure that the results would be the same for C++, but, if the godbolt for your version doesn’t actually include a lot of vector instructions, then it’s 0.95 that a massive improvement is possible!

        NINJAEDIT: changed the commit, as the first version of the code didn’t actually auto-vectorize and contained memchr calls.

        1. 1

          Possibly a naive question, but how/where do you make sure that the p[n] dereferences are valid? Is this encoded in the predicate, where it will at least be valid once?

          1. 2

            The predicate is assumed to be true at least once before the end of the data — either the predicate checks for a 00 sentinel, or the data’s already been validated to contain what it’s looking for.
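            Concretely, the sentinel flavor of the predicate might look like this (an invented example, not the actual parser’s code):

            ```cpp
            #include <cstdio>

            // Stops on the JSON string delimiters *and* on NUL, so the scan can
            // never run past the end of a NUL-terminated buffer.
            static bool isStop(char c) { return c == '"' || c == '\\' || c == '\0'; }

            // Minimal scan loop; the 4-way unrolled firstCharMatching relies on
            // the same guarantee that the predicate fires before the data ends.
            static const char* scan(const char* p) {
                while (!isStop(*p)) ++p;
                return p;
            }

            int main() {
                const char* s = "no specials here";  // still fires, on the NUL
                printf("%d\n", (int)(scan(s) - s));
            }
            ```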

            1. 1

              My understanding (not a Rust expert) is that this is achieved using the chunks_exact slice iterator which only looks at the part of the buffer that’s evenly divisible by the chunk size. I assume dealing with the remainder is left as an exercise to the reader. (Or the calling code can arrange to make the buffer an evenly divisible size, padded out with nul bytes - usually the cleanest option when doing SIMD if you can somehow pull it off.)

              1. 1

                The code I was commenting on is C++, though.

        2. 4

          Modern compilers are pretty good at vectorising if they can prove that things don’t alias and are adequately aligned. I’ve had both gcc and clang generate better code by adding restrict and alignas in places than by manually vectorising. There’s also the vector_size attribute, which gives you primitive operations on vector types, and an OpenMP simd pragma that basically says ‘skip the analysis, I promise it’s safe to vectorise’.
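          A tiny sketch of the pragma variant; the pragma is only a hint, so without -fopenmp-simd it’s ignored and the loop still behaves identically:

          ```cpp
          #include <cstdio>

          // restrict plus the OpenMP simd pragma tell the compiler the loop is
          // safe to vectorise without doing its own alias analysis.
          static int countByte(const unsigned char* __restrict__ p, int n,
                               unsigned char c) {
              int count = 0;
              #pragma omp simd reduction(+:count)
              for (int i = 0; i < n; i++)
                  count += (p[i] == c);
              return count;
          }

          int main() {
              unsigned char buf[64];
              for (int i = 0; i < 64; i++) buf[i] = (unsigned char)(i % 4);
              printf("%d\n", countByte(buf, 64, 2));
          }
          ```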

          1. 1

            Would Intel’s ISPC be at the right level of abstraction for you? (As well as x86 it also targets NEON. Many major OSes are supported.)

            1. 1

              Interesting! It would require writing the code in their custom C-like language; I wonder how much of the core parsing I could put in there, and how much it could speed it up. Thanks for the link.

            2. 1

              SIMD Everywhere (simde) may be of interest

              1. 1

                In Julia you could use the SIMD.jl package to do SIMD manually while also abstracting across different processor types. Or you could use Metal.jl to create arrays on the M1’s GPU and perform operations on it pretty easily. Or you could use LoopVectorization.jl and have the library and the compiler be very free with rewriting standard for loops that you prefix with a macro.

                1. 1

                  OK, but this is C++ and will be used in a bigger C++ and/or Go project.

                  1. 1

                    Cool, well nevermind :)

                2. 1

                  indexing each byte in a 256-item lookup table is still faster.

                  Careful with this one. I’ve had performance regressions by measuring this in isolation, seeing that the lookup table version was “faster” and so switching to it, but then later discovering that in the actual situation that code runs in, it was significantly slower than straightforward comparisons because the cache behaviour is worse.

                  Assuming that micro-benchmarks reflect reality can be trouble.

                  1. 1

                    Hm, does parsing a 35MB file still count as a micro-benchmark? I’m testing with a dozen or so files of up to that size.

                    In actual use, some of the work will be a single scan through potentially-large data (validating / indexing the JSON), followed by smaller scans of subranges of the data for individual queries. The latter might expose the kind of cache slowdown you describe, but how to account for that in tests?

                    I’ve also considered making the LUT a bit-vector rather than a byte-vector, so only 32 bytes instead of 256 at the expense of some extra bit-twiddling, but I haven’t benchmarked that.
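                    For reference, that bit-vector variant might look something like this (illustrative and, as said, unbenchmarked):

                    ```cpp
                    #include <cstdint>
                    #include <cstdio>

                    // 256 bits in 32 bytes: bit (c & 7) of bits[c >> 3] marks byte value c.
                    struct ByteSet {
                        uint8_t bits[32] = {};
                        void add(uint8_t c) { bits[c >> 3] |= (uint8_t)(1u << (c & 7)); }
                        bool contains(uint8_t c) const { return (bits[c >> 3] >> (c & 7)) & 1; }
                    };

                    int main() {
                        ByteSet s;
                        s.add('"'); s.add('\\'); s.add(0);
                        printf("%d %d %d\n", s.contains('"'), s.contains('a'), s.contains('\\'));
                    }
                    ```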

                    1. 1

                      Doing full parsing with large files is probably fine, I’d say. An obvious micro-benchmark would be just benchmarking that function itself, although It Depends. Parsing and just throwing everything away might be a bit different from parsing and actually using the data for something in an application, like loading it into a data structure. Data aside, you can also get “fun” effects from the instruction cache when different approaches have different amounts of code that lives in different places.