1. 7
    1. 4

      Note that GCC and Clang (at least) can perform idiom recognition if you write the bit counting loop in the typical way. If sufficient optimization is enabled and the architecture supports the popcnt instruction, they will notice the loop and optimize it down to the popcnt instruction. I prefer this approach since I can write the code in a portable way and leave it to the compiler to do the fancy thing if it can.

      Godbolt Compiler Explorer example

      1. 3

        I prefer this approach since I can write the code in a portable way and leave it to the compiler to do the fancy thing if it can.

        It feels to me that C/C++ compilers wouldn’t be these unreliable, buggy pieces of software if they stopped doing this.

        I think doing the opposite is much better (and would be as portable as your approach): provide …


        … and use the efficient instruction where available and a fallback where not.

        1. 4

          Yes, I agree that that’s even better when available. However, that requires library support in the language standard, which tends to lag. Fortunately, we now have std::popcount() as of C++20.

    2. 3

      If you’re doing popcount over large, constant bit arrays, then one practical approach is to precompute the total in fairly large blocks (say, every 64 or 256 KB), and also individual counts in smaller blocks (perhaps 4 or 16 KB). Use the total from the last large block before the offset, add the smaller blocks, and then only do popcount for the few remaining words – this can give you a count (rank) in constant time, and it’s especially useful when all of the words are compressed and would need conversion to popcount.

      Many succinct data structures are built on top of these operations, so making them fast is critical.

    3. 3

      Inelegant but potentially efficient idea: a lookup table that maps every byte to the number of bits set in that byte. A table for a single byte should only require 256 bytes on an architecture that supports byte-aligned accesses (and a load-byte instruction), and that’ll fit into the L1 cache of any modern processor.

      1. 2

        Bit twiddling hacks has a nice overview:


        Including table lookup. For fun, I once did some benchmarks for a course, IIRC the table lookup did pretty ok for a ‘naive’ method.

      2. 1

        Alternately, a table for all u16 would fit in 256K, although you would have to mask/shift the first/last 4 bits.

        1. 1

          Hm? I’d have thought 64K, or 32K if you special-case zero or 0xffff out of band and store two entries per byte (32K conveniently being the size of the L1 D-cache on most current x86 CPUs).

          1. 1

            Oh, oops, I miscalculated. You’re right.

    4. 2

      Popcount is useful for implementing sparse arrays. Use a bitmap to mark which elements are populated, and an array containing those elements. Then given an index, popcount and some masking tell you which array index the value is at.

      This trick is used in the Hash Array-Mapped Trie data structure to make the 32- or 64-way tries less unwieldy.