    I saw this and felt that there were some missed optimization opportunities. So, of course, I had to go in and try them out.

    As a baseline, my machine averages 426ns/iteration for the encode function. (I’m ignoring decoding because, as we’ll see, it’s just a special case of encoding.)

    First, we have some simple Rust-level optimizations: if we pass the blocks into the hamming code functions by value instead of by reference, Rust’s optimizer can keep them in a register instead of writing them back to memory. Further, x86_64 calling conventions leave the first few 64-bit integer arguments to a function in a register, so we can avoid three memory accesses. I’d imagine that a Sufficiently Smart Compiler™ might optimize away the dereference, but based on my benchmarks, Rust isn’t doing that. This alone brings encode down to 364ns/iter, a 16% improvement.
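    As a quick illustration of the by-value change (hypothetical function names, not the ones from the repo):

```rust
// Sketch of the pass-by-value change. With `&u64`, the callee may have
// to load the value through the pointer; with plain `u64`, the x86_64
// SysV calling convention delivers the argument directly in a register.
fn transform_by_ref(block: &u64) -> u64 {
    *block ^ 0x5555_5555_5555_5555 // the dereference may touch memory
}

fn transform_by_val(block: u64) -> u64 {
    block ^ 0x5555_5555_5555_5555 // already in a register
}

fn main() {
    let b = 0x1234_5678_9abc_def0u64;
    assert_eq!(transform_by_ref(&b), transform_by_val(b));
}
```

    The two are semantically identical; the difference only shows up in the generated code and, as above, in the benchmarks.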

    And now we have some algorithmic changes we can make. There are two changes here. First, we spend a lot of time shuffling bits around to put the parity bits at power-of-two bit positions. However, we could just as easily put them in the currently-unused high bits and save nearly all of the shuffling, at the cost of having to keep separate track of “physical” and “virtual” bit numbers during the parity calculation. Second, we can compute all of the parity bits at once by simply XORing together the virtual bit numbers of the set bits. (We also compute the word parity at the same time by always setting the low bit in that virtual bit number):

    const DATA_BITS: u64 = 57;
    const DATA_MASK: u64 = (1 << DATA_BITS) - 1;
    
    pub fn encode(mut block: u64) -> u64 {
        // We put the parity bits at the top for performance reasons
        
        return (block & DATA_MASK) |
               ((full_parity(block) as u64) << DATA_BITS);
    }
    
    pub fn full_parity(code: u64) -> u8 {
        // Bits 0, 1, and 2 of the putative check word are parity bits, so the first data bit is logically bit 3
        let mut dv = 3;
        let mut check = 0;
        for i in 0..(DATA_BITS as u8) {
            let mut virt_bit = i + dv;
            if virt_bit & (virt_bit - 1) == 0 {
                virt_bit += 1;
                dv += 1;
            }
            check ^= if code & (1 << i) != 0 { (virt_bit << 1) | 1 } else { 0 };
        }
    
        return check;
    }
    

    This brings us down to 139.31ns/iter, a 62% improvement. However, we’re not even remotely done yet.

    What if we could compute the parity values for multiple bits at once? We’re doing all of these computations in a 64-bit word, but we’re only using 7 bits of it. This seems wasteful, and in fact it is. Observe that for any given bit, the check value for that bit will either be 0 or a bit-specific value. We can then compute the parity bits for the first bit in each byte at the same time, then the second bit in each byte, and so on.

    We do this by defining a constant for each bit position:

    const PARITY_WIDE: [u64; 8] = [
        0x7f6f5f4f3d2d1b07,
        0x017161513f2f1d0b,
        0x0373635343311f0d,
        0x057565554533230f,
        0x0977675747352513,
        0x1179695949372715,
        0x217b6b5b4b392917,
        0x417d6d5d4d3b2b19,
    ];
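    These constants don’t need to be written by hand; they follow mechanically from the same virtual-bit-number scheme as the scalar full_parity. The sketch below (mine, not code from the repo) rebuilds the table: entry i, byte j is the check value for physical bit 8·j + i. Data bits occupy physical bits 0 through 56; physical bit 57 holds the word parity, and bits 58 through 63 hold the parity bits for virtual positions 1, 2, 4, 8, 16, and 32, which is what lets the decoder run the same function over a full received word.

```rust
// Sketch: deriving PARITY_WIDE instead of writing it by hand.
// table[i], byte j = the scalar check value ((virt << 1) | 1) for
// physical bit 8*j + i of the code word.
fn derive_parity_wide() -> [u64; 8] {
    let mut table = [0u64; 8];

    // Physical bits 0..57 are data bits; their virtual positions are
    // the non-power-of-two numbers starting at 3.
    let mut virt: u64 = 3;
    for phys in 0u64..57 {
        while virt & (virt - 1) == 0 {
            virt += 1; // skip virtual positions reserved for parity bits
        }
        let check = (virt << 1) | 1;
        table[(phys % 8) as usize] |= check << ((phys / 8) * 8);
        virt += 1;
    }

    // Physical bit 57 is the word parity (check bit 0 only), and
    // physical bits 58..64 carry the parity bits for virtual positions
    // 1, 2, 4, 8, 16, 32, so a received word checks out end to end.
    table[1] |= 1u64 << 56; // phys 57 lands in byte 7 of table[57 % 8]
    for k in 0u64..6 {
        let phys = 58 + k;
        let check = ((1u64 << k) << 1) | 1;
        table[(phys % 8) as usize] |= check << ((phys / 8) * 8);
    }
    table
}

fn main() {
    for (i, word) in derive_parity_wide().iter().enumerate() {
        println!("table[{}] = {:#018x}", i, word);
    }
}
```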
    

    Then, we mask out the check values that should be included and merge them into our check code:

        let mut check = 0;
        for i in 0..8 {
            let bitset = 0x01010101_01010101 & (code >> i);
            let code_part = u64::wrapping_sub(bitset << 8, bitset) & PARITY_WIDE[i];
            check ^= code_part;
        }
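
    (The wrapping subtraction is a standard byte-smearing trick: subtracting a word whose bytes are all 0x00 or 0x01 from itself shifted left by 8 turns every 0x01 byte into a 0xFF mask. A quick demonstration:)

```rust
// Demonstration of the (bitset << 8) - bitset mask trick used above:
// each 0x01 byte expands to 0xFF, each 0x00 byte stays 0x00.
fn main() {
    let bitset: u64 = 0x0000_0001_0000_0001;
    let mask = u64::wrapping_sub(bitset << 8, bitset);
    assert_eq!(mask, 0x0000_00ff_0000_00ff);

    // wrapping_sub is what makes a set bit in the top byte still work:
    let top: u64 = 0x0100_0000_0000_0000;
    assert_eq!(u64::wrapping_sub(top << 8, top), 0xff00_0000_0000_0000);
}
```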
    

    Finally, we combine the 8 check codes in our check word horizontally, the same way that the fast parity function worked:

        check ^= check >> 32;
        check ^= check >> 16;
        check ^= check >> 8;
    
        return (check & 0xFF) as u8;
    

    Using this new full_parity function, our time goes down to 8.5ns/iter, a 93% improvement!
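    For reference, the fragments above assemble into something like the following (my paraphrase, not the repo’s exact code). A handy property is that the wide version can be cross-checked against the scalar loop directly:

```rust
const DATA_BITS: u64 = 57;
const DATA_MASK: u64 = (1 << DATA_BITS) - 1;

const PARITY_WIDE: [u64; 8] = [
    0x7f6f5f4f3d2d1b07,
    0x017161513f2f1d0b,
    0x0373635343311f0d,
    0x057565554533230f,
    0x0977675747352513,
    0x1179695949372715,
    0x217b6b5b4b392917,
    0x417d6d5d4d3b2b19,
];

// The earlier scalar version, kept for cross-checking.
pub fn full_parity_scalar(code: u64) -> u8 {
    let mut dv = 3;
    let mut check = 0;
    for i in 0..(DATA_BITS as u8) {
        let mut virt_bit = i + dv;
        if virt_bit & (virt_bit - 1) == 0 {
            virt_bit += 1;
            dv += 1;
        }
        check ^= if code & (1 << i) != 0 { (virt_bit << 1) | 1 } else { 0 };
    }
    check
}

// Wide version: eight masked table lookups, then a horizontal XOR fold.
pub fn full_parity_wide(code: u64) -> u8 {
    let mut check = 0u64;
    for i in 0..8 {
        let bitset = 0x01010101_01010101u64 & (code >> i);
        check ^= u64::wrapping_sub(bitset << 8, bitset) & PARITY_WIDE[i];
    }
    check ^= check >> 32;
    check ^= check >> 16;
    check ^= check >> 8;
    (check & 0xff) as u8
}

fn main() {
    // For data-only words the two implementations must agree. (The wide
    // table additionally covers the parity bits in the high byte, which
    // the scalar data loop never looks at.)
    for block in [0u64, 1, 0xdead_beef, DATA_MASK] {
        assert_eq!(full_parity_scalar(block & DATA_MASK),
                   full_parity_wide(block & DATA_MASK));
    }
}
```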

    (As an aside, how is this so much faster? Normally, you’d expect that because the loop runs 1/8 of the number of times, it would be an 87.5% improvement. However, we no longer have an unpredictable branch in the inner loop, and that saves a lot of time.)

    I’m sure there are more tricks that could be used to eke out a bit more performance, but I’m happy with where this stands now :-D

    If you’d like to play with my code, I’ve pushed it here: https://github.com/thequux/hamming-code