1. 7

  2. 4

    The thing about 8-bit numbers is that they’re very small: there are only 256 possible values. The naive approach is therefore to have a look-up table. As a bonus, that also gives you a C string that you can return without doing any memory allocation (so if you want to return a char*, rather than insert it into another string, you can do the whole thing with a single lea instruction on x86). The resulting string is 64 bits (eight characters) if you’re storing it into preallocated storage, so the entire function can be three instructions on an architecture with rich addressing modes (a rough C sketch follows the list):

    1. 64-bit PC-relative load with base address as a static relocation + 8 * argument register value.
    2. Store result into address given by second argument register.
    3. Return.
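
    A minimal C sketch of that, with my own names and table layout (8 packed characters per entry; returning a char* directly would instead need a ninth NUL byte per entry, or 16-byte padded rows):

    ```c
    #include <stdint.h>
    #include <string.h>

    /* 256 entries of 8 characters each: 2 KiB. Filled once at startup;
       in practice it could just as well be generated at build time. */
    static char table[256][8];

    static void init_table(void) {
        for (int v = 0; v < 256; v++)
            for (int bit = 0; bit < 8; bit++)
                table[v][bit] = (v & (0x80 >> bit)) ? '1' : '0';
    }

    /* With the table address as a static relocation, this boils down to a
       64-bit load at table + 8*v, a 64-bit store to out, and a return. */
    void u8_to_binary(uint8_t v, char out[8]) {
        memcpy(out, table[v], 8);
    }
    ```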

    If you want better cache usage, you can do it as two nybbles. This requires two loads, two stores, two bitfield extracts, and a return, so seven instructions (on older CPUs that don’t have a bitfield-extract instruction, the low extract is an and-with-immediate and the high one is an and-with-immediate plus a shift-right-with-immediate, so one instruction more). This may be faster than the three-instruction version because the entire look-up table fits in 4 cache lines.
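
    A sketch of that nybble variant under the same assumptions, with the table packed as a single string (padding the entries for easier addressing would grow it somewhat, but it still stays within a few cache lines):

    ```c
    #include <stdint.h>
    #include <string.h>

    /* 16 four-character entries packed back to back: 64 bytes of data. */
    static const char nyb[] =
        "0000" "0001" "0010" "0011" "0100" "0101" "0110" "0111"
        "1000" "1001" "1010" "1011" "1100" "1101" "1110" "1111";

    /* Two extracts, two 4-byte loads, two 4-byte stores, and a return. */
    void u8_to_binary_nyb(uint8_t v, char out[8]) {
        memcpy(out,     nyb + 4 * (v >> 4),  4);  /* high nybble */
        memcpy(out + 4, nyb + 4 * (v & 0xf), 4);  /* low nybble  */
    }
    ```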

    I’m really curious whether the version in the article, which has a multiply but no loads, is faster or slower than this.

    1. 1

      And then you just call that 8-bit version several times and concatenate the results to handle larger numbers.
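
      For instance, a sketch for 32-bit values built on the hypothetical u8_to_binary helper from the sketch above:

      ```c
      #include <stdint.h>

      /* From the earlier sketch: writes the 8-character binary form of v. */
      void u8_to_binary(uint8_t v, char out[8]);

      /* Binary string of a 32-bit value, most significant byte first,
         reusing the 8-bit helper for each byte. */
      void u32_to_binary(uint32_t v, char out[33]) {
          for (int i = 0; i < 4; i++)
              u8_to_binary((uint8_t)(v >> (24 - 8 * i)), out + 8 * i);
          out[32] = '\0';
      }
      ```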

    2. 1

      You can spoil the article for yourself by converting the magic numbers at the top to hex. :) (Reminds me of some of the things in Hacker’s Delight, which I enthusiastically recommend.)