1. 24

  2. 28

    This is fascinating, but it’s also very time limited in value. In snmalloc, we’re using a small lookup table for sizeclass calculation and we’ve benchmarked it: it is faster than doing the arithmetic on modern hardware in our usage. The really useful lessons here are:

    • Anything you learn about performance may change between microarchitectures and across microarchitectural changes. Always benchmark.
    • Trades between compute and memory often change over time and can be very different on CPU vs accelerators.
    • Microbenchmarks for these trades aren’t always representative. Lookup tables suffer from L1 d-cache misses, computing suffers from L1 i-cache misses. The relative cache pressures may change between workloads.

    The most important of these is that any received wisdom on what is fast is probably wrong, even if it was completely correct when the person who originally said it made the claim. Always measure!

    1. 8

      Why is there virtually nothing about the cache hierarchy in this article? Cache misses are what will primarily determine the viability of lookup tables, and one should size a lookup table with cache sizes in mind.

      1. 6

        An ML model is a glorified lookup table. A hash map is a lookup table. An array is a lookup table.

        If the value you are looking up is easily computable, it might not need a lookup table.

        1. 1

          If the value you are looking up is easily computable, it might not need a lookup table.

          Obviously, “easily computable” differs based on your environment. For junk I do, the only times my starting position is a lookup table is when IO is involved or when working with a lookup table is much easier for me to model mentally. Otherwise it takes a measurement to drive me towards a lookup table. Even when that “lookup table” only takes the form of throwing @memoize in front of my function declaration.

        2. 5

          Aside from the other counter-points, some of the other reasons for lookup tables are:

          • Cross platform (OS/processor). There’s no need to figure out how to write the code again, the lookup table will work.
          • Part of an algorithm written for smaller processors - CORDIC used to be in very small embedded systems.

          And for emphasis, I’ll repeat @david_chisnall “Always measure!”

          1. 4

            In 100% of my test cases, using a lookup tables is not good for performance.

            This just isn’t true of this set of benchmarks. In almost every case for sin and cosine one or two other methods were slower than a lookup table. By contrast for bit counting the lookup was always the slowest.

            There’s a much more interesting story here: why are these the relative speeds, and would they look different when embedded in a real workload?

            1. 1

              The problem with this survey, is that it assumes you’re making function calls for your approximation, the cost of which dwarfs indexed access to a global. In an environment where performance was considered important enough to warrant the loss of precision with a lut, it is unreasonable to think that the access functions would not be inlinable.