I’m curious what the setup cost for libdivide is, since it has to find the multiplicative modular inverse of the divisor you pass in. You would need to count the trailing zeroes of the divisor, and then do a bunch of real div/mods for Euclid’s algorithm to get the inverse of the factor of the divisor that is coprime to 2**n?
I don’t think it works like that, the method you mention can only be used to divide multiples of the divisor. E.g. 11 is the modular inverse of 3 mod 16, and indeed we can use this to find 6/3 by calculating 6 * 11 = 66 = 2 mod 16. But we can’t use the same method to calculate 7/3, since it gives 7 * 11 = 13 which is obviously incorrect (although it is correct for 13 * 3 = 39, since 7 = 39 mod 16).
The following explanation doesn’t take into account vectorization; I don’t know exactly how it works (although I can make an educated guess I rather stick to what I know).
As far as I know, for a divisor d, libdivide precalculates a N-bit binary expansion of 1/d (rounded to closest with a preference to rounding up since it is easier). Then to do a division of an N-bit value, we multiply by the precomputed value, take the high N bits (so the multiplication needs to give this bits, so we need either a widening multiply or a multiplication that only gives these high bits), and shift them to the right.
So, essentially the precomputed value is m = (1 << k) / d for some suitable k >= N, and n / d is evaluated as (m * n) >> k.
Now, picking k still requires some theory, and the situation is slightly different for signed integers, but this is the gist of it.
I have implemented the optimization myself, I’d expect the one from libdivide to be similar:
Thanks! I’m sure I’ve seen the multiplicative modulo inverse used in practice for something, but I cannot remember what.