1. 6

  2. 2

    I noticed that your round function rounds away from zero, just like C and Posix. This jumped out at me because the languages I’ve been using recently, which care about numerics, use round-to-even semantics instead for their round function. In fact, the IEEE float standard defaults to round-to-even for rounding arithmetic results, and doesn’t have a round-away-from-zero rounding mode. This is because round-to-even reduces bias.

    So I went down a rabbit hole. Why does C round use round-away-from-zero? My first guess is that it’s because that’s how the PDP-11 works, and C was originally designed as a thin layer of abstraction over PDP-11 assembly language. I checked, and indeed, the PDP-11 floating point unit had pre-IEEE float semantics, and defaulted to rounding away from zero. So that’s probably it.

    It’s weird that we are still building software that prioritizes PDP-11 floating point semantics over IEEE float semantics and modern best practice for floating point computation. It’s because of C – the way C works is often more important than those other considerations.

    1. 1

      Yup that’s explained in the article, notably when I’m mentioning the Python rounding. Rounding to even needs a different method which I didn’t investigate yet. Do you have a proposition for that? With a proof? I’m curious if it’s going to be simpler or not

      1. 1

        Edited the post to add a version for the round to even choice.

    2. 1

      Does C99 guarantee that the sign of the result of xoring two integers is the xor of their signs? As far as I can tell the ^ operator just operates on the binary representations of the integral values, making it implementation defined. If you wanted to have it be defined, I think upcoming C23 requires two’s complement representation for integers, making it trivially defined.

      One place where this could cause issues is on a machine with one’s complement representation, such as the 2015 UNISYS ClearPath Dorado 8300. If you xor a number with its negation, e.g. 1 with -1, you will end up with an all-bits-set result, corresponding to a negative zero. Since -0 = 0, (a^b) < 0 would return false even though the division would produce a negative result.

      Also, you should probably note that abs(INT_MIN) is undefined alongside the alternative div_ceil() and div_floor() implementations making use of it.

      1. 1

        I added a note about the xor thing, and actually realized a mistake with a^b. I also added a note about abs(INT_MIN).

        Thanks for pointing these out!

      2. 1

        For integers in the range [-((2^52)-1), (2^52)-1], you can evaluate floor and ceiling division using float math (ie rounding twice), and get correct results. This is pretty easy to see if you consider that:

        1. Division is monotonic

        2. All integers within the range are exactly representable

        3. Roundoff can never exceed the threshold where rounding twice would be a problem, because the denominator always has magnitude >1, meaning that the precision of the result will be at least that of the inputs

        Something similar applies to sqrt. Knowing this—and when it’s true—is valuable, because many processors provide optimised vectorised instructions for floating-point numbers which can run faster than the corresponding integer operations (if they even exist at all).

        (Floats can represent all integers with magnitude <=2^53, but when I tried to prove that sqrt gave correct results for all integers in that range, I encountered some annoyances; leaving a bit of buffer simplified matters. I have a strong intuition that it works for the full range, but haven’t bothered working out the details yet.)

        1. 1

          Float precision for integers breaks at 2²⁴. You can try it with this:

          #include <stdio.h>
          int main()
              for (unsigned i = 0; i != ~0U; i++) {
                  float f = i;
                  if ((unsigned)f != i) {
                      fprintf(stderr, "precision breaks at i=0x%x (f=%g)\n", i, f);
                      return 0;
              return 0;

          => precision breaks at i=0x1000001 (f=1.67772e+07)

          What code snippets are you actually suggesting for floor and ceil?

          Note that the reason for working with integers are multiple, performance is one but not the main one. Working with floats can be problematic for all sort of reasons, and sometimes you want to avoid them entirely: in kernel land, with specific arch with inefficient or unavailable floating point, when precision/determinism is needed (don’t want system/libc approximations to slip in), …

          1. 1

            I meant double-precision floats, obviously.


            The basic arithmetic operations are guaranteed to be deterministic and faithfully rounded. And my comment described circumstances under which floating-point math will give exact results. Floating-point math is not magical fairy dust that sprinkles imprecision over your code; it can be understood and tamed.

            I agree that in a kernel or on a float-free arch, it’s desirable to avoid float-free math, but both of those are the exception rather than the rule for most code.

            1. 2

              The basic arithmetic operations are guaranteed to be deterministic and faithfully rounded. … Floating-point math is not magical fairy dust that sprinkles imprecision over your code;

              This is not true for GPU programming, where float arithmetic is not guaranteed to follow the IEEE standard. Also, if you are coding to certain GPU APIs, like WebGPU or Metal, then you do not have access to 64 bit floats. With other APIs, like Vulkan, you have to test at runtime if doubles are available. On GPUs, float arithmetic is fast, but you give up reproducible results across different GPUs.

              1. 2

                It’s not about imprecision. One of the argument I made was determinism: depending on the system, arch, toolchain version, etc the float operations won’t be the same (different instructions, exploiting different implementation-free behavior, or differently implemented maths functions on the system, etc) and you may get different results depending on your environment. This makes tests much more complex typically (implies a threshold infrastructure of some sort, which is not always doable).