1. 15
    1. 6

      Interestingly, the reason that bounds checks in Rust are slow is the same reason that they’re slow in C++: that the bounds error must be reported precisely. This means that the compiler cannot hoist bounds checks out of loops (insert a simple ‘is the maximum value in this loop out of bounds?’ check). It could hoist the check and generate two versions of the code, one that checks bounds on every loop iteration and one that doesn’t. I had a student do this in Julia a few years ago and it got a 4-8x speed up on a lot of kernels but almost doubled code size. This didn’t matter in Julia, where the code is generally smallish compute kernels, but it would suck for C++ or Rust.

      Some of the work on safer C++ recently has adopted a different model: a bounds error (which is UB in C++ if you’re not using an explicitly bounds-checked interface) may fast fail (no exception, no handling possible) at any point after it becomes dynamically reachable. This means that you can hoist bounds checks out of loops and just branch to a trap instruction with a predicted-not-taken branch on failure, which is almost free. It does make code much harder to debug though, because inlining and code reordering can result in the trap happening long before the thing that is actually a problem.

      1. 2

        a bounds error may fast fail at any point after it becomes dynamically reachable

        This reminds me of “imprecise exceptions” in Haskell. This is a thing in GHC where pure expressions that could throw more than one exception may throw any of them arbitrarily. e.g. (error "a") + (error "b") may throw either. But throwIO is precise.

        I think this is mainly there to avoid having to make very specific guarantees about evaluation order, but my vague memory of skimming old presentations says maybe it also speeds things up a bit.

        1. 1

          I think this is mainly there to avoid having to make very specific guarantees about evaluation order, but my vague memory of skimming old presentations says maybe it also speeds things up a bit.

          These are probably related: making guarantees about execution order limits the optimisations that GHC can do and so will cause slowdown in at least some cases.

          Floating point units also often have (had?) imprecise exception modes so the FPU and CPU could retire instructions independently and the FPU could run fast and just notify the CPU that a floating point exception had occurred on some prior floating point instruction, without forcing the CPU to wait.

      2. 1

        It could hoist the check and generate two versions of the code, one that checks bounds on every loop iteration and one that doesn’t.

        You’re right that I wouldn’t necessarily want that for production code, but sounds pretty handy for debug builds…