  2. 12

    C is that way because reality is that way

    I think this misses the fact that the C abstract machine increasingly does not correspond to what computers actually do. “Reality” isn’t really “that way” at all. The truer answer is “because it is convenient to define an abstraction that is close to assembly (which itself is still an abstraction)”.

    See C is not a Low-Level Language.

    1. 4

      This is actually not a bad rundown, though I feel like the discussion of UB lacks the correct nuance. When referring to integer overflow:

      The GNU C compiler (gcc) generates code for this function which can return a negative integer

      No, it doesn’t “return a negative integer”, it has already hit undefined-behaviour-land by that point. The program might appear to behave as if a negative integer was returned, but it may not do so consistently, and that is different from a negative integer actually being returned. The program might even exhibit odd behaviours that don’t correspond to the value being negative or to the arithmetically correct value, or that don’t appear to involve the value at all. (Of course, at the machine level, it might do a calculation which stores a negative result into a register or memory location; but, that’s the wrong level to look at it, because the presence of the addition operation has effects on compiler state that can affect code generation well beyond that one operation. Despite the claim being made often, C is not a “portable assembler”. I’m glad this particular article doesn’t make that mistake).

      1. 3

        What? The code in question:

        int f(int n)
        {
            if (n < 0)
                return 0;
            n = n + 100;
            if (n < 0)
                return 0;
            return n;
        }
        

        What the article is saying is that on modern C compilers, the check for n < 0 indicates to the compiler that the programmer is rejecting negative numbers, and because programmers never invoke undefined behavior (cough cough yeah, right) the second check when n < 0 can be removed because of course that can’t happen!

        So what can actually happen in that case? An aborted program? Reformatted hard drive? Or a negative number returned from f() (which is what I suspect would happen in most cases)? Show generated assembly code to prove or disprove me please … (yes, I’m tired of C language lawyers pedantically warning about possible UB behavior).
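
        For reference, here is roughly the C-level equivalent of what a typical optimising compiler (gcc or clang at -O2, say) reduces f() to once it drops the second check. This is an illustrative sketch of the transformation being discussed, not actual compiler output, and the wrapping you then observe on x86/ARM is a property of the generated code, not of C:

        /* Illustrative sketch: roughly what f() becomes once the optimiser
         * removes the second (n < 0) check.  Not actual compiler output. */
        int f_optimised(int n)
        {
            if (n < 0)
                return 0;
            return n + 100;   /* on x86/ARM this add wraps, so f_optimised(INT_MAX) can come back negative */
        }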

        1. 3

          because programmers never invoke undefined behavior

          They shouldn’t, but they often do. That’s why articles such as the one in the title should be super clear about the repercussions.

          So what can actually happen in that case?

          Anything - that’s the point. That’s what the “undefined” in “undefined behaviour” means.

          (yes, I’m tired of C language lawyers pedantically warning about possible UB behavior).

          The issue is that a lot of this “possible UB behaviour” is actual compiler behaviour, but it’s impossible to predict which exact behaviour you’ll get.

          You might be “tired of C language lawyers pedantically warning about possible UB behaviour”, but I’m personally tired of programmers invoking UB and thinking that it’s ok.

          1. 1

            They shouldn’t, but they often do.

            Yes they do, but only because there are a lot of undefined behaviors in C. The C standard lists them all (along with unspecified, implementation-defined and locale-specific behaviors). You want to know why they often do? Because C89 defined about 100 undefined behaviors, C99 about 200 and C11 300. It’s a bit scary to think that C code that is fine today could cause undefined behavior in the future. I guess C is a bit like California: in California everything causes cancer, and in C, everything is undefined.

            A lot historically came about because of different ways CPUs handle certain conditions—the 80386 will trap any attempt to divide by 0 [1] but the MIPS chip doesn’t. Some have nothing to do with the CPU—it’s undefined behavior if a C file doesn’t end with a new line character. Some have to do with incorrect library usage (calling va_arg() without calling va_start()).

            I’m personally tired of programmers invoking UB and thinking that it’s ok.

            Undefined behavior is just that: undefined. Most of the undefined behavior in C is pretty straightforward (like calling va_arg() incorrectly); it’s really only signed-integer math and pointers where undefined behavior causes most of the problems. Signed-integer math is bad only in that it might generate invalid indices for arrays or for pointer arithmetic (I mean, incorrect answers are still bad, but I’m thinking more of security here). Outside of that, I don’t know of any system in general use today that will trap on signed overflow [2]. So I come back to my original “What?” question. The x86 and ARM architectures have well defined signed integer semantics (they wrap! I’ve yet to come across a system where that doesn’t happen, again [2]), so is it any wonder that programmers will invoke UB and think it’s okay?
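
            Aside: if wrap-around really is what you want, you can get it without invoking signed-overflow UB at all; a minimal sketch (assuming a 2’s complement target, helper name mine):

            /* Wrap-around addition without signed-overflow UB: unsigned
             * arithmetic is defined to wrap, and converting the out-of-range
             * result back to int is implementation-defined rather than
             * undefined; on 2's complement machines it simply wraps. */
            int wrapping_add(int a, int b)
            {
                return (int)((unsigned int)a + (unsigned int)b);
            }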

            And for pointers, I would hazard a guess that most programmers today don’t have experience with segmented architectures, which is where a lot of the weirder pointer rules probably stem from. Pointers by themselves aren’t the problem per se; it’s C’s semantics with pointers and arrays that lead to most, if not all, of the problems with undefined behavior with pointers (in my opinion). Saying “Oh! Undefined behavior has been invoked! Abandon all hope!” doesn’t actually help.

            [1] IEEE-754 floating point doesn’t trap on division by 0.

            [2] I would love to know of a system where signed overflow is trapped. Heck, I would like to know of a system where trap representations exist! Better yet, name the general purpose systems I can buy new, today, that use sign magnitude or 1s-complement for integer math.

            1. 2

              Because C89 defined about 100 undefined behaviors, C99 about 200 and C11 300

              It didn’t define them; it listed circumstances which have undefined behaviour. This may seem nit-picky, but correctly understanding what “undefined behaviour” means is the premise of my original post.

              A draft of C17 that I have lists 211 undefined behaviours. An article on UB - https://www.cs.utah.edu/~regehr/ub-2017-qualcomm.pdf - claims 199 for C11. I don’t think your figure of 300 is correct.

              A bunch of the C11 circumstances for UB are to do with the multi-threading support which didn’t exist in C99. In general I don’t think there’s any strong reason to believe that code with clearly well-specified behaviour now will have UB in the future.

              So I come back to my original “What?” question

              It’s not clear to me what your “what?” question is about. I elaborated in the first post on what I meant by “No, it doesn’t “return a negative integer””.

              Compilers will, for example, remove checks for conditions that are impossible (in the absence of UB), and do other things that may be even harder to predict; C programmers should be aware of that.

              Now, if you want to argue “compilers shouldn’t do that”, I wouldn’t necessarily disagree. The problem is: they do it, and the language specification makes it clear that they are allowed to do it.

              The x86 and ARM architectures have well defined signed integer semantics

              so is it any wonder that programmers will invoke UB and think it’s okay?

              This illustrates my point: if we allow the view of C as a “portable assembly language” to be propagated, and especially the view of “UB is just the semantics of the underlying architecture”, we’ll get code being produced which doesn’t work (and worse, is in some cases exploitable) when compiled by today’s compilers.

              1. 1

                I don’t think your figure of 300 is correct.

                You are right. I recounted, and there are around 215 or so for C11. But there’s still that doubling from C89 to C99.

                No, it doesn’t “return a negative integer”, it has already hit undefined-behaviour-land by that point.

                It’s not clear to me what your “what?” question is about.

                Unless the machine in question traps on signed overflow, the code in question returns something when it runs. Just saying “it’s undefined behavior! Anything can happen!” doesn’t help. The CPU will either trap, or it won’t. There is no third thing that can happen. An argument can be made that CPUs should trap, but the reality is nearly every machine being programmed today is a byte-oriented, 2’s complement machine with defined signed overflow semantics.

                1. 1

                  Just saying “it’s undefined behavior! Anything can happen!” doesn’t help

                  It makes it clear that you should have no expectations on behaviour in the circumstance - which you shouldn’t.

                  Unless the machine in question traps on signed overflow, the code in question returns something when it runs.

                  No, as already evidenced, the “result” can be something that doesn’t pass the ‘x < 0’ check yet displays as a negative when printed, for example. It’s not a real value.

                  The CPU will either trap, or it won’t

                  C’s addition doesn’t map directly to the underlying “add” instruction of the target architecture; it has different semantics. It doesn’t matter what the CPU will or won’t do when it executes an “add” instruction.
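
                  To make that concrete, here is the sort of harness that exhibits the inconsistency being described. This is illustrative only; the exact outcome depends on the compiler and optimisation level, which is precisely the point:

                  #include <limits.h>
                  #include <stdio.h>

                  /* With a typical optimising compiler the second (n < 0) check
                   * inside f() is removed, yet the caller may still see a negative
                   * number printed: the overflowed "value" is not treated
                   * consistently as either negative or non-negative. */
                  static int f(int n)
                  {
                      if (n < 0)
                          return 0;
                      n = n + 100;      /* UB when n > INT_MAX - 100 */
                      if (n < 0)        /* often optimised away */
                          return 0;
                      return n;
                  }

                  int main(void)
                  {
                      printf("%d\n", f(INT_MAX));   /* may print a negative number */
                      return 0;
                  }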

        2. 1

          Yes, the code generated does in fact return a negative integer. You shouldn’t rely on it; another compiler may do something different. But once compiled, undefined behaviour isn’t relevant anymore. The generated x86 does in fact contain a function that may return a negative integer.

          Again, it would be completely legal for the compiler to generate code that corrupted memory or ejected your CD drive. But this statement is talking about the code that happened to be generated by a particular run of a particular compiler. In this case it did in fact emit a function that may return a negative number.

          1. 1

            When we talk about undefined behaviour, we’re talking about the semantics at the level of the C language, not the generated code. (As you alluded, that wouldn’t make much sense.)

            At some point you have to map semantics between source and generated code. My point was, you can’t map the “generates a negative value” of the generated code back to the source semantics. We only say it’s a negative value on the basis that its representation (bit pattern) is that of a negative value, as typically represented in the architecture, and even then we’re assuming that some register (for example) that is typically used to return values does in fact hold the return value of the function …

            … which it doesn’t, if we’re talking about the source function. Because that function doesn’t return once undefined behaviour is invoked; it ceases to have any defined behaviour at all.

            I know this is highly conceptual and abstract, but that’s at the heart of the message - C semantics are at a higher level than the underlying machine; it’s not useful to think in terms of “undefined behaviour makes the function return a negative value” because then we’re imposing artificial constraints on undefined behaviour and what it is; from there, we’ll start to believe we can predict it, or worse, that the language semantics and machine semantics are in fact one-to-one.

            I’ll refer again to the same example as was in the original piece: the signed integer overflow occurs and is followed by a negative check, which fails (“is optimised away by the compiler”, but remember that optimisation preserves semantics). So, it’s not correct to say that the value is negative (otherwise it would have been picked up by the (n < 0) check); it’s not guaranteed to behave as a negative value. It’s not guaranteed to behave any way at all.

            Sure, the generated code does something and it has much stricter semantics than C. But saying that the generated function “returns a negative value” is lacking the correct nuance. Even if it’s true that in some similar case, the observable result - from some particular version of some particular compiler for some particular architecture - is that the number always appears to be negative, this is not something we should in any way suggest is the actual semantics of C.
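
            For completeness, the way to express the check the original f() was presumably aiming for, without ever invoking UB, is to test against INT_MAX before performing the addition. A minimal sketch (the function name is mine):

            #include <limits.h>

            /* Reject inputs that would overflow before adding, so no undefined
             * behaviour ever occurs and there is nothing the compiler can
             * legally optimise away. */
            int f_checked(int n)
            {
                if (n < 0)
                    return 0;
                if (n > INT_MAX - 100)   /* n + 100 would overflow */
                    return 0;
                return n + 100;
            }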

          2. 0

            Of course, at the machine level, it might do a calculation which stores a negative result into a register or memory location; but, that’s the wrong level to look at it, because the presence of the addition operation has effects on compiler state that can affect code generation well beyond that one operation.

            Compilers specifically have ways of ensuring that there is no interference between operations, so no. This is incorrect. Unless you want to point to the part of the GCC and Clang source code that decides unexpectedly to stop doing that?

            1. 1

              In the original example, the presence of the addition causes the following negative check (n < 0) to be omitted from the generated code.

              Unless you want to point to the part of the GCC and Clang source code that decides unexpectedly to stop doing that?

              If that’s at all a practical suggestion, perhaps you can go find the part that ensures “that there is no interference between operations” and point that out?

              1. 1

                In the original example, the presence of the addition causes the following negative check (n < 0) to be omitted from the generated code.

                Right, because register allocation relies upon UB for performance optimization. It’s the same in both GCC and Clang (Clang is actually worse with regard to its relentless use of UB to optimize opcode generation; presumably this is also why they have more tooling around catching errors and sanitizing code). This is a design feature from the perspective of compiler designers. There is absolutely nothing in the literature to back up your point that register allocation suddenly faceplants on UB – I’d be more than happy to read it if you can find it, though.

                If that’s at all a practical suggestion, perhaps you can go find the part that ensures “that there is no interference between operations” and point that out?

                *points at the entire register allocation subsystem*

                But no, the burden of proof is on you, as you made the claim that the register allocator and interference graph fails on UB. It is up to you to prove that claim. I personally cannot find anything that backs your claim up, and it is common knowledge (backed up by many, many messages about this on the mailing list) that the compiler relies on Undefined Behaviour.

                Seriously, I want to believe you. I would be happy to see another reason why having the compiler rely on UB is a negative point. For this reason I also accept a code example where you can use the above example of UB to cause the compiler to clobber registers and return an incorrect result. The presence of a negative number alone is not sufficient, as that does not demonstrate register overwriting.

                1. 2

                  There is absolutely nothing in the literature to back up your point that register allocation suddenly faceplants on UB

                  What point? I think you’ve misinterpreted something.

                  you made the claim that the register allocator and interference graph fails on UB

                  No, I didn’t.

                2. 1

                  It isn’t the addition; the second check is omitted because n is known to be greater than 0. Here’s the example with value range annotations for n.

                  int f(int n)
                  {
                      // [INT_MIN, INT_MAX]
                      if (n < 0)
                      {
                          // [INT_MIN, -1]
                          return 0;
                      }
                      // [0, INT_MAX]
                      n = n + 100;
                      // [100, INT_MAX] - overflow is undefined so n must be >= 100 
                      if (n < 0)
                      {
                          return 0;
                      }
                      return n;
                  }
                  
                  1. 2

                    You’re correct that I oversimplified it. The tone of the person I responded to was combative and I couldn’t really be bothered going into detail again on something that I’ve now gone over several times in different posts right here in this discussion.

                    As you point out, it’s the combination of “already compared to 0” and “added a positive integer” that makes the final comparison to 0 redundant. The original point stands: the semantics of C, and in particular the possibility of UB, mean that a simple operation can affect later code generation.

                    Here’s an example that works without interval analysis (edit: or rather, that requires slightly more sophisticated analysis):

                    int f(int n)
                    {
                        int orig_n = n;
                        n = n + 100;
                        if (n < orig_n)
                        {
                            return 0;
                        }
                        return n;
                    }
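
                    As an aside, if you want a check like this that the optimiser cannot legally remove, GCC and Clang provide __builtin_add_overflow(), which reports overflow without ever performing the undefined signed addition (compiling with -fwrapv similarly makes signed overflow defined as wrapping). A sketch using the builtin, function name mine:

                    /* Overflow detection with no UB, so the branch cannot be
                     * optimised away by reasoning about impossible overflow. */
                    int f_builtin(int n)
                    {
                        int result;
                        if (__builtin_add_overflow(n, 100, &result))
                            return 0;     /* the addition would have overflowed */
                        return result;
                    }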
                    
