
  2. 10

    I strongly oppose describing optimizations enabled by “pointers are not integers” as “nice-to-have”. Without some assumptions about pointers, you can’t copy-propagate local variables across function calls. This is very different from the usual UB-exploiting optimizations and is actually material to performance.

    (If pointers are integers, you can guess the address of local variables and modify them, even if no address is taken. Compilers can assume this does not happen because pointers are not integers.)
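
    A minimal sketch of what’s at stake (f and g are made-up names): because no pointer to x ever exists, the compiler may assume the opaque call cannot touch it.

    int f(void);  // opaque: the compiler can't see its body

    int g(void) {
        int x = 42;  // address of x is never taken
        f();         // may not legally modify x, since pointers aren't integers
        return x;    // so this may be compiled as `return 42;`
    }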

    1. 3

      Agreed. Pointer aliasing is a huge problem for C-family compilers; IIRC I’ve heard it called the biggest roadblock in the way of more extensive optimization. (And not just optimization; it screws up static analysis of program correctness too.)

      It’s pretty telling that every newer static/compiled language I can think of either makes it impossible or banishes it to an “unsafe” mode.

    2. 5

      This article seems to imply that the only reason why “pointers are complicated” is that they can be optimized; that the simple mental model of a pointer as an integer is essentially correct but optimization throws a spanner in the works.

      That’s incorrect though. Optimization isn’t fundamentally relevant to any of this. The issue is the C specification; in the C abstract machine, a pointer is not an integer. Pointers are their own magical creatures which behave in a way that’s different from integers, because they’re specified differently from integers. In C, pointers behave the way the C standard says pointers behave.

      The author might prefer the “simple and obvious pointer model” provided by the hardware they’re programming for, but the “simple and obvious pointer model” isn’t the pointer model used by the language they’re writing when they’re writing C. If your program depends on the “simple and obvious pointer model”, and does things which aren’t specified by the actual specification, it’s not a C program.

      The only reason optimization is relevant is that it has a higher probability of breaking incorrect C programs. No transformation the compiler does will ever break a correct C program (except for compiler bugs of course, but those are at least relatively rare). Notice that I said “higher probability”; optimization isn’t guaranteed to break an incorrect C program, and disabling optimization isn’t guaranteed to make an incorrect C program work. If your C program exhibits undefined behavior, such as relying on the “simple and obvious pointer model”, there is no way to know what it is supposed to do.
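
      To make that concrete, a minimal illustration (mine, not the article’s): two pointers can hold the same address yet not be interchangeable, because the standard tracks which object each was derived from.

      #include <stdio.h>

      int main(void) {
          int a = 1, b = 2;
          int *p = &a + 1;  // one-past-the-end of a: a valid pointer value
          int *q = &b;
          // Whether p == q compares equal is unspecified; even when the
          // addresses match, the standard does not let you reach b through
          // p, because p was derived from a. Same bits, different pointers.
          if ((void *)p == (void *)q)
              printf("same address, still not interchangeable\n");
          return 0;
      }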

      1. 5

        The issue is the C specification

        Where do you think this specification comes from? Who do you think wrote it? My guess right now would be mostly compiler writers.

        The reason for Undefined Behaviour shifted over time. Once, it was used to allow different platforms to behave differently (signed integer overflow), and to avoid dealing with hard (and uninteresting) problems with the memory model (stack overflow). Programs relying on certain types of undefined behaviour weren’t incorrect. They just relied on the characteristics of their platform, and compilers mostly respected that.

        At some point, though, it became about optimisation. One emblematic moment, perhaps even the turning point, was strict aliasing: a new type of undefined behaviour introduced just so we could optimise better. Somewhere along the line, compiler writers discovered that assuming signed integer overflow never happens lets them optimise some loops. That it ended up triggering actual security vulnerabilities was ultimately secondary.
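
        To illustrate the kind of reasoning involved (my sketch, not from the thread; stays_bigger is a made-up name):

        // Under ISO C rules a compiler may assume i + 1 never overflows,
        // and so may fold this whole function to `return 1;`. With
        // wrapping semantics it would have to return 0 for i == INT_MAX.
        int stays_bigger(int i) {
            return i + 1 > i;
        }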

        We have now got to the point where avoiding undefined behaviour is virtually impossible. Even in the most trivial piece of code, with no I/O and no dependencies, I managed to let a couple of instances of undefined behaviour slip through. One of them was an uninitialised read:

        uint32_t x;
        uint32_t y = 0;
        uint32_t z = x & y; // reading x is undefined
        

        You’d think z would be 0, but no, because x is not initialised. This makes no sense on almost every processor out there: whatever value x holds, it will be masked out, and the result is obviously zero. Thing is, on some platforms, x could conceivably hold a trap representation, and it makes sense that reading it would be undefined on those platforms. But no, the standard in its infinite wisdom made it undefined on all platforms, leaving a breach open for optimisers.

        Fun fact about the C standard: there is no such thing as a behaviour being defined on some platforms and undefined on others. At least not directly, explicitly so. (-256 * -256 is undefined on 16-bit platforms and defined on 32-bit platforms, but only because the result, 65536, overflows a 16-bit int while fitting easily in a 32-bit one. Signed integer overflow in general is undefined on both.) I believe the reason for this is to avoid excessive nitpicking & pedantry. That spirit is gone. Compiler writers now conform to the letter of the spec, just so they can generate ever more efficient code.

        Before someone goes “but you can use -fwrapv” or some other well-meaning crap, bear in mind that having an official, stamped standard makes for an overwhelmingly powerful default. If you rely on -fwrapv, you run a significant risk of the code that relies on this option being reused in a context that doesn’t enable it. Because of that, such options might as well not exist at all in many settings. One example being open source libraries: you never know how users are going to compile your code.

        1. 3

          … Sure. None of that changes anything. Fact is, when you’re writing C, you’re writing for the C Abstract Machine, and pointers in the C Abstract Machine aren’t integers. You may wish it wasn’t so, but “the claim that pointers are definitely not integers is wrong” is incorrect when talking about C.

          1. 3

            you’re writing for the C Abstract Machine

            Er, no. I am writing for a concrete machine. There is no C abstract machine, and if there were you couldn’t program it, as the C specification is insufficient. Hence all the undefined and implementation-defined behavior in the standard and the explicit admonishment in the standard that the standard is, by itself, insufficient to produce a compiler that is fit for purpose.

            In addition, it is practically impossible to write standard-conforming code, at least in C++. I still boggle at one of the gentlemen on the C++ standards committee coming right out and saying that he works with the best C++ programmers in the world and none of them can write conforming code. And apparently not even entertaining the notion that maybe, just maybe, something might be wrong with said standard/spec.

            1. 2

              When you’re writing in the C language, you’re writing code which conforms to a specification. That specification is the C standard, and it is expressed in terms of the C abstract machine. Except for areas where the C standard specifically says something is implementation defined, the implementation is irrelevant.

              In some cases, the compiler you’re using might define additional behavior - for example, I think GCC is more concrete when it comes to the behavior around type punning through unions. If you’re writing to that, you’re writing in a GNU-flavored C, which is fine, but the same logic applies; you’re writing to a specification defined by the combination of the C standard and the GNU documentation. Anything not defined by those documents has no correct behavior, because there is nothing to define the behavior. And to my knowledge, GNU doesn’t define pointers in terms of the “simple and obvious pointer model”.
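
              For illustration (my example, not GNU’s documentation verbatim; float_bits is a made-up helper), the sort of union punning GCC pins down:

              #include <stdint.h>

              // In GNU C, reading a union member other than the one last
              // written reinterprets the stored bytes.
              static uint32_t float_bits(float f) {
                  union { float f; uint32_t u; } pun = { .f = f };
                  return pun.u;  // value depends on the float representation
              }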

              You may not like it. You may think either GNU or the C committee or both are completely bonkers, and that GCC isn’t fit for purpose. That’s fine. That doesn’t change anything. Code which does something which isn’t defined by the specification of the language it’s written in (be that standard C or GNU C or Microsoft’s BASIC) has no “correct” behavior, so you shouldn’t expect your intuition about what the hardware might do to be correct. You’re not writing to the hardware, you’re writing to a specification.

              Heck, this is even true for assembly. Instructions do what the specification says they do; you can think of the x86 specification as a document defining an “x86 abstract machine” which is implemented by a lot of real-world hardware. You may know that in your particular CPU, an instruction has a particular undocumented side-effect, and if you rely on that fact, you should expect your program to break on new CPUs or with updated microcode. If you write code which isn’t defined by the standard you’re writing to, you have no guarantees about what will happen, period.

              It’s true for JavaScript too. JS tries its best to define most behavior, but if your code does something where the JavaScript standard has no opinion on what should happen, you can’t be surprised when your program doesn’t do what you expect. Again, you can dislike the C standard for having much more undefined behavior than JavaScript, but that just means you don’t like C very much, which is fine.

              Now, one thing which would be cool would be a C implementation which defines a bunch of additional behavior. Instead of using standard C, one could have an implementation which defines a bunch of the behavior which is undefined in standard C. GCC lets you do this to some degree; for example, you could pass in -fno-strict-aliasing and write in a C flavor where the C standard’s aliasing rules don’t apply and any variable can alias any other variable, or you can pass -fno-strict-overflow and write in a C flavor which defines signed integer overflow. Maybe it would be possible to get GNU to add a -fno-strict-pointer-semantics flag, which would actually define pointers to work as they would in your “simple and obvious pointer model”. Who knows, maybe there are already flags to that effect. Just know that this new more well-defined language isn’t C, so the statement “the claim that pointers are definitely not integers is wrong” would continue to be wrong in the context of C.
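
              For instance, under the assumption that -fno-strict-aliasing is passed (bits_via_cast is a made-up name), this kind of cast-based punning stops being a trap:

              #include <stdint.h>

              // Undefined under ISO C's effective-type rules; in GCC's
              // -fno-strict-aliasing dialect, the compiler promises not to
              // exploit that, so the "obvious" reinterpretation is what you get.
              static uint32_t bits_via_cast(const float *f) {
                  return *(const uint32_t *)f;
              }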

              1. 1

                When you’re writing in the C language, you’re writing code which conforms to a specification. That specification is the C standard

                No, no, and no. Emphatically no. This belief demonstrates an extraordinary ignorance of the history, pragmatics, and simple reality of programming in general, and of C in particular.

                1. “Specification”

                I bought my first C compiler in 1986, Manx Aztec C for the newly released Amiga 1000, of which I got one of the first in Germany (it was an NTSC model). Considering the fact that the first C standard, ANSI C89, was ratified in 1989, and that the Vulcan Science Directorate has determined that time travel is impossible, what exactly was I doing?

                1.1 Specification, for real this time

                Leaving aside what I, and the thousands if not millions of C programmers before me, were actually doing before there was an ANSI/ISO C standard: if, as you claim, programming is writing code which conforms to the C standard, the appearance of said standard must clearly have profoundly changed what I was doing. How could it not, if that is what writing C programs is?

                Did. Not. Happen.

                Yes, some things changed. The compiler got a new version that supported function prototypes. Nice. But I used those because they were supported by the concrete compiler, not the “abstract C machine”.

                2. The concrete machine

                While your point about writing for the abstract machine is untrue in general, even in languages that have a definition/specification of the kind you imagine, it is particularly untrue for C. As I wrote before, you write for a concrete machine, to make this concrete machine execute instructions and produce a desired effect. An abstract machine cannot do this, and is thus generally useless and uninteresting for most programming tasks.

                This is the pragmatics of programming. Semantics can be of theoretical interest, but even with semantics there is no reason to prefer a denotational semantics over an operational semantics on principle. You are of course free to have a personal preference.

                That was the general case. For C, the idea of an “abstract machine” defined by the standard that you actually program against is…odd. First, that is not what the standard is for, and second, it is not something the standard provides, as it is intentionally permissive. How many bits is an int on your “abstract C machine”? So you’re going to need a concrete machine. (And yes, the program can ask, but even that has to be executed on a concrete machine for the size to actually return a value.)
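
                And indeed, asking is trivial; getting an answer requires a concrete implementation (a minimal sketch):

                #include <limits.h>
                #include <stdio.h>

                int main(void) {
                    // The standard guarantees only that int has at least 16
                    // bits; the concrete number exists only once this runs
                    // somewhere real.
                    printf("int is %zu bits here\n", sizeof(int) * CHAR_BIT);
                    return 0;
                }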

                3. “Writing”

                You make it sound as if the purpose of the C standard was to constrain (or guide) programmers into writing code that conforms to this specification. Is this actually so? Let’s see what the standard has to say about the motivation:

                The need for a single clearly defined standard had arisen in the C community due to a rapidly expanding use of the C programming language and the variety of differing translator implementations that had been and were being developed. The existence of similar but incompatible implementations was a serious problem for program developers who wished to develop code that would compile and execute as expected in several different environments.

                Foreword, Draft ANSI C Standard

                (Emphasis mine).

                So, no, the specification was there to constrain/guide “translator implementations”. Not programmers. The C standard is not really a document for programmers.

                Or have a look at the rationale document on the standard committee website (PDF).

                Existing code is important, existing implementations are not. A large body of C code exists of considerable commercial value. Every attempt has been made to ensure that the bulk of this code will be acceptable to any implementation conforming to the Standard. The C89 Committee did not want to force most programmers to modify their C programs just to have them accepted by a conforming translator.

                More precisely:

                C code can be non-portable. Although it strove to give programmers the opportunity to write truly portable programs, the C89 Committee did not want to force programmers into writing portably, to preclude the use of C as a “high-level assembler”: the ability to write machine-specific code is one of the strengths of C. It is this principle which largely motivates drawing the distinction between strictly conforming program and conforming program (§4).

                Anyway, on we go. You write:

                Anything not defined by those documents has no correct behavior, because there is nothing to define the behavior.

                This is patently untrue. There are concrete machines and concrete implementations that give an operational semantics of the behavior. And this is explicitly allowed by the C standard. Back to the rationale document:

                The terms unspecified behavior, undefined behavior, and implementation-defined behavior are used to categorize the result of writing programs whose properties the Standard does not, or cannot, completely describe. The goal of adopting this categorization is to allow a certain variety among implementations which permits quality of implementation to be an active force in the marketplace as well as to allow certain popular extensions, without removing the cachet of conformance to the Standard.

                Back to you:

                GNU doesn’t define pointers in terms of the “simple and obvious pointer model”.

                Of course it doesn’t! Machines and architectures do this, as I explained in the article. The C standard allows implementations that do not follow this model, the standard creators were, again, very explicit about allowing a wide variety of implementations. However, the vast majority of CPUs and OSes do follow this “flat (virtual) address space” model that allows pointers to be integers, and this was not a coincidence, but a deliberate move away from other memory and pointer models. And when the machine follows this model, GNU C follows this model as well.

                [you may think] that GCC isn’t fit for purpose

                That wasn’t what I wrote. At all. What I wrote is that the standard creators (not me) were also quite explicit that conformance of an implementation to the standard alone is insufficient for an implementation to be fit for purpose, and that this is by design. Whether you find any existing implementation fit for purpose or not is a different matter, but standard-conformance itself is not sufficient (you may consider it necessary). The standard was there to “codify existing practice”, and this existing practice included a wide variety of implementations and architectures, which weren’t supposed to be excluded from being standard compliant (“…removing the cachet of conformance to the Standard”) just because the architecture was different.

                The standard couldn’t define that a pointer is an integer, because that would make it impossible for compilers for architectures where pointers aren’t just integers (segmented x86, for example) to ever achieve “…the cachet of conformance to the Standard”. However, on the architectures where a pointer is an integer, sometimes a long integer (PDP-11, M68K, 386/flat, x86_64, ARM, ARM64, etc.), C doesn’t prevent it from being an integer. In fact, it comes pretty close to saying that it actually is an integer:

                Conversions that involve pointers (other than as permitted by the constraints of §3.3.16.1) shall be specified by means of an explicit cast; they have implementation-defined aspects: A pointer may be converted to an integral type. The size of integer required and the result are implementation-defined. If the space provided is not long enough, the behavior is undefined. An arbitrary integer may be converted to a pointer.

                (That’s from C89.)
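
                Later standards make the same point through the optional uintptr_t; a minimal sketch (mine, not the standard’s):

                #include <stdint.h>

                void round_trip(void) {
                    int x = 42;
                    uintptr_t addr = (uintptr_t)(void *)&x;  // implementation-defined value
                    int *p = (int *)(void *)addr;  // C99 7.18.1.4: compares equal to &x
                    *p = 43;                       // and may be used as &x
                }

                Anyway, back to you: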

                Instructions do what the specification says they do

                No, no, no. Instructions do whatever the hardware has implemented. This may and hopefully does match the specification, or rather the documentation. But if the specification and the hardware disagree, what is actually executed is what the hardware does, not what the specification says.

                Operational semantics always win. Always.

                1. 1

                  I’m all for acknowledging the state of reality. The way you phrase it, however, sounds like you might be endorsing the status quo, which is a markedly different stance.

                  When you’re writing in the C language, you’re writing code which conforms to a specification.

                  Many game developers don’t. See everything data-oriented, which cares deeply about unspecified performance characteristics borne out of the cache hierarchy, among other things.

                  Now, one thing which would be cool would be a C implementation which defines a bunch of additional behavior.

                  Precisely the thing you cannot do in many cases. Like when you’re writing an open source library. You could always say “only works with such and such flag”, but people are going to screw it up at some point.

                  1. 2

                    Right, sorry. It wasn’t my intention to endorse the status quo. While I understand the need for UB, and like compiler optimizations, I spend way too much time being a language lawyer, or listening to other language lawyers in ##C (or ##C++-general), to figure out how to accomplish an absolutely trivial thing in a legal way. Most recently, I had a loong discussion about how one would stack-allocate a struct with a flexible array member (the shape of the problem is sketched below).
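
                    For the curious, a sketch with made-up names; whether this is strictly legal is precisely what the discussion was about:

                    #include <stddef.h>

                    struct msg { size_t len; char data[]; };  // flexible array member

                    void demo(void) {
                        // Borrow aligned stack storage from a union. Whether the
                        // standard actually blesses touching m.data this way is
                        // the debated part.
                        union {
                            struct msg m;
                            char storage[sizeof(struct msg) + 16];
                        } u;
                        u.m.len = 16;
                        u.m.data[0] = 'x';
                    }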

                    I don’t like the current state of affairs, but they are what they are.

                    Many game developers don’t. See everything data-oriented, which cares deeply about unspecified performance characteristics borne out of the cache hierarchy, among other things.

                    That’s not really the same. Data-oriented design is still writing to the specification; the code will have the same result on all implementations. It’s just that some (well, most modern) CPUs happen to run the code faster with certain access patterns.

                    Precisely the thing you cannot do in many cases. Like when you’re writing an open source library. You could always say “only works with such and such flag”, but people are going to screw it up at some point.

                    I think it would be possible, through a great deal of effort. It’s a branding challenge more than a technical challenge.

                    You can’t just document that your library requires certain flags, for the reasons you outline. However, we could start a monumental task of creating a new language, say “Defined C”, in which a lot of currently-undefined behavior gets defined, and then we can add specific support for “Defined C” to a bunch of different compilers.

                    After years of rolling out new compiler versions and an immense branding effort, with a technical specification which extends C, a visually appealing website and blog posts which work to convince people why the current state of C is untenable for most people, libraries could start saying that they aren’t C libraries, but libraries written for and in the language “Defined C”.

                    It would be more akin to popularizing a new language than adding compiler switches. I’m not gonna do it, but I think it’s possible. Especially since a library written in “Defined C” would still be callable from standard C, assuming it kept all non-ISO-standard code in source files and not in headers.

        2. 5

          Every time UB and optimizations are discussed, the proposal is to just not exploit it, just not do the weird things.

          But then C programmers use gcc and clang, and not tcc. They look at assembly output and complain when it does “unnecessary” things.

          There just doesn’t seem to be an actual market for poorly optimizing compilers that target the PDP-11 machine model rather than a whacky OoOE-with-SIMD model.

          1. 1

            C programmers use gcc and clang, and not tcc

            I would love to use tcc. Alas, it doesn’t support Objective-C, doesn’t support Mach-O, is no longer effectively maintained, and had its origins as a winner of the International Obfuscated C Code Contest…

            Xcode doesn’t come with tcc; it comes with clang, as does the NDK. And of course there is no actual “market” to speak of: there are two “free” compilers that ensure there cannot be an actual market.

            1. 1

              doesn’t support Mach-O

              I’m not on a Mac anymore but it seems you can at least emit Mach-O: https://lists.nongnu.org/archive/html/tinycc-devel/2020-06/msg00010.html

              is no longer effectively maintained

              Even though the last tagged release is about two years old, there is quite a bit of activity in the repository: https://repo.or.cz/git-browser/by-commit.html?r=tinycc.git (Attention: website needs JavaScript)

              there are two “free” compilers that ensure there cannot be an actual market

              Mostly agree. But there still are niches like avionics and embedded.

              1. [Comment removed by author]