I mean. It’s not like these optimizing compilers were something programmers didn’t ask for. We chose to use aggressively optimizing compilers and eschew ones that focused more on being simple and predictable.
The problem is not the eagerness and aggression with which the compiler optimizes. The problem regarding undefined behavior is shaped by the gap between the logic the programmer intends to express and the logic that the compiler understands, which is vast in the case of C. It is this gap which causes confusion and surprises, because the programmer clearly intended one behavior, but the compiler was smart enough to detect a crack in that logic and optimized the whole thing out. It is a problem of communication and expressiveness of the language and it is possible to write languages and compilers that optimize aggressively without becoming adversarial logic corruption machines.
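To make the kind of surprise being described concrete, here is a minimal sketch (my own illustration, not from the thread): the programmer writes an overflow guard that reads sensibly on real two’s complement hardware, but because signed overflow is undefined behavior, an optimizer is entitled to assume the condition can never hold and delete the guard entirely.

    #include <limits.h>
    #include <stdio.h>

    /* Intended logic: saturate instead of overflowing.
     * Because signed overflow is UB, a compiler at -O2 may assume
     * `x + 1 < x` is always false for signed x and reduce this whole
     * function to `return x + 1;`, silently dropping the guard. */
    int increment_saturating(int x) {
        if (x + 1 < x)        /* relies on wraparound, which is UB */
            return INT_MAX;   /* intended saturation path */
        return x + 1;
    }

    int main(void) {
        printf("%d\n", increment_saturating(INT_MAX));
        return 0;
    }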
Just a note on UB…
As a C and C++ programmer, UB is something I rarely think about during day-to-day development. Maybe it’s just decades of mental memory of what to avoid. Or maybe I’ve written code that uses UB without knowing? I don’t think I’ve ever read a comment in a C or C++ code base where someone indicated that some segment of code invokes UB but that it was their only option. So I think I’m not unusual in this way. Maybe such comments exist in compilers, though.
I feel like UB is thrown around as some scary thing on HN and this site but it’s talked about much less on the C and C++ subreddits.
In my experience people don’t generally realize that they’re invoking UB. I primarily work with C and C++, and often the first thing I’ll do when I start on a new project is build and run any unit tests with -fsanitize=undefined. It pretty much always finds something, unless the people working on the system before me were already doing something similar, and people are often surprised.

I mean, yes, it is ultimately a language problem that C UB is so broadly defined. That said, you actually do want compilers to optimize out checks that they can prove aren’t necessary, and in order to do that they need rules for what you can and can’t do. Unsafe Rust is eventually going to run into a lot of these same issues as the model for what’s okay and what isn’t gets more and more sophisticated and the compiler learns to take better advantage of it.
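As a rough sketch of that workflow (the file name and code are my own illustration, not from the comment), compile the tests with UBSan enabled and run them; anything like the shift below gets flagged at runtime:

    /* shift_demo.c: a tiny example of UB that -fsanitize=undefined
     * catches at runtime, namely a shift amount >= the width of the type. */
    #include <stdio.h>

    static int shift(int value, int amount) {
        return value << amount;   /* UB when amount >= bit width of int */
    }

    int main(void) {
        printf("%d\n", shift(1, 40));
        return 0;
    }

    /* Build and run, e.g.:
     *   cc -g -fsanitize=undefined shift_demo.c -o shift_demo && ./shift_demo
     * UBSan reports a runtime error pointing at the invalid shift. */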
I was actually just saying this to someone today: it certainly seems true that 99% of C code written doesn’t actually “properly” target the abstract machine as defined by the ISO C specification. But it has also long occurred to me that the reason the ISO C committee came up with the “abstract machine” idea is probably that it was the only obvious way forward in the face of the fundamental challenge of trying to standardise a language which tends to be used in fundamentally platform- and architecture-specific ways. In this regard, though, the abstract machine could maybe be thought of as a kind of “strawman” from the very beginning: you essentially can’t standardise C as it’s actually used, so they defined something sufficiently abstract that it could be standardised, and then standardised that (which didn’t particularly accord with any reality of how C is used, then or now).
The fact that compiler writers are now using the ISO C abstract machine as licence to do interesting things to code which triggers UB, when that code may even predate ISO C, is rather ironic. But it seems pretty clear to me that the fundamental issue here is the limited utility of trying to standardise a language like C in this way. The abstract machine was probably the only reasonable solution, but it doesn’t actually work (from a human-systems perspective), as demonstrated by the vast majority of C written today going beyond its boundaries. One might question whether there is even any real utility in a standard that nobody is truly willing to target for nontrivial programs.
I have to wonder if a better option would be to define the portable parts of C and change all (or almost all) instances of UB in the spec to “implementation-defined”, then specify that the C standard cannot be used alone but must be combined with some kind of platform-specific profile specification. One such profile could cover what we today think of as “normal” platforms and guarantee things like two’s complement arithmetic, while still allowing the vast majority (by userbase) of platforms people want to target to be supported.
ANS Forth did just that, with documentation requirements from both the compiler and the code being compiled.
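For a sense of what such a profile would buy, here is a small sketch (my own illustration; the function name is made up): today, code that wants defined wraparound routes signed arithmetic through unsigned types, trading undefined behavior for implementation-defined behavior; a profile guaranteeing two’s complement would simply promise the result that every mainstream implementation already gives.

    #include <stdint.h>

    /* Plain `a + b` overflowing int32_t is undefined behavior. Unsigned
     * arithmetic wraps by definition, and the conversion back to int32_t
     * is implementation-defined rather than undefined; on two's complement
     * implementations it yields the wrapped value. A "normal platform"
     * profile would guarantee exactly that outcome for the plain signed
     * expression. */
    static int32_t wrapping_add(int32_t a, int32_t b) {
        return (int32_t)((uint32_t)a + (uint32_t)b);
    }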
Considering I see people go, “oh yeah, my C program is very portable”, and then have it use OpenFile/open instead of fopen, or use custom arena allocators that assume kernel memory-mapping characteristics, I really can’t take it seriously when people claim it. Even little things like C99 have an MSVC-sized truck driven through them (there goes long long), let alone things like the “EBCDIC WebAssembly” kind of hosts I work on.

The money quote:

The C abstract machine is not a carefully invented thing that people then built implementations of, an end in and of itself; it started out as a neutral explanation and justification of how actual existing C things behaved, a means to an end.
Unfortunately, we’re stuck with the nazi-strict readings of the spec and resulting rationalizations that compiler writers clung to in order to eke out more performance, and this effectively made C much worse. C as it is practiced by the major compilers today definitely deserves its reputation as an insecure language that only a language lawyer would be able to wrangle correctly into doing what they want.
Another quote, with a good link:
The ++ and -- operators, with their prefix and postfix variants, were created as exact equivalents of addressing modes available on the PDP11 that C originally compiled to. (Or was it a PDP7? I forget.)
The PDP 7 was an 18 bit machine, and not byte addressable. It typically used 6 bit character codes, so you could pack 3 into a word. The first Unics was written in assembly language for this machine. The B language was also implemented (predecessor of C). B was typeless: all values were 18 bit words. The B language had pre and post increment (so this wasn’t copied from the PDP-11).
When they reimplemented Unics on the PDP 11 and renamed it Unix, the B language was ported, but wasn’t a good match to the CPU architecture. The PDP 11 was 16 bit, it was byte addressable, and it used ASCII. Strings were represented as 1 character per byte, and bytes had addresses. So they needed types to distinguish 8 and 16 bit values. A new language was born, called C.
The PDP11 has 8 addressing modes (encoded as 3 bits within an instruction), and several of them are auto-increment and auto-decrement modes: the register is used as an address and then incremented, or decremented and then used. The pre and post increment operators inherited from B mapped naturally onto these addressing modes.
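To make that mapping concrete, here is a small illustrative snippet of my own (not from the thread): the classic pointer idioms *p++ (use, then advance) and *--p (back up, then use) have exactly the shape of those two modes.

    #include <stdio.h>

    int main(void) {
        char src[] = "PDP-11";
        char dst[sizeof src];
        char *s = src, *d = dst;

        /* *p++ : dereference, then advance; the shape of auto-increment mode */
        while ((*d++ = *s++) != '\0')
            ;

        /* *--p : back up, then dereference; the shape of auto-decrement mode.
         * This prints the copied string in reverse. */
        for (char *p = dst + sizeof dst - 1; p != dst; )
            putchar(*--p);
        putchar('\n');

        return 0;
    }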
At the time, systems programming languages were tightly tied to specific CPU architectures. There was too much variance between CPU architectures for it to be otherwise. A systems programmer was expected to know assembly language, and a high level systems language would closely reflect features of the CPU instruction set. You would write code with some expectation of what assembly language would be generated. The PDP-11 did influence C. One example is the string representation and the types char and char*.

Here’s a PDP11-ism. The char type was originally defined as an 8 bit signed integer because when you load a byte from memory into a 16 bit register on the PDP-11, it is sign extended, and emitting an additional instruction to clear the high order 8 bits was considered too expensive.

Some later ports of C implemented char as unsigned, because that was cheap on those architectures.

The standards committee dealt with this by allowing char to be either signed or unsigned, and then added two new types, both distinct from char: these were signed char and unsigned char.
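A small sketch of where that history leaves us (my own example, not from the comment): whether plain char is signed is implementation-defined, and char, signed char, and unsigned char are three distinct types even though plain char shares a representation with one of the other two.

    #include <limits.h>
    #include <stdio.h>

    /* This _Generic only compiles because char, signed char, and
     * unsigned char are three distinct types. */
    #define TYPE_NAME(x) _Generic((x),          \
        char:          "char",                  \
        signed char:   "signed char",           \
        unsigned char: "unsigned char")

    int main(void) {
        /* CHAR_MIN is 0 on implementations where plain char is unsigned. */
        printf("plain char is %s here\n", CHAR_MIN < 0 ? "signed" : "unsigned");

        char c = 'x';
        signed char sc = 'x';
        unsigned char uc = 'x';
        printf("%s / %s / %s\n", TYPE_NAME(c), TYPE_NAME(sc), TYPE_NAME(uc));
        return 0;
    }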
I think that’s an urban legend.
p++ and --p look pretty close to the autoincrement and autodecrement addressing modes: https://archive.org/details/bitsavers_decpdp11hak1979_22549786/page/n39/mode/2up
There are also “deferred” variants which work on pointers.
Based on this it is plausible to me that p++ and --p were added to C because they mapped directly to these addressing modes, and that ++p and p-- were added to round things out.

“Thompson went a step further by inventing the ++ and -- operators, which increment or decrement; their prefix or postfix position determines whether the alteration occurs before or after noting the value of the operand. They were not in the earliest versions of B, but appeared along the way. People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11, on which C and Unix first became popular. This is historically impossible, inasmuch as there was no PDP-11 when B was developed. The PDP-7, however, did have a few “auto-increment” memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own. Indeed, the auto-increment cells were not used directly in implementation of the operators, and a stronger motivation for the innovation was probably his observation that the translation of ++x was smaller than that of x=x+1.”
The development of the C programming language, Dennis Ritchie - https://dl.acm.org/doi/10.1145/234286.1057834
This is good, but the problem is that Ritchie is just guessing about how Thompson was inspired to invent the ++ and -- operators. The PDP-7 and PDP-11 were not the only computers to have auto-increment and auto-decrement addressing modes; it was a common feature. Thompson worked on the Multics project before creating “Unics” on the PDP-7 (later renamed to Unix). Multics originally ran on the GE-645 computer. This was a 36 bit computer with auto-increment and auto-decrement addressing modes.