Programmers have a long and rich history with C, and that history has taught us many lessons. The chief lesson from that history must surely be that human beings, demonstrably, cannot write C code which is reliably safe over time. So I hope nobody says C is simple! It’s akin to assembly, appropriate as a compilation target, not as an implementation language except in extreme circumstances.
Which human beings?
Did history also teach us that operating a scalpel on human flesh cannot be done reliably safely over time?
Perhaps the lesson is that the barrier of entry for an engineering job was way higher 40 years ago. If you admitted surgeons to a hospital after a “become a gut-slicer in four weeks” program, I don’t think I need to detail what the result would be.
There’s nothing wrong with C, just like there’s nothing wrong with a scalpel. We might have more appropriate tools for some of its typical applications, but C is still a proven, useful tool.
Those who think their security woes will be solved by a gimmick such as changing programming language are in for a very unpleasant surprise.
Perhaps the lesson is that the barrier of entry for an engineering job was way higher 40 years ago
Given the number of memory safety bugs that have been found in 40-year-old code, I doubt it. The late ‘90s and early 2000s exposed a load of these bugs because this C code written by skilled engineers was exposed to a network full of malicious individuals for the first time. In the CHERI project, we’ve found memory safety bugs in code going back to the original UNIX releases. The idea that there was some mythical time in the past when programmers were real men who never introduced security bugs is just plain wrong. It’s also a weird attitude: a good workman doesn’t blame his tools, because a good workman chooses good tools. Given a choice between a tool that can be easily operated to produce good results and one that, if used incredibly carefully, might achieve the same results, it’s not a sign of a good engineer to choose the latter.
Given the number of memory safety bugs that have been found in 40-year-old code, I doubt it.
Back then, the C programmers didn’t know about memory safety bugs and the kinds of vulnerabilities we have seen over the past two decades. Similarly, JavaScript and HTML are surely two languages which are somewhat easier to write than C and don’t suffer from the same class of vulnerabilities. Yet 20 years ago people wrote code in these two languages that suffered from XSS and other web-based vulns. Heck, XSS and SQLi are still a thing nowadays.
What I like about C is that it forces the programmer to understand the OS below. Writing C without knowing about memory management, file descriptors, and processes is doomed to fail. And this is what I miss today, and maybe what @pm hinted at in their comment. I conduct job interviews with people who consider themselves senior, and they only know the language and have little knowledge about the environment they’re working in.
Yes, and what we have now is a vast trove of projects written by very smart programmers, who do know the OS (and frequently work on it), and do know how CPUs work, and do know about memory safety problems, and yet still cannot avoid writing code that has bugs in it, and those bugs are subsequently exploitable.
Knowing how the hardware, OS (kernel and userspace), and programming language work is critical for safety; without that knowledge you will screw up immediately, rather than eventually.
People fail to understand that the prevalence of C/C++ and other memory-unsafe languages has a massive performance cost. ASLR and stack and heap canaries in software, and then PAC, CFI, MTE, and the like in hardware, all carry huge performance costs on modern hardware, and all are necessary solely because the platform has to mitigate the terrible safety of the code being run. That’s now all sunk cost, of course: if you magically shifted all code today to something memory safe, the ASLR and canary costs would still be there by default; if you were super confident, your OS could turn ASLR off and you could compile canary-free, but the underlying hardware is permanently stuck with those costs.
Forcing the programmer to understand the OS below could (and can) happen in languages other than C. The main reason it doesn’t happen is that OS APIs, while powerful, are also sharp objects that are easy to get wrong (I’ve fixed bugs in Janet at the OS/API level, so I have a little experience there), so many higher-level languages end up with wrappers that help encode assumptions that must not be violated.
But a lot of those low-level functions are simply the bottom layer for userland code, rather than being The Best Possible Solution as such.
Not to say that low-level APIs are necessarily bad, but given the stability requirements, they accumulate cruft.
The programmer and project that I have sometimes used as a point of comparison is more recent. I’m now about the same age that Richard Hipp was when he was doing his early work on SQLite. I admire him for writing SQLite from scratch in very portable C; the “from scratch” part enabled him to make it public domain, thus eliminating all (or at least most) legal barriers to adoption. And as I mentioned, it’s very portable, certainly more portable than Rust at this point (my current main open-source project is in Rust), though I suppose C++ comes pretty close.
Do you have any data on memory safety bugs in SQLite? I especially wonder how prone it was to memory safety bugs before TH3 was developed.
Did history also teach us that operating a scalpel on human flesh cannot be done reliably safely over time?
I think it did. It’s just that the alternative (not doing it) is generally much, much worse.
There’s nothing wrong with C, just like there’s nothing wrong with a scalpel.
There is no alternative to the scalpel (well, except there is in many circumstances and we do use them). But there can be alternatives to C. And I say that as someone who chose to write a new cryptographic library 5 years ago in C, because that was the only way I could achieve the portability I wanted.
C does have quite a few problems, many of which could be solved with a pre-processor similar to CFront. The grammar isn’t truly context free, and the syntax has a number of quirks we have since learned to steer clear of. switch falls through by default. Macros are textual instead of acting at the AST level. Everything is mutable by default. It is all too easy to read uninitialised memory. Cleanup could use some more automation, either with defer or destructors. Not sure about generics, but we need easy-to-use ones. There is enough undefined behaviour that we have to treat compilers like sentient adversaries now.
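Two of those foot-guns fit in one short function; a made-up sketch, not from any real codebase:
int classify(int n) {
    int label;              /* never initialised */
    switch (n) {
    case 0:
        label = 0;          /* no break: falls through into case 1 */
    case 1:
        label = 1;
        break;
    default:
        if (n > 100)
            label = 2;
    }
    return label;           /* classify(0) == 1, and classify(-5) reads
                               uninitialised memory: undefined behaviour */
}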
When used very carefully, with a stellar test suite and sanitisers all over the place, C is good enough for many things. It’s also the best I have in some circumstances. But it’s far from the end game even on its own turf. We can do better.
And I say that as someone who chose to write a new cryptographic library 5 years ago in C, because that was the only way I could achieve the portability I wanted.
I was wondering why the repo owner seemed so familiar!
Those who think their security woes will be solved by a gimmick such as changing programming language are in for a very unpleasant surprise.
I don’t think that moving from a language that e.g. permits arbitrary pointer arithmetic, or memory copy operations without bounds checking, to a language that disallows these things by construction, can be reasonably characterized as a gimmick.
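For what it’s worth, here is the kind of thing C accepts by construction; buf and the message are made up, and a bounds-checked language would reject the equivalent at compile time or trap at run time:
#include <string.h>

int main(void) {
    char buf[8];
    const char *msg = "way more than eight bytes";
    memcpy(buf, msg, strlen(msg) + 1);  /* out-of-bounds write: no compile
                                           error, no run-time check */
    return buf[0];
}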
There’s nothing wrong with C, just like there’s nothing wrong with a scalpel.
This isn’t a great analogy, but let’s roll with it. I think it’s uncontroversial to say that neither C nor scalpels can be used at a macro scale without significant (and avoidable) negative outcomes. I don’t know if that means there is something wrong with them, but I do know that it means nobody should be reaching for them as a general or default way to solve a given problem. Relatively few problems of the human body demand a scalpel; relatively few problems in computation demand C.
That’s a poor analogy. Early on, what we would consider “modern” surgery had a low success rate, and a high straight-up fatality rate.
If we are super generous, let’s say C is a scalpel. In that case we can look at the past and see that a great many deaths were caused by people using a plain scalpel long after it was established that there was a significant difference in morbidity between a scalpel and a sterilized scalpel.
What we have currently is a world where we have C (and similar), which works significantly better than all the tools that preceded it, but is also very clearly less safe than any modern safe language.
Lots of the basic ones seem like they fall out of a simple implementation. They usually act as you would expect if you implemented the least effort compiler you could.
e.g., the example about scoping:
int f() {
    int x = 3;
    {
        extern int x;
        return x;
    }
}
Works exactly as you would expect once you realize the innermost extern int x isn’t picking up the outer x, but is declaring a new x that should be filled in by the linker. Scoping is acting as you expect. Extern just means “fill in at link time from a global declared elsewhere”. Remove the global x, declare it in a different file, and you get the same behavior.
The GCC extensions are very poorly thought out, though. It seems like many might have come about as a quick hack without thinking about the implications of actual use, or how they would behave on other compilers.
I’m only slightly joking here, but: it’s likely that you could just write a “compiler” that, no matter what the input, always outputs the same do-nothing-and-exit binary. Since, after all, the standard interpretation is that undefined behavior allows the compiler to do anything, and it’s basically impossible to write a non-trivial C program (or even many trivial ones – lots of stdlib stuff is technically UB because of memory allocations) without UB.
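Real compilers already reason in exactly that spirit, just less drastically. A classic example: signed overflow is undefined, so the optimizer is free to assume it never happens:
int always_true(int x) {
    return x + 1 > x;   /* UB when x == INT_MAX, so an optimizing
                           compiler may fold this to 'return 1' */
}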
If you’re going on technicalities, the stdlib is technically a part of the C standard, and thus any undefined behavior in the implementation of those functions is “below the level of the spec”, in a manner of speaking. There’s no UB, because the library is defined in the standard.
Now, if you use POSIX libraries… well, POSIX C is a superset of ISO C, defining things like the conversion between function pointers and void pointers, or the size of a byte.
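dlsym() is the usual example: it returns a void *, and POSIX requires that casting the result to a function pointer works, even though ISO C leaves that conversion undefined. A minimal sketch (“libm.so.6” and “cos” are only for illustration):
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (!handle)
        return 1;
    /* void * to function pointer: not defined by ISO C, required by POSIX */
    double (*cosine)(double) = (double (*)(double))dlsym(handle, "cos");
    printf("%f\n", cosine(0.0));    /* prints 1.000000 */
    return 0;
}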
This is where we end up in one of those Stack Overflow posts where someone asks if some standard library function allocates memory, and the answer is “the standard doesn’t explicitly state that it has to, but implementing it in a conformant way basically can’t be done without allocating”.
There are only three functions I know of in standard C that allocate memory: malloc(), calloc() and realloc(). What standard functions have UB?
I said “stdlib”, not “standard C”. A stupendous number of functions in typical implementations have UB.
Well, stdlib IS standard C. So you’re saying that many implementations of the C standard library have UB?
That’s what they’re saying.
One example is memmove(3): it somehow needs to test whether the input and output buffers overlap, or at least which buffer is located before the other. Problem is, comparing two pointers is undefined if they do not point to the same object (or one slot after it). You can compare two pointers pointing to various locations of the same buffer, but for two unrelated buffers that’s illegal (probably because of segmented memory models like we had with DOS). And it’s unclear (at least under C11, they may have fixed it later) whether converting the pointers to an integer type first makes it actually legal. I know the TIS interpreter doesn’t like it.
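A naive implementation makes the problem concrete; this is just a sketch, not any real libc’s code:
#include <stddef.h>

void *my_memmove(void *dst, const void *src, size_t n) {
    char *d = dst;
    const char *s = src;
    if (d < s) {                    /* undefined if dst and src point
                                       into unrelated objects */
        while (n--) *d++ = *s++;    /* copy forwards */
    } else {
        d += n;
        s += n;
        while (n--) *--d = *--s;    /* copy backwards, safe for overlap */
    }
    return dst;
}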
Well, as orib stated, any undefined behavior in stdlib is “below the level of the spec”, because there are a few functions, such as memmove() as mentioned, offsetof(), and even setjmp()/longjmp(), that can’t be implemented in standard C. As P. J. Plauger says in The Standard C Library:
That leaves the macro offsetof. You use it to determine the offset in bytes of a structure member from the start of the structure. Standard C defines no portable way to write this macro. Each implementation, however, must have some nonstandard way to implement it. An implementation may, for example, reliably evaluate some expression whose behavior is undefined in the C Standard.
You can look on offsetof as a portable way to perform a nonportable operation. That is true of many macros and type definitions in the Standard C library.
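A traditional definition, and presumably the kind of trick Plauger means, pretends a struct lives at address zero and takes the address of one of its members; undefined per the standard, but each implementation knows its own compiler evaluates it reliably. A minimal sketch (my_offsetof and struct header are made-up names, to avoid clashing with <stddef.h>):
#include <stddef.h>   /* size_t */

/* undefined behaviour in portable C, reliable on the
   implementation that ships it */
#define my_offsetof(type, member) ((size_t)&(((type *)0)->member))

struct header { int tag; char name[16]; };
/* my_offsetof(struct header, name) is typically 4 here */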
Correct, using stdlib is not undefined. Still, if implementing it requires undefined behaviour, that feels unclean. I like to keep my language and its standard library separate. If something in the standard library can’t be implemented in terms of the language, expand the language until it can.
You could argue it’s all semantics, but this separation keeps us honest. If something in the stdlib can’t actually be implemented in terms of the language alone, then I’m lying about the actual size and complexity of my language: what looks like a part of the standard library is actually part of the language. And I like my languages small.
Yes!
Even a basic “hello world” that calls printf() probably involves UB if you dig into how your printf is implemented.
Compiler authors have pushed for the most expansive possible definitions of UB, and as a result it is everywhere. Hence my “joking but not really” about how you can just assume any given C program contains UB and “optimize” it away to a no-op.
I believe C has a reputation of being simple because it’s usually compared to C++, which is a few orders of magnitude more complex.
The section of the C standard that covers the language description has 11 sub-sections and is over 130 pages long. Doesn’t sound simple to me.
Our industry (and English in general) uses “simple” in different senses that make it an ambiguous, nearly useless term:
“Simple” could mean easy to use, but making something easy to use may be a very difficult task requiring great implementation effort and complex methods.
“Simple” could mean consisting of very few parts, primitive. Things that are simple in this way may be tedious to use or not sophisticated enough to tackle complex advanced tasks.
And then “C” is not a single thing either, so it’s simple (in various meanings) and complex at the same time, depending on how you look at it.
The PDP-11 compiler was very basic, but GCC and LLVM are millions of lines of complex code that took decades to develop. The basics of “C” that students learn are only scratching the surface of the full specification, and the spec covers only a fragment of the complexity of real-world C software projects, including the platforms and tooling they have to deal with.
Yikes.
And it’s still dominant for turning up new platforms, just because existing compilers are so easy?
If you create a new operating system and there is no C/C++ compiler available, then the majority of free/open source software can’t be compiled and run, because C/C++ are the foundation for everything we use. You will have no web browser, no shell, no Python, no Rust, no JavaScript, no Ruby, and so on. C is also the foundation for interoperability between programming languages. If Haskell code wants to interoperate with Rust code at the function call level, it is done via C.
Simple is not necessarily the same as easy to read or understand. Brainfuck is very simple indeed, but I wouldn’t claim it’s easy…