This feels like something you learn on day 1 (week 1?) as a C programmer, assuming you compile with all warnings enabled, when you try to printf a size_t value:
warning: format '%d' expects argument of type 'int', but argument has type 'long unsigned int'
Most people just cast it to int to make the compiler shut up. Teachers don’t really stress the importance of using the right types for things (and a lot of them do use %d to print size_t, grumble grumble), and students just want to get it working ASAP. It’s really an issue with how we teach.
I was so annoyed by the premise (most people do not even have a computer where sizeof size_t == sizeof int, it’s 2023, come on!) that I just hid this article, but then it’s a very valid rant about crazy arithmetic in C. Rust is already worth using just for fixing this.
I generally get annoyed when someone randomly throws out Rust as a substitute for everything that is seemingly wrong with C. I believe each language has its place in the overall scheme of things, be it C or Rust or C++ or Haskell. Trying to solve every problem with Rust may not be the ideal approach, and it’s not one I’d be willing to espouse.
I posted this thread on a separate discussion, but I think it is worth pondering over: https://marc.info/?l=openbsd-misc&m=151233345723889&w=2 (the entire discussion is worth reading, IMO).
If one can set aside that it’s just Theo doing his garden variety shouting at things he didn’t make, the actual things he’s shouting are also not especially true or are now out of date. That’s when I realised this mail is from six years ago.
Lots of languages have a place in the future, but it is not true that each language does. Some technologies become outdated for real reasons, and people eventually stop using them.
I don’t know why you would start a new project in a language with such confusion about types and with so few safety features as C in 2023, unless you had an especially baroque requirement. Even then, it might be worth spending some of your budget trying to bend Rust, or something like Rust, to fit anyway. You’re going to pay a lot of hidden costs in using C, and you won’t get to choose when those costs come due; they’ll take the form of critical memory-safety and thread-safety bugs down the line, often after deployment and to the detriment of economic or other value.
I thought the ILP64 model was a bit more common, but the only user Wikipedia knows of is the HAL Computer Systems port of Solaris to SPARC64.
The article brings up an important issue, but this conclusion is over-reaching and isn’t justified:
Therefore, we must conclude that size_t is the only correct type for all sizes, lengths, object counts, and indices/subscripts.
The problem is that to avoid casts all over the place, a C codebase has to make one of two choices:
Use signed as much as possible
Use unsigned as much as possible (and of course size_t is unsigned)
Note that you can’t really be 100% in either direction because the standard library and language are inconsistent, but it’s best to lean one way or the other.
Casts should be minimized: they’re often about “shutting up” a warning, don’t increase correctness, and can CAUSE security issues.
The second policy has the disadvantage that it can’t represent negative numbers, which are useful!
So Google chooses the first discipline, which pretty directly contradicts this part of the blog post. I believe the Google advice is good, and in practice it leads to more secure code.
The author is talking about corner cases and doesn’t present measurements of real code, or experience over a long time period. Simply using size_t everywhere doesn’t reduce those corner cases globally.
I’m not saying the “unsigned discipline” is necessarily worse (I’ve seen arguments both ways), only that the post doesn’t justify it.
A philosophical mistake in the post is thinking that using C or C++ types in a certain way will make your program correct in a certain sense, the way Haskell types can. They simply won’t. You have to test your code and instrument it dynamically. That’s the state of the art.
In other words, there is no C or C++ codebase that simply uses types in a certain way and is therefore free of integer overflow and buffer overflow.
The language fundamentally can’t express those notions statically. (Rust obviously does way better, but it can’t represent lack of integer overflow statically either. Not sure if they have semantics to prevent integer overflow from turning into buffer overflow?)
Stephen Kell has a good paper that explains the historical difference between C types and ML-style types:
https://www.cs.tufts.edu/~nr/cs257/archive/stephen-kell/in-search-of-types.pdf
C types were for instruction selection, not correctness (explains why enums are ints, arrays decay to pointers, why the default return type is int, etc.). C++ inherits a lot of that legacy too.
The Google style guide says to use runtime assertions instead; those, along with other dynamic techniques like unit testing, sanitizers, and fuzzing, are what work in practice to reduce this class of bugs:
https://google.github.io/styleguide/cppguide.html#Integer_Types
Of the built-in C++ integer types, the only one used is int. If a program needs a variable of a different size, use a precise-width integer type from <cstdint>, such as int16_t. If your variable represents a value that could ever be greater than or equal to 2^31 (2GiB), use a 64-bit type such as int64_t. Keep in mind that even if your value won’t ever be too large for an int, it may be used in intermediate calculations which may require a larger type. When in doubt, choose a larger type.
The standard library header <cstdint> defines types like int16_t, uint32_t, int64_t, etc. You should always use those in preference to short, unsigned long long and the like, when you need a guarantee on the size of an integer. Of the C integer types, only int should be used. When appropriate, you are welcome to use standard types like size_t and ptrdiff_t.
We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things. You should assume that an int is at least 32 bits, but don’t assume that it has more than 32 bits. If you need a 64-bit integer type, use int64_t or uint64_t.
For integers we know can be “big”, use int64_t.
You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.
If your code is a container that returns a size, be sure to use a type that will accommodate any possible usage of your container. When in doubt, use a larger type rather than a smaller type.
Use care when converting integer types. Integer conversions and promotions can cause undefined behavior, leading to security bugs and other problems.
Part of the problem is that C (and therefore C++) seems hell-bent on not having a bunch of core types (size_t, ptrdiff_t, intptr_t, int32_t, int64_t, etc.) be part of the language proper, instead requiring headers. Because of that, when people try to avoid including essentially unrelated headers in other headers, you get a scattershot approach to what the correct type is.