where even though everything compiled correctly, it didn’t work. For a compiled language, that is not something you expect
I would never expect that of C++! Maybe Haskell or Rust, but not C++.
C++ has a ton of holes derived from C, and a whole bunch of new features that can be confused. Here are a couple of surprises I encountered in semi-automatically translating Oil to C++ (even though I’ve been using both C and C++ for decades):
bools and pointers can be easily confused. If you have void f(bool b) then the call f(&myint) is valid. Ditto for return values. I think this same issue caused a severe GMail outage a long time ago.
A base class and an inherited class can both have a field x, which is pretty much never what you want:
This is in addition to all the “usual” ones like scope/shadowing, uninitialized variables (usually warned, but not always), leaving off braces like goto fail, unexpected wrapping with small integer types, signed/unsigned problems, dangling pointers, buffer overflows, use after free, etc.
string_view is nice but it’s also a “pointer” and can dangle. Those are all reasons that code may not work when it compiles.
I think leaning on the compiler too much in C++ gives diminishing returns. It encourages a style that bloats your compile times while providing limited guarantees. With C that’s even more true since I consider it more of a dynamic language (e.g. void* is idiomatic).
Historically, C was even more dynamically typed than it is now. Types were only for instruction selection, e.g. a + b for 2 ints generated different code than for 2 floats. That’s about it. You didn’t have to declare function return types or parameter types – they were assumed to be int. Reading the old Thompson and Ritchie code really underscores this.
C++ has more of the philosophy of types for correctness, but it was constrained by compatibility with C in many cases. It comes from a totally different tradition and mindset than say Haskell or ML.
I would never expect that of C++! Maybe Haskell or Rust, but not C++.
I am somewhat hesitant to say this about Rust or Haskell, even in jest - it’s at best an aspirational aphorism about code in these languages, and if you’re trying to think seriously about program correctness, it matters that it’s very possible to write code in Rust or Haskell that compiles but is not correct (for some definition of correct). If you want to write code that you can prove is correct at compile time, that’s a noble goal, and you need more sophisticated tools for it than the ones Haskell or Rust give you.
Such generalizations are never true in the absolute sense, but there is a noticeable difference in how often programs are correct the first time they compile in Rust vs. less strict languages.
Rust does a remarkable job eliminating “boring” language-level problems, like unexpected nulls, unhandled errors, use-after-free, and unsynchronized data shared between threads. Most of the time, these things just work in Rust on the first try. In C++, kinda maybe, if you’re programming with NASA-level diligence, but typically I’d expect compiling in C++ to be just the first step before testing and debugging to weed these problems out.
I don’t think it’s binary so much as that the language’s guarantees on compile-time safety, through things like the type system or borrow checker, make it more likely that if it compiles, it’s correct.
Yeah honestly I don’t really believe in that whole philosophy – I feel like it leads you into a Turing tarpit of a type system. There are plenty of other engineering tools besides type systems that need more love.
But I think that refactoring can be quite safe in strongly typed languages, and that’s useful. Writing new code isn’t really safe, because you don’t know what you want yet and you can have logic bugs. But refactoring can be, and that’s what the article is about.
It is true with regard to a property called parametricity. On an intuitive level, it states that type parameters are used as expected. So a function map :: (a -> b) -> [a] -> [b] must satisfy that each element of [b] is the image under f of some element of [a] (note that you could just as well return the empty list for every input and it would still typecheck, which is why the guarantee is worded a bit strangely).
It is very often correct though. Usually what we say is that if you understand the problem and your solution compiles it probably works. If it doesn’t work you likely don’t understand the problem.
You experience this more often programming in Haskell than in Rust (I think because of HKTs), but it is still often the case in Rust.
All of their uses of const std::string & look like things that should be std::string_view in modern C++. I’m curious why they refactored their code to use C++98 idioms in 2021.
std::string_view does not guarantee a null-terminated string; std::string does. We have to use a few C libraries, so we would have to construct a temporary string each time instead of calling .c_str() on the const reference (to get a const char*), which would require more changes and thus more testing. We tried to keep this refactoring change as small and as scoped as possible, not changing behavior unless absolutely required. That will come in a next round of refactoring. Go read the part on the Big Redesign In The Sky for our refactoring workflow.
Why are we doing this right now and not way earlier? We only just got an updated compiler for the specific hardware we use that supports modern C++17; before that we had a half-baked C++11 with big parts either missing or not finished. Now we have a newer compiler, thus we can take advantage of newer features.
Note that std::string, like std::string_view, may contain null characters in the middle of the string. It’s therefore quite dangerous to rely on c_str() in any string that may contain attacker-controlled data: anything doing comparisons on the std::string will see the null byte, anything looking at the C string will see only the characters up to the first null.
If I were doing this refactoring, I’d start by pulling out the operations that folks are doing on C strings and turning them into things that took a pair of CharT iterators, then incrementally rewrite the code that takes C strings to use the new versions, then switch from const char* to std::string_view. LLVM actually went through this exact refactoring about a decade ago, when we still built with a C++98 compiler, so the lack of C++17 support wasn’t an obstacle. The flow was:
Introducing llvm::StringRef (before std::string_view existed), as a generic wrapper around some non-owned contiguous range of characters.
Introducing the operations that we needed on StringRef, including adding an llvm::Twine type for intermediate results of string modifications. A twine is a sequence of string refs, so you can cheaply concatenate a bunch of strings in a Twine and do a single allocation for the final result if you need it in a contiguous buffer (often you don’t - many of the common operations on strings are defined for twines).
Refactoring APIs to use StringRef instead of const std::string &, which meant that the same API could be called with a std::string or a char* (or anything else that can construct a StringRef, including a substring view from the middle of another buffer).
I’m confused because you seem to be, in 2021, refactoring your codebase to look like ours was at the start of our refactoring, around 2008.
Also, string_view is annoyingly limited compared to string. Ideally every read-only operation on string should be available on string_view, but in reality it seems the C++ committee lost interest along the way and only added the most common ones.
I had high hopes for refactoring to use string_view, but I kept running into places where I had to create a temporary string for no reason except to get to some const method.
(Sorry I can’t list the missing methods here; I don’t remember them offhand and I’m not near a compiler right now.)
I believe that this is because of a belief that string and string_view should represent storage with some guarantees on the contents. Operations on strings live in <algorithm> so that other string representations can be added easily.
I’m curious as to why Base::setName is not pure virtual. That would mean that if you forgot to override it you’d get a compile-time error, instead of the current situation where, if you forget to override it (or attempt to override it but make a mistake), you don’t find out until you get an assertion failure at run-time.
Oh, and GCC also supports -Woverloaded-virtual for anybody who would like to have this warning but isn’t using clang.
I can’t go back far enough in our repository to see if there is a reason for not being an abstract base class, and I cannot open the previous version control system from here (visual sourcesafe) to look back even further.
But yes, no one says this even in jest about C++.
Implement this function, as long as you don’t:
Then if it compiles, it’s correct.
What did you expect?
Thanks for the tip on GCC!