1. 8
    1. 4

      These were rejected from glibc as ‘horribly inefficient BSD crap’ by Ulrich Drepper (glibc maintainer at the time). I wonder how many millions of dollars of security vulnerabilities that one decision was responsible for.

      There are good reasons for avoiding these functions. They are memory safe, but they might accidentally truncate strings, which can have other security implications (imagine if the strings are paths, for example). Generally, they catch bugs where you think you’re tracking the length of a string but the length is wrong, which should be impossible with a good string abstraction but is depressingly easy in C.

      C++’s std::string is one of the worst string abstractions in any programming language, but eliminates the entire class of bugs that functions like this in C were designed to help mitigate. These days, if I need to deal with C string interfaces, I use std::string or std::vector and check in code review that every call to .c_str() or data() is accompanied by a call to .size() at the same location. If you have to use C, there are a load of string libraries (I think glib has a moderately competent one) that avoid all of these pitfalls. Dealing with raw C strings is best avoided where possible.
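
      To make that discipline concrete, here is a minimal sketch; the C-style function is a stand-in invented for illustration, not a real interface:

      ```cpp
      #include <cstddef>
      #include <cstring>
      #include <string>
      #include <vector>

      // Toy C-style API, standing in for a real C interface: copies up to
      // cap - 1 bytes into dst and always null-terminates.
      size_t c_style_copy(char* dst, size_t cap, const char* src, size_t len) {
          size_t n = len < cap - 1 ? len : cap - 1;
          std::memcpy(dst, src, n);
          dst[n] = '\0';
          return n;
      }

      // The pattern from the comment: data() and size() come from the same
      // std::string at the same call site, so the length cannot drift out of
      // sync with the buffer the way a hand-tracked size_t in C can.
      size_t copy_name(const std::string& name, std::vector<char>& buf) {
          buf.resize(name.size() + 1);
          return c_style_copy(buf.data(), buf.size(), name.data(), name.size());
      }
      ```

      The point is that the pointer and the length are unpacked from the same object at the last possible moment, which is exactly what the code-review rule above checks for.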

      1. 3

        C++’s std::string is one of the worst string abstractions in any programming language

        In what way? C++ deficiencies aside, and maybe the requirement for C-string compatibility (which I would argue is also a C++ deficiency), its worst sin seems to be that string_view took until C++17, which is hardly a fault with std::string as an abstraction.

        1. 6

          So very many reasons. A few off the top of my head:

          • It doesn’t actually contain characters; it contains bytes. There are some specialisations that, in recent versions of the standard, are defined to hold UTF-8/16/32 code units.
          • Even where it does contain unicode, it doesn’t actually provide you with any interfaces for iterating over unicode code points, so using the methods for trimming the string can easily transform valid unicode into invalid unicode.
          • When it doesn’t contain unicode, it doesn’t capture the encoding that it is storing in any way, so good luck if someone hands you a std::basic_string<char> (a.k.a. std::string).
          • It tightly couples the representation (a flat array of bytes) with the interface (something std::string_view also does). I’ve managed to get 50% aggregate transaction-throughput improvements in server workloads, and similar efficiency gains in desktop apps, just from changing the underlying representation of a string to tailor it to the specific workload. Leaving that kind of performance on the table makes no sense for a language that generally micro-optimises for performance at the expense of everything else.
          • It interoperates with C in exciting ways. Its size() method returns the number of char units in its buffer, not characters. A std::string is required to have a null byte at the end of its buffer (not explicitly, but as a consequence of a combination of other requirements), but whether this null byte is counted as part of the length depends on how the string was created, so you can end up with surprising length mismatches.
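
          The truncation point is easy to demonstrate; the literal here is illustrative:

          ```cpp
          #include <string>

          // "café": é (U+00E9) is the two-byte UTF-8 sequence 0xC3 0xA9, so
          // this four-character string occupies five bytes.
          const std::string cafe = "caf\xc3\xa9";

          // "Trim the last character" by byte index slices é in half, leaving
          // a dangling UTF-8 lead byte: a valid string becomes invalid.
          std::string trim_last(const std::string& s) {
              return s.substr(0, s.size() - 1);
          }
          ```
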
          1. 2

            The only thing which doesn’t seem to be a C++ deficiency here is the 4th item, and maybe the 5th.

            And the 4th seems relatively normal? You don’t explain what your changes were, but even for C++ the standard collections can’t cater to every use case, and trying to do so can yield significantly worse results (std::vector<bool> being one such cautionary tale).

            1. 5

              The only thing which doesn’t seem to be a C++ deficiency here is the 4th item, and maybe the 5th.

              I disagree, most of these are fixable in C++. A number of them are fixed in C++ with third-party libraries. ICU has string objects that don’t suffer from any of them, though ICU is a huge library to pull in to just fix these problems.

              And the 4th seems relatively normal?

              Most Smalltalk-family languages (I think JavaScript is probably the only exception?) provide an interface for strings and permit different implementations. Objective-C has APIs for very efficient iteration, which allow implementations to batch. ICU’s UText abstraction is very similar and is written in C++.

              You don’t explain what your changes were, but even for C++ the standard collections can’t cater to every use case, and trying to do so can yield significantly worse results (std::vector<bool> being one such cautionary tale).

              This is a false dichotomy. The goal is not to cater to every use case, it’s to allow interoperability between data structures that each cater to a specific use case. My string might be represented as ASCII characters in a flat array; as UTF-8 with a skip list or bitmap for finding code point boundaries; as a tree of contiguous UTF-32 code points for fast insertion; as a list of reference-counted immutable (copy-on-write) chunks for fast substring operations; embedded in a pointer, with the low bit used as a tag, to avoid memory allocation for short strings; or as any combination of the above (or something completely different: this list is a subset of the data structures that I’ve used to represent runs of text in different use cases). In a language with a well-designed string abstraction, no one needs to know which of these I’ve picked, and I don’t need to make that decision globally: I can choose different representations without changing any callers.

              With Objective-C’s NSString and NSMutableString (which are not perfect, by any means), I can try a dozen different data structures for the strings in the places where string manipulation is the performance-critical part of my workload, without touching anything else. I can do the same in C++ codebases that use ICU’s UText. I can do it with (usually) better performance in C++ with a few C++ string libraries that define abstract interfaces and template their string operations over concrete instantiations of those interfaces. I cannot with interfaces that use std::string and I cannot easily write a templated interface where std::string is one of the options because its representation leaks into its interface in so many places.
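
              A minimal sketch of that pattern; both representations and the interface here are invented for illustration, not taken from any of the libraries above:

              ```cpp
              #include <cstddef>
              #include <string>
              #include <vector>

              // One representation: ASCII characters in a flat array.
              struct FlatAscii {
                  std::string bytes;
                  size_t length() const { return bytes.size(); }
                  char32_t at(size_t i) const { return char32_t(bytes[i]); }
              };

              // Another: a list of contiguous UTF-32 chunks (a crude rope).
              struct Utf32Chunks {
                  std::vector<std::vector<char32_t>> chunks;
                  size_t length() const {
                      size_t n = 0;
                      for (const auto& c : chunks) n += c.size();
                      return n;
                  }
                  char32_t at(size_t i) const {
                      for (const auto& c : chunks) {
                          if (i < c.size()) return c[i];
                          i -= c.size();
                      }
                      return 0;
                  }
              };

              // A caller written once against the code point interface works
              // with either layout; swapping the representation never touches
              // this code.
              template <typename Str>
              size_t count_cp(const Str& s, char32_t needle) {
                  size_t hits = 0;
                  for (size_t i = 0; i < s.length(); ++i)
                      if (s.at(i) == needle) ++hits;
                  return hits;
              }
              ```
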

              1. 2

                I disagree, most of these are fixable in C++. A number of them are fixed in C++ with third-party libraries. ICU has string objects that don’t suffer from any of them, though ICU is a huge library to pull in to just fix these problems.

                You can’t change the existing APIs; they just can’t be fixed.

                Adding “correct” behaviour and enumerating multi-byte characters correctly means you need a significant chunk of the complexity of ICU; otherwise you’re restricted to just enumerating code points, and you lack many of the character-introspection functions you would often want.

                1. 1

                  If you can operate on code points in the core string APIs then it’s easy to add the richer Unicode things in an external library that cleanly interoperates with your core standard library.

                  1. 1

                    That argument applies to returning chars as well, and returning individual bytes is much more efficient, etc.

                    What is the case you see where enumerating code points is the correct behavior?

                    1. 2

                      To me, it all comes down to the ability to provide my own data structures. Consumers of text APIs want to be able to modify code points (e.g. to add or remove diacritics); they never want to add or remove a single byte in the middle of a multi-byte character, because doing so can corrupt the whole of the rest of the string. If the storage format is exposed, I can’t store the raw data in a more efficient encoding.

                      For example, both Apple and I have made some huge wins from observing that a very large number of strings in desktop applications are short ASCII strings (path components, field names in JSON) and providing an optimised encoding for these, behind an interface that deals in Unicode characters, so that calling code is oblivious to the fact that the data is embedded in the pointer most of the time. In desktop apps that I profiled, this one optimisation reduced the total number of malloc calls by over 10%; Apple took it further and added 6- and 5-bit encodings of the most common characters, so probably saves even more on iOS. If the caller needs to know that a string is ASCII (or one of Apple’s compressed encodings) because the bytes are exposed, then I can’t do this optimisation without modifying callers.
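
                      A toy version of the pointer-embedding trick (nothing like the production implementations; it assumes 64-bit pointers, ASCII without embedded nulls, and an allocator that keeps the low bit clear):

                      ```cpp
                      #include <cstddef>
                      #include <cstdint>
                      #include <string>

                      // Strings of up to 7 ASCII bytes live directly in a
                      // pointer-sized word, tagged with the low bit; anything
                      // longer goes to the heap. Callers only ever see value().
                      class TinyString {
                          uintptr_t bits_;
                      public:
                          explicit TinyString(const std::string& s) {
                              bool ascii = true;
                              for (unsigned char c : s)
                                  if (c > 0x7f) { ascii = false; break; }
                              if (ascii && s.size() <= 7) {
                                  uintptr_t v = 1;  // low bit = inline tag
                                  for (size_t i = 0; i < s.size(); ++i)
                                      v |= uintptr_t(uint8_t(s[i])) << (8 * (i + 1));
                                  bits_ = v;
                              } else {
                                  // operator new keeps the low bit clear.
                                  bits_ = reinterpret_cast<uintptr_t>(new std::string(s));
                              }
                          }
                          ~TinyString() {
                              if (!(bits_ & 1))
                                  delete reinterpret_cast<std::string*>(bits_);
                          }
                          TinyString(const TinyString&) = delete;
                          TinyString& operator=(const TinyString&) = delete;

                          bool inlined() const { return bits_ & 1; }
                          std::string value() const {
                              if (!inlined())
                                  return *reinterpret_cast<std::string*>(bits_);
                              std::string out;
                              for (int i = 1; i < 8; ++i) {
                                  char c = char((bits_ >> (8 * i)) & 0xff);
                                  if (!c) break;
                                  out.push_back(c);
                              }
                              return out;
                          }
                      };
                      ```

                      Because the interface never exposes the bytes, whether a given string is inlined or heap-allocated is invisible to every caller.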

                      1. 1

                        This implies it would be more valuable to have explicit iterators (i.e. no privileged default): ::codepoints(), ::characters(), ::bytes_if_thats_what_you_really_want(), etc.

                        ::codepoints() on its own means that the common cases (displaying characters, substrings, etc.) aren’t possible without a separate library to do the code point -> character coalescing. The byte -> code point conversion is at least a trivial, static one that anyone could implement. Code point -> character conversion is the hard part, and it’s one where you want to be in agreement with your platform, not with whatever version of an external library you happen to be on.
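
                        The byte -> code point step really is small; a minimal sketch for well-formed UTF-8 (no validation, which a real library would of course need):

                        ```cpp
                        #include <cstddef>
                        #include <string>
                        #include <vector>

                        // Decode well-formed UTF-8 into code points. The lead
                        // byte's high bits give the sequence length; each
                        // continuation byte contributes 6 more bits.
                        std::vector<char32_t> codepoints(const std::string& s) {
                            std::vector<char32_t> out;
                            for (size_t i = 0; i < s.size();) {
                                unsigned char b = s[i];
                                int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
                                char32_t cp = b & (0xFF >> (extra ? extra + 2 : 1));
                                for (int k = 0; k < extra; ++k)
                                    cp = (cp << 6) | (s[i + 1 + k] & 0x3F);
                                out.push_back(cp);
                                i += extra + 1;
                            }
                            return out;
                        }
                        ```

                        Grouping those code points into user-perceived characters (grapheme clusters) is the genuinely hard part that needs Unicode tables, which is exactly why it wants to live with the platform.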

                        Of course the problem is that that might make iteration “less efficient” per the C++ committee, and heaven forbid anyone do something useful and correct if it might be slower than doing something incorrect \o/

          2. 2

            When it doesn’t contain unicode, it doesn’t capture the encoding that it is storing in any way, so good luck if someone hands you a std::basic_string<char> (a.k.a. std::string).

            This is no longer an issue. If I have an 8-bit string, the encoding is UTF-8 unless otherwise specified. And I’m going to demand a wrapper struct to carry the encoding identifier alongside the string. And a detailed explanation for why you aren’t using UTF-8 or converting to UTF-8 at the system ingest boundary.
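
            A sketch of such a wrapper (the names are invented; Latin-1 appears because it is the one conversion simple enough to show inline):

            ```cpp
            #include <string>

            // The encoding identifier travels with the bytes instead of being
            // implied by convention.
            enum class Encoding { Utf8, Latin1 };
            struct TaggedString {
                Encoding enc;
                std::string bytes;
            };

            // At the system ingest boundary, convert everything to UTF-8 once.
            // Latin-1 code points 0x80-0xFF become two-byte UTF-8 sequences.
            TaggedString to_utf8(const TaggedString& s) {
                if (s.enc == Encoding::Utf8) return s;
                std::string out;
                for (unsigned char c : s.bytes) {
                    if (c < 0x80) {
                        out.push_back(char(c));
                    } else {
                        out.push_back(char(0xC0 | (c >> 6)));
                        out.push_back(char(0x80 | (c & 0x3F)));
                    }
                }
                return {Encoding::Utf8, out};
            }
            ```
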

            1. 2

              Good for you. C++, in contrast, will happily construct a std::string from a char* provided by some API that gives you characters in whatever character encoding LC_CTYPE happens to be set to by an environment variable for the current user. Now, to be fair, that’s a POSIX problem not a C++ one, but it’s still a big pile of suffering.

          3. 2

            Pretty much all of these problems are a by-product of their age, much like every other API from that era. Then you run into the general unbreakable-API problem that all C++ APIs have.

            The real issue is “wtf has there not been a new interface added that can handle the multibyte characters?”. Part of the problem is that the various C++ committees seem super opposed to anything that adds “overhead” and their definition of overhead can be annoying :-/

            1. 1

              I would argue that conflating representation with interface is not a problem of age. Smalltalk-80 didn’t do this, and it was one of the influences on C++. OpenStep kept the separation to great effect (and it was critical to NeXT being able to run rich DTP applications in 8 MiB of RAM), and it predates the C++ standard that introduced the core of the modern standard library by six years.

          4. 1

            whether this null byte is counted as part of the length depends on how the string was created

            I’ve never heard of this issue and never run into a string whose length includes the trailing null — have I just gotten lucky in all the time I’ve been using std::string?

            1. 1

              If you construct them from C strings, you’re fine. I’ve hit this problem twice in production code, in about 20 years of using C++. The second time was fairly quick to debug because I knew it could be a problem. The issue the first time was that one string was constructed from a C string, the other from a pointer and length, and the length (due to the flow that created it) was the length of the allocation including the null. One was used as a key in a hash table, the other was looking it up. The two strings were different and so they didn’t match. It took me a day of debugging because even printing the two strings in a debugger showed the same thing. I eventually noticed two identical strings with different lengths.
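
              The mismatch is easy to reproduce; the string here is illustrative:

              ```cpp
              #include <string>

              // One string built from a C string, the other from a pointer and
              // a length that (as in the bug described) included the
              // terminating null.
              const char raw[] = "config";                     // sizeof raw == 7, incl. '\0'
              const std::string from_cstr(raw);                // length via strlen: 6
              const std::string from_ptr_len(raw, sizeof raw); // length of allocation: 7

              // Both print as "config" in a debugger, but as hash-table keys
              // they compare (and hash) as different strings.
              ```
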

              1. 1

                Oh! Yes, you can get string objects with embedded nulls, including at the end. I agree, that can be super confusing when it occurs.