1. 29
  1.  

  2. 7

    But I think it’s clear that a neo-C with explicit-length strings would have been significantly more complicated from the get go than the C that we got, especially very early on in C’s lifetime.

    Unfortunately, I think time has shown that ‘simplicity’ is not everything. It may be simple in some sense, but you still have to track all of this yourself to write robust code. Operating on a type that has the length encoded in the first place may be slightly harder, but it also makes sure you get it right.

    I guess I agree with the first comment:

    Once you’ve de-prioritized safety, I think you’re right.

    1. 24

      I agree with this; I think C’s tradeoffs on things like strings are now wrong for today and probably have been wrong for a while. But I also tend to think that people who ask for C to have been different from the start are kind of asking for 1970s C to be a 2000s era language (with a 2000s era compiler on 2000s era hardware) instead of a 1970s one. I could go as far as to say that the tradeoffs C made at the time were probably necessary then.

      (I’m the author of the linked-to entry.)

      1. 4

        That is also very fair :)

        1. 3

          I don’t ask for C to have been different from the start. But I do ask for people to stop using it.

      2. 2

        I don’t really follow the argument here…it seems to rest on the idea that using a suffix of a string as another string is a common operation in C. I’ve written a lot of C, and other than maybe checking a file extension I don’t remember that being common at all. Stepping through the characters, comparing some characters, sure, but treating a pointer to the middle of a string as a string itself?
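For reference, the file extension case mentioned above looks something like this. It is only a sketch: `file_extension` is a made-up helper name, and it assumes `/` is the only path separator.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the extension (the text after
 * the last '.' in the final path component), or NULL if there is none.
 * The result is not a copy. It points into the middle of `path` and is a
 * valid C string only because it shares the original terminator. */
const char *file_extension(const char *path)
{
    assert(path != NULL);

    /* Only look at the last path component, so "a.d/file" has no extension. */
    const char *slash = strrchr(path, '/');
    const char *base = (slash != NULL) ? slash + 1 : path;

    const char *dot = strrchr(base, '.');
    if (dot == NULL || dot == base)  /* no dot, or a dotfile like ".profile" */
        return NULL;
    return dot + 1;
}
```

With a length-prefixed representation the same operation would need either a copy or a separate pointer-plus-length view type.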

        On the other hand, I wrote a lot of C using the original Mac OS toolbox, whose APIs all used “Pascal strings” whose first byte was the length count, and don’t remember any major issues other than remembering that fact.

        Windows NT APIs (not Win32 but the low level NT calls) use strings with initial 2-byte lengths…again, I don’t remember this being an issue.

        1. 3

          I’ve been writing some string manipulation code with C, and I rather liked being able to just move pointers around and insert null characters to manipulate strings in place. Being able to split a longer string into many substrings just by making pointers and nulls is neat.
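A minimal sketch of the kind of in-place split described here, assuming a writable buffer holding a `key=value` line (`split_key_value` is a hypothetical name, not from any library):

```c
#include <assert.h>
#include <string.h>

/* Splits "key=value" in place by overwriting the first '=' with a
 * terminator. Afterwards `line` reads as the key and the returned pointer
 * reads as the value. Nothing is copied or allocated. Returns NULL and
 * leaves `line` untouched if there is no '='. */
char *split_key_value(char *line)
{
    assert(line != NULL);

    char *eq = strchr(line, '=');
    if (eq == NULL)
        return NULL;

    *eq = '\0';     /* terminate the key where the '=' used to be */
    return eq + 1;  /* the value shares the original terminator */
}
```

Note that the buffer must be writable, so passing a string literal is undefined behavior.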

          1. 2

            It may be relevant to consider here that document processing and publication was the first real use of the C/Unix system at Bell Labs. It was the justification that Ken Thompson and Dennis Ritchie used to get the PDP-11 that would give them the room to do more than the proof-of-concept experiments they were trying out on the borrowed PDP-7 they were using. Instead of selling it as an OS development project, which is what Thompson really wanted to do since they dropped out of the Multics project, they sold it as a document processing platform for their patent clerks to use to prepare patent documents. In the context of editing large text documents, this usage of substrings of strings as strings is probably a bit more well-motivated.

            There’s now a GitHub repository where a ‘development history’, such as can be reconstructed today from backup tapes, captures the development of Unix from its start as PDP-7 assembly to relatively modern SysV and BSD forms. The really early versions, from when Unix was still entirely an internal project to Bell Labs, are pretty interesting. You’ve got the bare bones of a minimalistic multi-user operating system and then a whole bunch of sophisticated document preparation tools (roff and friends, the successors of which still format all our Linux man pages into nicely formatted ASCII text or PostScript) and compiler development stuff.

            There is plenty to criticize about C in today’s context; if you look at the amount of change that pre-existing standardized languages like Fortran (or even C++, recently) have gone through over the years, C basically ended up fossilized once it got through the ANSI process in ’89. There are plenty of design decisions that ought to be revisited in light of the sophistication of today’s compiler technology and developments in programming language theory, but the standardization committee seems content to let C be a legacy platform and have C++ be where all but the bare minimum of new features and fixes get integrated.

            But in the context of the late 1960s and early 1970s when the key design points of C were set (most of the semantics were straight from BCPL, and the type structure owes much to Algol 68) C was a remarkably well-designed language for what it was intended to do and how it was used. For contrast, the MUMPS language was developed just a few years before on a PDP-7 as well; take a look at it if you want to see how terrible things could be. People still write MUMPS code to run hospital databases today….

            1. 1

              It can be used for more general substrings than just suffixes if you’re willing to stick destructive NUL placeholders in the string during in-place manipulation. Since strtok() dates to an early version of C, that seems to have been, if not a common use, at least a use that the designers had in mind at the time.

              1. 2

                Yep, strtok for tokenizing and parsing strings into structures without copying. Not that important on modern computers, but back then …
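Roughly what that looks like with the standard `strtok()`; the wrapper name `split_words` and the choice of space and tab as delimiters are just assumptions for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits `line` in place on spaces and tabs and stores up to `max` token
 * pointers in `out`. Returns the number of tokens stored. Every token
 * points into `line`; strtok() overwrites the delimiter after each token
 * with a terminator, so no token is ever copied. */
size_t split_words(char *line, char **out, size_t max)
{
    assert(line != NULL && out != NULL);

    size_t n = 0;
    for (char *tok = strtok(line, " \t");
         tok != NULL && n < max;
         tok = strtok(NULL, " \t")) {
        out[n++] = tok;
    }
    return n;
}
```

One caveat: `strtok()` keeps hidden static state between calls, so this sketch is not safe to use from several threads or in nested tokenizing loops.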

            2. 1

              Advantages of null-terminated strings:
              1. Can truncate an existing string wherever you feel like (ignoring multi-byte string issues)
              2. Backwards compatibility (at best we can hope for an interim period where both implementations are used for many years)
              3. Language independent
              4. As long as our strings are null-terminated correctly we will never get a buffer overflow (and as long as we check the length of source strings before allocating/placing them into destination strings)

              Advantages of something like a struct/class with size_t length; char* buffer; (not Pascal strings, 255 bytes hasn’t been enough for a while):
              1. Don’t have to waste time calling strlen; even better, we don’t have to call it redundantly multiple times
              2. As long as we are provided the correct lengths we will never get a buffer overflow (and as long as we check the length of source strings before allocating/placing them into destination strings)

              Security:
              For a null-terminated string if we are passed a string that doesn’t have a null-terminator in it we can crash when trying to read past it.
              On the other hand we can do the same thing by passing an incorrect length.
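To make the struct variant concrete, here is a rough sketch. The names `struct str` and `str_copy` are invented for illustration and do not come from any particular library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical explicit-length string: the length travels with the
 * pointer, so nobody has to call strlen() to find it. */
struct str {
    size_t length;
    char *buffer;
};

/* Copies `src` into `dst`, whose buffer can hold `capacity` bytes.
 * Refuses (returns false, dst unchanged) if the source does not fit.
 * The check is only as good as src.length: a caller that passes a wrong
 * length can still make memcpy() read out of bounds. */
bool str_copy(struct str *dst, size_t capacity, struct str src)
{
    assert(dst != NULL && dst->buffer != NULL);

    if (src.length > capacity)
        return false;

    memcpy(dst->buffer, src.buffer, src.length);
    dst->length = src.length;
    return true;
}
```

The bounds check costs one comparison instead of a scan, but it trusts the stored length in the same way the null-terminated version trusts the terminator.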

              1. 2

                A classic security issue with null-terminated strings is using them for things that can actually contain null bytes, causing subsequent code to see a truncated version of the original. I believe that is the root cause of CVE-2009-2702 (‘KDE KSSL null character certificate spoofing vulnerability’), for example.
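A toy illustration of that general failure mode (not the actual KDE code; the function name and the hostnames in the note below are made up):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Deliberately naive check: `name` is a counted field of `len` bytes, but
 * the comparison treats it as a C string and ignores `len`. strcmp()
 * stops at the first NUL byte, so anything after an embedded NUL is
 * never looked at. */
bool naive_name_matches(const char *name, size_t len, const char *expected)
{
    assert(name != NULL && expected != NULL);
    (void)len;  /* the bug: the real length is never consulted */
    return strcmp(name, expected) == 0;
}
```

Given the bytes `"bank.example\0.evil.example"`, this check reports a match for `bank.example`, even though the full field names a different host.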

                1. 1

                  Ahh I see, this is part of a patch for that issue:
                  - if (!s.isEmpty()) {
                  + if (!s.isEmpty() &&
                  + /* skip subjectAltNames with embedded NULs */
                  + s.length() == d->kossl->ASN1_STRING_length(val->d.ia5)) {

                  That is a problem with having a size_t member: you now have two lengths, the size_t one and the null-terminated one, which may or may not be the same depending on where you got the string from or how the length was calculated. We get this problem in C++ too, and probably in most C string object libraries.
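In plain C, the same kind of "do the two lengths agree" check might look like the following. This is a sketch with hypothetical function names, not the KDE fix itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the counted field contains a NUL byte before its recorded end,
 * i.e. its strlen()-style length would be shorter than `len`. */
bool has_embedded_nul(const char *data, size_t len)
{
    assert(data != NULL);
    return memchr(data, '\0', len) != NULL;
}

/* Compares a counted field against an expected C string, rejecting any
 * field whose two notions of length disagree. */
bool name_matches(const char *data, size_t len, const char *expected)
{
    assert(expected != NULL);
    if (has_embedded_nul(data, len))
        return false;
    return len == strlen(expected) && memcmp(data, expected, len) == 0;
}
```

The comparison uses `memcmp()` over the recorded length, so it never depends on where a terminator happens to be.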