1. 32
    1. 8

      Great write-up. The MULTICS kernel developers also used PL/I with prefixed strings specifically to avoid potential vulnerabilities as described here in Section 2.3.1. Its successor, UNIX, did the opposite with many vulnerabilities resulting. I always pushed for prefix strings where possible. Of course, now days we have tools like Frama-C and static analyzers that might make it a moot point for those using them.

      Fun fact: null-terminated strings are in C mainly because it worked well in the PDP-11’s limitations as described here. Them sticking around after hardware improved wasn’t about what was best, design decision. Definitely worth reconsidering string implementation where possible if you’re using Core i7’s and such with chat apps using 1GB of RAM. :)

      1. 5

        Thanks for the history lesson. I added a sentence about that and linked your comment at the bottom.

        1. 2

          Thanks for the credit!

    2. 4

      This is really nice. Thanks for writing this up.

      A popular optimisation is to store capacity in log2 – use char capacity then realloc(old,1<<new_capacity) – this saves a bit of space and all the CPUs you’re likely to target have a special operator for capacity checking. 1<<255 is stupid big and 1<<47 requires only 6 bits so you have a couple left over which is handy (I use it for storing whether the region was mmap’d directly from a disk object).

      There’s a further trick you can use to get another couple bits – you can store capacity in a LUT. logP might use 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53 which gets you to 4-bits with only a little waste. You can also use padding, e.g. 16<<43 is the same as 1<<47. And so on.

      This works for any kind of array, not just character-arrays (strings).

      If you’re dealing with objects, giving each of your objects a “bucket” number based on its size can work similarly.

    3. 8

      The C programming language defines a string as char*.

      No it doesn’t. char* is a type. String is not a type in C.

      Here is how the definition reads: A string is a contiguous sequence of characters terminated by and including the first null character.

      The only advantage of this representation is space efficency.

      There are other advantages, such as its simplicity. Or the fact that you can split a blob of bytes into strings without having to first count the number of strings and allocate a bunch of length-capacity-pointer tuples for each. It works great with fixed size buffers and static storage. (I’ve heard static analysis tools like fixed sizes and static storage.)

      If you’re going to shit on C strings, try not to start the post with a completely incorrect statement. It’ll look a little more credible.

      I’m not yet quite convinced that replacing simple NUL-terminated strings with reference counting, chunks, lengths and capacities and the arithmetic that goes with it plus allocated storage is all inherently safer.

      I’d say most of these supposed safety benefits can only be realized when there’s a complete, carefully vetted API & implementation that wraps it all in a way that doesn’t allow the user to shoot themselves in the foot. I’m afraid it’d have to be opaque. Because if people start doing arithmetic and tricks with the bare pointers, they will fuck it up. And I’d probably have to take out features like slices, because they’ll either surprise somebody (when some string suddenly got immutable) or immutability isn’t enforced by the API and someone shoots their foot off.

      Without these restrictions, it’s probably no more safe than plain C strings. Heck, I’d say the number of things you can get wrong is just greater. It’s a more complicated alternative to C strings.

      1. 12

        I’d say most of these supposed safety benefits can only be realized when there’s a complete, carefully vetted API & implementation that wraps it all in a way that doesn’t allow the user to shoot themselves in the foot.

        I think that’s the whole point. C’s problem – and yes, it is a problem – is that it purports to have a string API but it basically doesn’t. Having programmed in C for a while and having used Lisp, Python and vast array of other languages, there’s no question that C’s “string” API is the least friendly and most difficult to use. As you say, string is not a type in C.

        Without these restrictions, it’s probably no more safe than plain C strings.

        This is true if you expose these to the user. Most (all?) APIs don’t so this. This is precisely why C “strings” are unpleasant to work with when you’re doing actual string manipulations.

        If you’re going to shit on C strings, try not to start the post with a completely incorrect statement.

        I think it’s universally accepted that a “C string” is a char * even though the standard may not say it directly. Besides, you have a whole section of the standard (7.21 in my copy) entitled “String handling” where most functions take a const char * as the “string”.

        1. 2

          I totally agree on the (lack of a good) API being a big problem. Then again, if you had a good, opaque string API, the user-perspective discussions about nul-terminated-vs-pascal-strings should just die because it’s an implementation detail at that point. Internally, different implementations could have however few or many different possible representations to support optimal space & efficiency characteristics strings of different lengths and access patterns. (Do I want something that complicated in C however?). The whole complaint about C “strings” being unpleasant to work with becomes moot because you’ll be working through an API instead of the internal representation.

          I think it’s universally accepted that a “C string” is a char * even though the standard may not say it directly.

          A pointer-to-char can be a NULL pointer, or it can point to zero, one, or more C strings. The storage for a string definitely doesn’t reside inside a pointer-to-char.

          It might sound like pedantry but way too many people are confused over arrays, pointers, and strings in C that it is impossible to tell whether someone’s just casually using sloppy terminology or really not thinking straight. A pointer-to-char is not at all a string, and I prefer not to call it one so as to avoid contributing to this confusion.

          And yes, I’ve seen people hating on C strings because they are afraid that they are not NUL-terminated. Such a thing is by definition not a C string, but people get confused because they think char* is a string. If you have to worry about that kind of thing, it’s no surprise the already somewhat lackluster API can feel like a complete minefield! It doesn’t help that there are broken functions that can take a pointer to string and produce something that is sometimes not a string. Avoid these.

          If you only people would use the APIs that guarantee the output is a string if the input is a string, life suddenly gets a lot easier. But you also need to recognize whether you’re working with strings or just pointers to an arbitrary blob of chars.

          1. 2

            Then again, if you had a good, opaque string API, the user-perspective discussions about nul-terminated-vs-pascal-strings should just die because it’s an implementation detail at that point.

            You are probably looking for vstr if you don’t know about it already. It’s the best “good, opaque string API” I’ve found. Its biggest problem is probably being so comprehensive (large) but the author has a good solution in the form of a second, smaller (smaller) “opaque string API”.

            However these are not the best string API. Opacity has a cost, and if it’s not clear, consider doing operations on many strings – where is vstr_srch_manyvstr_any()? Who has good support for building DFA or a bloom filter? Why is parsing so hard? Details sometimes matter.

            The best string API is an array API that treats a string as an array of characters. If you have good tools for arrays, you’ll have no problems with strings. Even in C. Hiding inside of q/kdb+ (for example) are bindings for C that create and manipulate performant arrays, but because they’re arrays you can mmap them, or use a vm_remap/memfd, or use any other library that already works with arrays. Or implement your fancy special search or whatever – you’ve got a very solid foundation for anything string related: A fast webserver, a algol/C-like programming language parser.

            1. 2

              Once you’re in Unicode territory, it’s difficult to even think of arrays of characters. Not all of the code points represent a standalone character, and there are often several different code point sequences that end up mapping to the same visible or even semantic construction.

              Basically: strings are exhausting, even with good facilities!

              1. 1

                Once you’re in Unicode territory, it’s difficult to even think of arrays of characters.

                So many people had the exact-same idea, so I shouldn’t think it that difficult: C’s early efforts into Unicode were of a wide character; Erlang implements a “string” as a (linked-) list of integers. Java thought they were so clever for making a “character” 16-bits. And so on. Maybe 16-bits isn’t enough, maybe we need 32-bits. Or the full-numeric tower. Or nested lists (for combining ints!).

                Without passing judgement: This actually works very well if you have enough memory (you probably do), and if you’re going to be doing a lot of glyph or code-point level manipulation.

                However there’s a reason most people kept the byte-array and just used utf8: It was good enough, and people don’t actually index or manipulate the arrays of characters that much anyway. I certainly never need to know the number of characters or codepoints or surrogate pairs, runes or glyphs. And on one hand I can count the number of times I’ve needed to write a routine that needed to compute the number of pixels wide a string would be rendered. My advice: Keep them packed. Save the bits.

      2. 5

        My intention was not to “shit on C strings” but give an overview over the alternatives.

        You are right that Pascal strings cannot be split without reallocations while I can insert NUL into C strings. I did not think of that use case. Where does it usually occur?

        1. 4

          My intention was not to “shit on C strings” but give an overview over the alternatives.

          Then I apologize for using such an expression. I guess it’s just too fashionable to hate on anything and everything to do with C these days, and it’s getting to me. “C strings are only good for one thing, here’s how to do it better” just sounded like it..

          You are right that Pascal strings cannot be split without reallocations while I can insert NUL into C strings. I did not think of that use case. Where does it usually occur?

          Tokenizers & parsers (or just field splitters, see anything that uses strtok or strtok_r) for simpler formats where parts are always separated by a character that you can replace with a NUL.

        2. 3

          I can insert NUL into C strings. I did not think of that use case. Where does it usually occur?

          So I’m reading a file. I read() or mmap() the beast.

          Now I want to break it into lines.

          One way is to copy each line into a new buffer. Another would be to treat the entire file as a chunk and create a list of rc_string.

          And yet another would be to scan for a newline, replace it with 0, use the string, then resume from the new offset. this is extremely space-efficient and very simple code:

          for(x=p=buffer;p<endp;++p)
            if(*p=='\n'){*p=0;doit(x);x=p+1;}
          
    4. 2

      Most of pre c++11 std::string implementations used copy on write approach. Post c++11 they use short string optimisation (which is not mentioned in this post).

      1. 2

        Short string optimization is now mentioned. I got a great link from HN.