
  2. 23

    I’ve got to say, this article is so well written: in-depth and entertaining. I do kind of feel like I learned more about how much work goes into making strings work in C than about Rust strings, though, but I’m not complaining about it.

    Is there a down side to Rust having so many more built in methods for handling strings? Or does the strict compiler make it not an issue?

    1. 7

      Is there a down side to Rust having so many more built in methods for handling strings?

      Do you mean methods (functions) or types? If the latter, the main issue I see is the learning curve. Having two main string types (&str and String) is confusing at first, but it’s consistent with the rest of the language and standard library. Once you get more comfortable with the language it’s fine. Also, there is some language machinery (deref coercion, via the Deref<Target = str> impl for String) that lets you supply a String reference to functions that take &str, which makes them fairly painless to use in practice.
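
      A minimal sketch of that coercion in action (the function name here is just illustrative):

      ```rust
      fn takes_str(s: &str) -> usize {
          s.len()
      }

      fn main() {
          let owned = String::from("hello");
          // &String coerces to &str via String's Deref<Target = str> impl.
          assert_eq!(takes_str(&owned), 5);
      }
      ```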

      1. 2

        Yeah, sorry, “methods” was the wrong word to use. I meant more like: is there overhead for the Rust language to handle all the UTF-8 string stuff internally by default? In contrast to C, which doesn’t.

        1. 5

          There’s some overhead whenever you’re forced to deal with potentially non-UTF-8 strings, which tends to happen with IO. That’s a mental and performance penalty. And obviously there’s a slight space/performance penalty in comparison to C strings. But then again, if you need that sort of performance, you can use those too.

          Your valid Strings always being good Unicode is pretty great.
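
          For example, the standard library makes that IO boundary explicit: a lossy conversion replaces any invalid bytes instead of trusting them (a small sketch):

          ```rust
          fn main() {
              // "foo", one invalid byte, then "bar".
              let bytes = [0x66, 0x6f, 0x6f, 0xFF, 0x62, 0x61, 0x72];
              let s = String::from_utf8_lossy(&bytes);
              // The invalid byte becomes U+FFFD REPLACEMENT CHARACTER.
              assert_eq!(s, "foo\u{FFFD}bar");
          }
          ```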

          1. 6

            For the non-UTF-8 case, all of my string libraries (bstr, regex, aho-corasick, csv) handle them fine without any overhead. :-) As a bonus, they all support Unicode to some degree and will mostly just ignore invalid UTF-8, depending on what you’re doing. The bstr docs go into a bit more detail.

            1. 4

              I have yet to find a codebase where UTF-8 validation was a substantial overhead.

              1. 1

                Validation isn’t the only overhead though. Indexing is another. But I agree with your sentiment: worrying about this is 99% a premature optimization.
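
                To make the indexing overhead concrete: Rust only allows O(1) indexing by byte offset, and reaching the nth code point is a linear walk:

                ```rust
                fn main() {
                    let s = "héllo"; // 'é' takes two bytes in UTF-8
                    assert_eq!(s.get(0..3), Some("hé")); // byte range on a char boundary: O(1)
                    assert_eq!(s.get(0..2), None);       // byte 2 falls inside 'é'
                    assert_eq!(s.chars().nth(1), Some('é')); // nth code point: O(n) walk
                }
                ```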

                1. 2

                  Indexing of textual values is even more fringe, unless you use strings for (sub)-keys. In that case, I recommend a byte field in Rust.
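
                  If “byte field” here means keying by raw bytes, a sketch might look like this (the names are purely illustrative):

                  ```rust
                  use std::collections::HashMap;

                  fn main() {
                      // Keys are raw byte slices: no UTF-8 validation, no indexing concerns.
                      let mut index: HashMap<&[u8], u32> = HashMap::new();
                      index.insert(b"user:1", 42);
                      assert_eq!(index.get(b"user:1".as_slice()), Some(&42));
                  }
                  ```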

        2. 2

          Nitpick (rightfully alluded to by the author):

          That last one is particularly cool - in German, “ß” (eszett) is indeed a ligature for “ss”. Well, it’s complicated, but that’s the gist.

          These days ß is a ligature for ss as much as w is a ligature for vv. Best to leave that sentence out, as it only distracts from the topic of the article.

          For extra points, a good string library would convert heinz große to HEINZ GROẞE.

          But because Unicode’s case mappings are stability-guaranteed and can’t be updated, we might be stuck with ß → SS forever, short of adding a German2 locale to Unicode.

          The use of two letters for a single phoneme is makeshift, to be abandoned as soon as a suitable type for the capital ß has been developed. (Duden 1925)

          It will probably take more than 100 years to resolve; better to start working on it now.
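
          Rust’s standard library follows that stable mapping today, so you get SS rather than ẞ:

          ```rust
          fn main() {
              // Capital ẞ (U+1E9E) has existed since Unicode 5.1, but the
              // default, stability-bound mapping is still ß -> SS.
              assert_eq!("heinz große".to_uppercase(), "HEINZ GROSSE");
              assert_eq!('ß'.to_uppercase().collect::<String>(), "SS");
          }
          ```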

        3. 6

          Oh wow, this article looks really good at first glance. So much so that I’m going to put aside some time to work through all the examples from start to finish properly.

          1. 3

            I should note that C11 did introduce a few limited functions to work with UTF-8 specifically; <uchar.h> defines char16_t and char32_t as well as mbrtoc16 (multibyte char to char16_t), c16rtomb (char16_t to multibyte char), mbrtoc32 (multibyte char to char32_t) and c32rtomb (char32_t to multibyte char). For the purposes of this specific article, however, this is not very helpful: toupper/towupper are only defined for char and wchar_t, respectively, and you’re not guaranteed that wchar_t can actually represent a char32_t, so uppercasing remains very hard.

            Grapheme clusters make these seemingly simple tasks even harder, and this is all before realizing that these functions are also locale-dependent (necessarily so, which also means they are OS-dependent) because of regional variations.

            1. 4

              I should note that C11 did introduce a few limited functions to work with UTF-8 specifically

              How are any of the things you mention for UTF-8 specifically? C11 doesn’t guarantee UTF-8 interpretation for mb arguments. The interpretation depends on locale, AFAIK C11 doesn’t guarantee the availability of UTF-8 locales (and in practice, there exist Windows versions that don’t have UTF-8 locales), and changing the locale affects all threads, so you can’t change the locale from within a multithreaded program.

              C11 doesn’t guarantee UTF-16 semantics for char16_t or UTF-32 semantics for char32_t. This is even less sensible than not guaranteeing two’s complement math. At least when C got the possibility of non-two’s complement math, non-two’s complement hardware existed. When char16_t and char32_t were introduced, there were no plausible alternatives for UTF-16 and UTF-32 semantics. C++20 guarantees UTF-16 and UTF-32 semantics for char16_t and char32_t.

              1. 2

                C11, § 7.28(1):

                The header <uchar.h> declares types and functions for manipulating Unicode characters.

                (emphasis mine). And J.3.4, on undefined behavior:

                The encoding of any of wchar_t, char16_t, and char32_t where the corresponding standard encoding macro (__STDC_ISO_10646__, __STDC_UTF_16__, or __STDC_UTF_32__) is not defined (6.10.8.2).

                I suppose you could read that as not necessarily meaning UTF-8 (or UTF-16 or UTF-32) if these specific macros are not defined, but at that point, an implementation is just so obviously insane that it’s no use trying to deal with it.

                Incidentally, you may find it interesting that a recent draft of C2x has dropped two’s complement support, see the mention of N2412 on p. ii.

                1. 4

                  dropped two’s complement support

                  You had me panicking for a moment, but looking at that PDF it looks like they’ve accepted N2412, which means they’ve guaranteed two’s complement support and dropped support for other sign representations.

                  1. 2

                    Right, I meant to say “dropped everything but two’s complement”, my bad. Not that I can edit it anymore.

                  2. 2

                    I suppose you could read that as not necessarily meaning UTF-8 (or UTF-16 or UTF-32) if these specific macros are not defined, but at that point, an implementation is just so obviously insane that it’s no use trying to deal with it.

                    Those macros being defined mean that wchar_t has UTF-32 semantics, char16_t has UTF-16 semantics, and char32_t has UTF-32 semantics. None of those are about UTF-8.

                    Incidentally, you may find it interesting that a recent draft of C2x has dropped two’s complement support, see the mention of N2412 on p. ii.

                    Fortunately, reasonable things flow from the C++ committee to the C committee.

                    1. 0

                      C++ is a prescriptive standard. It describes what it wants the world to be, and then hopes the world catches up. You can do that when the entire language community is on board and the language is complex enough to only have a few implementations.

                    C is a descriptive standard. C runs on a massive range of hardware. C cannot afford to exclude real existing hardware targets that work in weird ways. One’s complement and sign-magnitude integers really do exist in hardware. It wasn’t unreasonable to support them.

                      Now that C has threading and atomics and is ballooning in scope it’s probably not implementable on those computers anyway, and they can just stick with C89. Getting rid of that support today, now that those platforms are so insanely obscure, is probably fine. But it was not unreasonable to support them in 1989 or 1999.

                      1. 4

                        C++ and C are quite a bit closer together in this regard. z/OS and AIX character encoding issues show up frustratingly often in C++ standardization contexts.

                  3. 0

                    I wouldn’t expect C to guarantee UTF-8 for argv, if I’m understanding your comment correctly. C is a descriptive standard, and the world it’s describing is not universally UTF-8, even on the systems where it is most popular. Pathnames in Linux, for example, are not UTF-8, they are arbitrary null-terminated strings of bytes.

                    1. 2

                      Short note: path names in Rust are of the “OsStr/OsString” type, which “String” can be converted into, but the other way around is fallible.

                      Both have mappings to “Path/PathBuf”.
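
                      A sketch of those conversions: String into OsString is infallible, while the reverse returns a Result:

                      ```rust
                      use std::ffi::OsString;
                      use std::path::PathBuf;

                      fn main() {
                          // String -> OsString never fails...
                          let os: OsString = OsString::from("hello.txt");
                          // ...but OsString -> String is fallible.
                          let s: String = os.clone().into_string().expect("valid Unicode");
                          assert_eq!(s, "hello.txt");
                          // Both map into Path/PathBuf.
                          let path = PathBuf::from(os);
                          assert_eq!(path.extension().unwrap(), "txt");
                      }
                      ```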

                      1. 0

                        Yes, path names are of type OsStr, but command-line arguments as returned by std::env::args are Strings. That means your code works fine until it panics in production one day, because someone passed a non-UTF-8 path name and std::env::args tried to turn it into a String, even though you’re just going to turn it back into an OsStr anyway.

                      2. 2

                        My point is that, AFAICT, contrary to your up-thread comment, C11 provides no facilities that you can actually trust to do UTF-8 processing in a cross-platform way. (If you know the program will run on macOS or OpenBSD, then you can rely on certain functions performing UTF-8 operations.)

                        1. 1

                          I have not made any comments suggesting that C11 provides UTF-8 support.

                          C2x should have char8_t, however.

                          1. 3

                            Sorry, I meant xorhash’s comment upthread.

                  4. 3

                    This is a great article!

                    I loved both the C part (which did not surprise me at all) and the Rust part (where everything was new to me). I am left with the somewhat strange feeling that I actually prefer the C way of doing things. The “simple” Rust program does too many things under the hood and this makes me uncomfortable.

                    1. 8

                      You prefer the subtly broken implementation that half works on Latin-1 left-to-right text? Imagine this on Japanese or Korean, or in a right-to-left script like Arabic or Hebrew. You can still use OsString in Rust, or access the raw bytes if you need to, but why not use the safe version by default?

                      1. 1

                        The thing is that both implementations may be broken, but the C code is more explicit. When it fails, or it receives bad input, I can look at the code that I have in front of me and understand why it fails, and maybe how to solve it. Let us say that the Rust code receives a malformed argument string, where an invalid UTF-8 sequence contains a byte with the value ’ ’. What will happen? Will the iterator break? Or the upper-case conversion? Or will my program break before even the first line of my program? I always get this kind of anguish about the inevitable unpredictability when working in “high level” languages.

                        1. 11

                          Let us say that the rust code receives a malformed argument string, where an invalid utf sequence contains a byte with the value ’ ’. What will happen?

                          if I’m understanding your question correctly it’ll panic when it attempts to put it into a string, which is what the example in the article shows. What it won’t do is happily continue until it reads outside its buffer or causes some other error by corrupting memory.

                          When it fails, or it receives bad input, I can look at the code that I have in front of me and understand why it fails, and maybe how to solve it.

                          you may be able to, I don’t think I can. I don’t think I’m particularly inept, but this is incredibly subtle and complicated work that people get wrong all day every day

                          1. 8

                            The nice thing about Rust here is that “breaking” will usually consist of the program panicking and exiting without doing anything unexpected or dangerous. Most of the time, such panics are caused by unwrap or expect statements where you didn’t think an error case could happen, but it in fact did.

                            What seems to be very common with the C bugs, and almost, but not quite entirely, unheard of in Rust (and other safe languages), is memory-unsafety issues, including leaking, corruption, disclosure, and even RCE based on coding errors.
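
                            For instance, an out-of-bounds access in safe Rust is a deterministic panic rather than a silent overread (a minimal illustration):

                            ```rust
                            fn main() {
                                let v = vec![1, 2, 3];
                                // Indexing past the end panics; catch_unwind observes the
                                // panic instead of the program reading past the buffer.
                                let result = std::panic::catch_unwind(|| v[10]);
                                assert!(result.is_err());
                            }
                            ```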

                            1. 5

                              Let us say that the rust code receives a malformed argument string, where an invalid utf sequence contains a byte with the value ’ ’. What will happen?

                              That depends on how you retrieve the arguments. If you want the user to pass only valid Unicode, you’d retrieve the process arguments with std::env::args(), which will panic if any of the arguments are not valid Unicode. This enforces argument validity at the boundary of your program.

                              If arbitrary byte sequences make sense, too, you’d retrieve the process arguments with std::env::args_os().

                              If you want valid Unicode arguments, but want to handle malformed input gracefully rather than panic, you’d retrieve the process arguments with std::env::args_os(), try to convert each one to UTF-8 with OsStr::to_str, and handle success and failure of the conversion attempt.

                              use std::env;

                              for arg in env::args_os() {
                                  match arg.to_str() {
                                      Some(s) => println!("valid UTF-8: {}", s),
                                      None => println!("not valid UTF-8: {:?}", arg),
                                  }
                              }


                              All of this is set out more elaborately in the article, starting at Now for some Rust.

                            2. 1

                              C strings are null terminated arrays of bytes, as they are defined in the standard. Sometimes those bytes represent text, sometimes they do not. Some text is encoded in ASCII, some in UTF-8, and some in other encodings. Some Unicode encodings, like UTF-16, can’t even be stored in a C string, because they contain null bytes.

                              This isn’t “subtly broken”, it’s just a different level of abstraction to Rust.

                            3. 4

                              Understandable take. What kind of under-the-hood manipulation of raw strings do you imagine needing? (Rust does provide ready access to the bytes of a String as needed, just in case.)

                            4. 2

                              Thanks for those insightful anecdotes and comments about UTF-8, it’s very instructive and a pleasure to read.

                              1. 2

                                This post gave me goosebumps. What a gorgeous intro to this complex subject through easily grasped examples. Wow. If I used an RSS reader I’d subscribe immediately.

                                1. 2

                                  This article was an amazing read, and I’ve finally found out why \ is ¥ on Japanese versions of Windows! Early on in my career when I did desktop support and supported Japanese staff I frequently wondered why Windows displayed paths with the Yen symbol.

                                  1. [Comment from banned user removed]

                                    1. 16

                                      If this article’s problem is that it criticizes C unthinkingly, then surely, the numerous links to the various CVEs demonstrate that many people are using C unthinkingly. According to you, I suppose we should blame those people for using C unthinkingly and for missing “obviously incorrect” C code. It should be manifestly obvious that the entire point of Rust is to reject this way of thinking and instead improve on the tools to prevent the need to detect “obviously incorrect” code in the first place. That’s also one of the points this article made several times over, albeit with a thick layer of snark at pretty much exactly the attitude you are repeating here.

                                      You know what Rust doesn’t in fact prevent you from having to deal with? Proprietary libraries that do malicious things. Malicious functions that cast their inputs to non-const pointers and modify them are possible in malicious Rust libraries too. The solution is free software; it’s not using proprietary libraries you can’t inspect.

                                      Classic. Because Rust isn’t a panacea, we can’t compare its effectiveness at preventing recurring bugs with that of other tools? Never mind the fact that said malicious functions in Rust must be clearly demarcated somewhere with an unsafe annotation. So even then, Rust is going to make auditing said problems easier than other languages where the default is unsafe.

                                      1. 0

                                        If this article’s problem is that it criticizes C unthinkingly

                                        If the article had compared how easy it is to get code right using Rust’s strings and the ICU crate for Rust vs how easy it is to screw up when using ICU from C, that would be a fair comparison. In both cases you’re writing code that uses a comprehensive Unicode library to do some basic string munging, and I wouldn’t be even a bit surprised if the end result was that the Rust code was shorter, simpler, easier to write, easier to read and safer.

                                        If the article had compared actually implementing the Unicode support in Rust on top of raw bytes in a naive way vs doing the same in C, and shown how Rust helps you avoid the problems you encounter when you try to do things naively, that would have been a really valuable article showing how Rust helps you avoid problems in C code. I’d be interested to know how many uses of unsafe there are in the Unicode string handling bits of Rust’s standard library. How many places is it necessary for good performance? How many could you eliminate with only small performance loss (unnecessary bounds checking, for example).

                                        Instead the article compares implementing UTF-8 string handling in C with just using the built-in UTF-8 string handling in Rust, which doesn’t seem fair to me. It’s like saying Rust is a worse language than Python because Python has HTTP built in and the author’s implementation of HTTP in 10 lines in Rust has some serious issues. That’s what I’d expect! Different languages are different levels of batteries-included and in C’s case, the battery that’s included is arrays of bytes, anything else is up to libraries.

                                        So yes I do think that the criticism of C in the article is pretty unthinking.

                                        Classic. Because Rust isn’t a panacea we can’t compare its effectiveness at preventing bugs that keep reoccurring with other tools? Nevermind the fact that said malicious functions in Rust must be clearly demarcated somewhere with an unsafe annotation. So even then, Rust is going to make auditing said problems easier than other languages where the default is unsafe.

                                        If you’re linking to an arbitrary closed-source Rust library then you can’t tell where they wrote unsafe. The article, remember, says:

                                        This would pass unit tests. And if no one bothered to look at the len function itself - say, if it was in a third-party library, or worse, a proprietary third-party library, then it would be… interesting… to debug.

                                        Which is a good point: proprietary third-party C libraries can do anything, really. It’s not unreasonable to point out that there’s nothing stopping a proprietary third-party Rust library from doing anything either, including unsafe code that you can’t even find with grep, and it would be just as interesting to debug. But the whole actix-web situation also shows that the community attitude that unsafe should be treated with suspicion doesn’t hold for every Rust library author. Large amounts of unnecessary unsafe aren’t necessarily audited just because they can be audited, just like large amounts of C aren’t necessarily audited just because they can be.

                                        I didn’t say that Rust was bad, or that it shouldn’t be compared to other languages. I merely said that it’s unreasonable to compare intentionally-poorly-written-and-incomplete code in one language with the standard library of another regardless of language. If the whole point of the article is just ‘C isn’t memory safe, Rust is’ then good job, I guess? Probably doesn’t require that the article is quite so long though. But of course it isn’t. The point of the article is closer to something like ‘string handling in C is primitive compared to string handling in Rust using the Rust standard library and thus C is worse than Rust’.

                                        According to you, I suppose we should blame those people for using C unthinkingly and for missing “obviously incorrect” C code. It should be manifestly obvious that the entire point of Rust is to reject this way of thinking and instead improve on the tools to prevent the need to detect “obviously incorrect” code in the first place.

                                        But Rust doesn’t prevent you from writing incorrect code, does it? It prevents you from writing memory-unsafe code that isn’t explicitly marked as unsafe. You can implement UTF-8 string handling poorly in Rust, just like you can in C. The difference is that if you do so in Rust, and you don’t write unsafe in any of your code, then the result will be a panic or incorrect behaviour and not a buffer overflow.

                                        Is that a valuable property? Yes. Rust’s memory safety guarantees are valuable. I bolded that so it’s clear I’m not saying Rust is bad or useless because it isn’t a panacea, which seems to be how you’ve interpreted my comment.

                                        It’s also a well known property. A much more interesting property that Rust may or may not have (I do not know) is ‘helps write low-level string handling code correctly’. C does not help you do this correctly, as shown by the incorrect string handling code in the article written in C. But does Rust have this property? We don’t know, because the author doesn’t tell us. All it tells us is that someone’s written good string handling in Rust, and the well known fact that Rust gives memory safety guarantees. A lot of text for not much.

                                        That’s also one of the points this article made several times over, albeit with a thick layer of snark at pretty much exactly the attitude you are repeating here.

                                        I think it’s really very rude of you to ascribe my attitude as the tired old ‘C is fine just write correct code, thus Rust is pointless’. That’s not my view and not at all what I said or implied.

                                        1. 0

                                          Personally, I think you’ve completely missed the point of the article.

                                          But now that I realize you are milesrout, I’m just going to go back to ignoring you. It’s just not a productive use of my time to talk to you. I wouldn’t have responded to you originally if I had known who you were. But you had changed your username (and have since changed it back).

                                          1. [Comment from banned user removed]

                                            1. 1

                                              Perfect example of why I ignore you on this site. It was going well. I’ll just have to make sure I check user accounts now before responding to an unfamiliar username.

                                              1. 0

                                                It was not going well. Both comments you made are as if you didn’t even read the comments you were responding to.

                                                1. 0

                                                  Oh, no, it was. I have successfully avoided interacting with you for quite some time. There have been times where I’ve started writing a comment reply to you before seeing the username, after which I would delete what I wrote before responding. But you changed your username and I managed to get sucked into your bullshit. So I’ll be more careful in the future.

                                      2. 10

                                      Of course C doesn’t have UTF-8 strings built in. Why would it? There are a great many C programs running in the wild on systems with less memory than the size of the tables you need for full Unicode support.

                                      People have already addressed the other points. But this suggests that you can’t support UTF-8 without large Unicode tables, which is false. One can do many useful things with UTF-8 data and UTF-8 strings without Unicode tables, such as:

                                      • Validating that a byte stream/array is valid UTF-8, which only requires looking at the leading bits.
                                      • Determining code point boundaries, which again only requires looking at the leading bits. This enables many useful operations: counting the number of code points, indexing into code points, iterating over code points, etc.
                                      • Re-encoding to another Unicode encoding (UTF-16, UTF-32).
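
                                      For instance, counting code points needs only the leading bits of each byte (a sketch):

                                      ```rust
                                      /// Count code points by skipping UTF-8 continuation bytes,
                                      /// which all have the bit pattern 10xxxxxx.
                                      fn count_code_points(bytes: &[u8]) -> usize {
                                          bytes.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
                                      }

                                      fn main() {
                                          assert_eq!(count_code_points("héllo".as_bytes()), 5);
                                          assert_eq!(count_code_points("日本語".as_bytes()), 3);
                                      }
                                      ```
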
                                        1. 1

                                        I totally agree; those would be some really nice, very useful library functions to have standardised. I suspect they won’t be, just because they’re not especially difficult to implement correctly and portably already, compared to the effort required to standardise them, but it would be nice.

                                        2. 6

                                          The solution is free software, it’s not using proprietary libraries you can’t inspect.

                                          I think the evidence is clear at this point that free software (whether we mean free as in beer or free as in speech) in and of itself does nothing to help security or reliability. Witness Heartbleed: present in open-source software for years. The “bazaar” didn’t find it. Volunteer developers didn’t find it.

                                          Not to mention the constant stream of 0-days popping up and getting fixed in major browsers, all of which are open source, and basically none of which have security bugs found and fixed by community volunteers AFAIK.

                                          It seems clear that, based on the evidence, neither being open to a world full of possible inspectors nor having dozens of specialized testing tools and hundreds of highly paid, experienced engineers can prevent software written in C from exhibiting memory-unsafety errors. Rust seems to show much more potential so far.

                                          1. 6

                                            You know what Rust doesn’t in fact prevent you from having to deal with? Proprietary libraries that do malicious things. Malicious functions that cast their inputs to nonconst pointers and modify them are possible in malicious Rust libraries too. The solution is free software, it’s not using proprietary libraries you can’t inspect.

                                            Rust’s implementation is F/L/OSS, the most conveniently-accessible libraries in Rust (the ones on crates.io) are all source-available, and, since Rust has no stable ABI, the best way to provide a first-class Rust library experience is to make its source available. Obviously the Rust core team can’t make all the proprietary libraries not exist, but they’ve thoroughly optimized in favour of free software ones.

                                            Nobody’s going to ask you to choose between free software and Rust. The obvious best option is both.