1. 7
  1.  

  2. 3

    It’s much worse than that in some languages. Swift, for example, defines the length of a string in terms of grapheme clusters, so the length of the same string can change between releases of your Swift compiler as the bundled Unicode segmentation rules are updated.
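
    A quick sketch of what that looks like in practice (the exact counts depend on the Unicode tables your toolchain ships; IIRC older Swift releases reported the grapheme count of the family emoji below as 4 rather than 1):

    ```swift
    // One "family" emoji: four code points joined by ZERO WIDTH JOINERs.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

    print(family.count)                 // 1   grapheme clusters (what Swift calls the length)
    print(family.unicodeScalars.count)  // 7   code points (4 emoji + 3 ZWJ)
    print(family.utf16.count)           // 11  UTF-16 code units
    print(family.utf8.count)            // 25  UTF-8 bytes
    ```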

    1. 4

      Unfortunately, grapheme clusters are the closest thing there is to a “character”, so any other option (except relying on the ICU library in your OS instead) would yield even more nonsensical results.
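
      For instance (a small Swift sketch; the string below deliberately uses a combining accent):

      ```swift
      // "noël" written as n, o, e, U+0308 COMBINING DIAERESIS, l
      let noel = "noe\u{0308}l"

      print(noel.count)                 // 4  grapheme clusters: what a reader would call characters
      print(noel.unicodeScalars.count)  // 5  code points: the accent counts separately
      print(noel.utf8.count)            // 6  UTF-8 bytes
      ```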

      1. 1

        I haven’t yet seen a situation where I actually want to know how many grapheme clusters are in a string. When would this be useful? Maybe if you’re implementing a text rendering engine?

        1. 1

          Not grapheme clusters; characters.

          1. 1

            Ok, but same question. Even if there were a reliable way to count the number of “characters” in a string of UTF-8, I’ve never encountered the need to do that. Maybe it just doesn’t come up in the kind of code I work on?

            Even if you’re presenting a string to the user, don’t we 99.9% of the time just pass the UTF-8 bytes to a rendering library and let it do the rest?

            On the other hand, I very consistently need the length of the string in bytes for low-level, memory-management type work.

            1. 2

              If you write web apps, everything you receive from an HTTP request is effectively stringly-typed simply because of the nature of HTTP. And many of the data types you will convert to as you parse and validate the user’s input will involve rules about length, about what can occur in certain positions, etc., which will not be defined in terms of UTF-8 bytes. They will be defined in terms of “characters”. Which, sure, is a bit ambiguous once you translate it into Unicode terminology, but almost always means either code points or graphemes.

              And you can protest until you’re blue in the face that the person who specified the requirements is just wrong and shouldn’t ask for this, but the fact is that the people specifying those requirements sign your paychecks and have the power to fire you, and a large part of your job as a web app developer is to translate their fuzzy human-language specifications into something the computer can do. Which, again, almost never involves replacing their terms with “UTF-8 bytes”.

              Amusingly this has become a problem for browsers, because the web specs are still written by people who think at least partly like you do, so there’s a significant disconnect between things like client-side validation that sets a max length on a text input (which the specs tend to define as either a byte limit or a UTF-16 code-unit limit) and the expectations of actual humans (who use languages that don’t cleanly map one “character” to one byte and/or one UTF-16 code unit).
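
              To sketch that disconnect (the function name and the limit of 8 are made up for illustration):

              ```swift
              // A "max 8 characters" rule, read the way a human would: grapheme clusters.
              func fitsLimit(_ input: String, maxCharacters: Int) -> Bool {
                  return input.count <= maxCharacters
              }

              let input = "ma\u{00F1}ana \u{1F44B}"      // "mañana 👋"

              print(fitsLimit(input, maxCharacters: 8))  // true: 8 grapheme clusters
              print(input.utf16.count)                   // 9: what a UTF-16 code-unit maxlength sees
              print(input.utf8.count)                    // 12: what a byte limit sees
              ```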

              IIRC Twitter actually went for the more human definition of “character” when deciding how to handle all the world’s scripts within its (originally 140, now 280) “character” limit, which means users of, say, Latin script are at a disadvantage in terms of the byte length of the tweets they can post compared to, say, someone posting in Chinese script.

              1. 2

                I think the Twitter example is an interesting one that I hadn’t thought of. Limits on lengths are usually driven by denial-of-service concerns, but not for Twitter: there it really is supposed to be a limit on some abstract human idea of “characters”. I don’t envy having to figure that one out.

      1. 3

        Except that… yeah, it is wrong.

        If you’re designing a programming language, “code units of the in-memory representation used for Unicode” is probably the least useful/practical way to expose Unicode to the programmer.

        You can make an argument for code points. You can make an argument for graphemes. You can make an argument for raw bytes. But the choice in some languages to go with “a string is a sequence of UTF-16 code units” just doesn’t hold up anymore. It gets the complexity of byte-based representations (most importantly, the potential to accidentally break up a code point that spans multiple code units, comparable to splitting a multi-byte sequence in a byte-based string) without gaining any of the low-level advantages you’d have from just exposing the actual byte sequence. And it so often looks close enough to a sequence of code points that it lulls the programmer into a false sense of security and leads to hidden bugs that wouldn’t have occurred if it really were code points.
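
        To illustrate that splitting hazard, here’s a sketch using Swift’s UTF-16 view to stand in for a language whose strings really are sequences of UTF-16 code units:

        ```swift
        // "abc" plus one emoji; the emoji is a single code point but
        // two UTF-16 code units (a surrogate pair).
        let s = "abc\u{1F600}"
        let units = Array(s.utf16)      // [97, 98, 99, 0xD83D, 0xDE00]

        // "Truncate to 4 characters" done at the code-unit level splits the pair,
        // and the orphaned high surrogate decodes as U+FFFD REPLACEMENT CHARACTER.
        let truncated = String(decoding: units.prefix(4), as: UTF16.self)
        print(truncated)                // "abc\u{FFFD}"
        ```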

        (and yes, I know for some languages it’s a historical accident from back when people thought 16 bits would be enough, but today we know better, and I can’t see any justification, either now or at the time that article was written, for claiming that a code-unit-based approach is OK)