1. 15

Or: UTF16 handling of astral planes and implications for JavaScript string indexing

  2. 14

    Pretty good article overall, but a few things to fix:

    • Please spell UTF-8 and UTF-16 consistently with a hyphen.
    • Each plane has 65536 code points, not 65535.
    • Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec<char> which is in UTF-32.
    • CJK characters are 3 bytes long in UTF-8, not 2 bytes as you stated. This is because the 2-byte limit is U+07FF, and all CJK characters are above U+3000.
    • You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.
    • Python 2 is a can of worms (its behavior depends on compile-time flags), but Python 3 is not. In Python 3, all strings behave as if they are in UTF-32 (i.e. direct access to every code point). Internally, though, CPython stores each string with 1, 2, or 4 bytes per code point (Latin-1, UCS-2, or UCS-4), depending on the widest code point in that string (e.g. a pure ASCII string takes 1 byte per character and still allows fast O(1) indexing).
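
    The size and length claims in the list above can be checked with a short Python 3 sketch. The helper names (utf8_len, utf16_units) and the sample code points are chosen here for illustration and are not from the article.

```python
# Each Unicode plane covers offsets 0x0000..0xFFFF, i.e. 0x10000 code points.
PLANE_SIZE = 0x10000


def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    # "utf-16-le" omits the byte order mark; every code unit is 2 bytes.
    return len(ch.encode("utf-16-le")) // 2


samples = {
    "U+07FF (last 2-byte code point)": "\u07ff",
    "U+0800 (first 3-byte code point)": "\u0800",
    "U+4E2D (a CJK ideograph)": "\u4e2d",
    "U+1F600 (an astral-plane emoji)": "\U0001f600",
}

print("plane size:", PLANE_SIZE)
for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} UTF-8 bytes, {utf16_units(ch)} UTF-16 units")

# Python 3 indexes by code point, so the emoji counts as one element.
text = "a\u4e2d\U0001f600"
print("len(text):", len(text))
```

    The emoji needing two UTF-16 code units is the surrogate pair case, which is what makes UTF-16 variable-length.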
    1. 6

      Rust strings do not support rune indexing; they only support byte indexing. However, the str.chars() method returns an iterator that yields each rune sequentially. Moreover, this iterator can be collected into a Vec<char> which is in UTF-32.

      String actually doesn’t support byte indexing either; it only has range indexing, which gives you a slice of the string. You can use the as_bytes method to get a slice of bytes instead.

      1. 3

        This is correct, and I think it’s a really important point about Rust string indexing. I’ll modify the post to reflect this.

      2. 3

        You mention that UTF-8 is a variable-length encoding by design, but fail to acknowledge that UTF-16 is variable-length too.

        I don’t understand this criticism. The entire article is about the special case where UTF-16 is variable-length: the case where you need surrogate pairs to represent a single Unicode code point.

        1. 2

          Thanks for the feedback! I’m really glad the article initiated some interesting comments here. There are a few points which need clarifying and correcting which I’ll do in an update.

          You’re right about Rust string indexing requiring the .chars() iterator. I think this is a very important point, so I’ll edit the post to explain it.

          I tried to point out how UTF-16 isn’t fixed length but I definitely think it could be made more clear.

        2. 4

          wchar_t is likely to be 32 bits wide. In fact, the standard requires that it be able to represent every character in the largest supported locale.
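
          One way to see what a given platform does is to ask Python's ctypes module, which mirrors the C compiler's wchar_t. This is only a probe of the local platform, not a statement about the standard; the name WCHAR_BYTES is made up for this sketch.

```python
import ctypes

# ctypes.c_wchar corresponds to the platform's C wchar_t.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)

print(f"wchar_t is {WCHAR_BYTES * 8} bits wide on this platform")
```

          Typical results are 32 bits on Linux and macOS and 16 bits on Windows.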

          1. 4

            C - natively uses 8-bit char able to handle only the ASCII character set. C99 introduced wchar_t meaning a 16-bit wide characters able to handle all Unicode via the surrogate pair method used by JavaScript (just don’t try to implement this yourself unless you are really really into this kind of thing).

            Wrong. C strings have an implementation-defined encoding. Beyond the requirement that they be null-terminated, portable C programs can’t assume that strings are ASCII.

            In common POSIX-y configurations, like Ubuntu and macOS, C strings are UTF-8. Go ahead and make an emoji filename on your Linux box; it’ll work just fine (and be described as having a length of 4 by wc -c).

            Linux and Windows even allow you to configure them to use alternate codepages instead of ASCII. Configure HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP on Windows, or the LC_ALL and LANG environment variables on Linux, to choose what encoding C strings will use.

            Let’s also, for sake of argument, pretend that there aren’t C implementations that use EBCDIC. They don’t support Unicode anyway, so it doesn’t matter.
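
            To make the "bytes plus an externally chosen encoding" point concrete, here is a small Python sketch that encodes the same text under several codecs, including cp037 (one EBCDIC code page that Python ships). The helper name encode_or_none is invented for illustration.

```python
def encode_or_none(text, codec):
    """Return text encoded with codec, or None if the codec cannot represent it."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


text = "A\u00e9"  # 'A' followed by e-acute

for codec in ("ascii", "latin-1", "utf-8", "cp037"):
    encoded = encode_or_none(text, codec)
    if encoded is None:
        print(f"{codec}: cannot represent this text")
    else:
        print(f"{codec}: {len(encoded)} bytes, hex {encoded.hex()}")

# An emoji takes 4 bytes in UTF-8, which is the count a byte-oriented tool reports.
emoji_bytes = encode_or_none("\U0001f600", "utf-8")
print("emoji in UTF-8:", len(emoji_bytes), "bytes")
```

            The same two characters come out as different byte sequences (or fail outright) depending on the codec, which is why a C program cannot infer the encoding from the bytes alone.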

            1. 7

              Let’s also, for sake of argument, pretend that there aren’t C implementations that use EBCDIC. They don’t support Unicode anyway, so it doesn’t matter.

              They actually do…

              1. 4

                Indeed they do. I used them to implement Swift on z systems!

                1. 1

                  Okay, so they do support it. Now I know…

              2. 2

                Thanks for the feedback, I’ll post an update correcting this.

                That makes sense as C standards very rarely define how things should be implemented in any way, shape or form, even when a particular implementation is ubiquitous. This makes it easy to fall into the trap of assuming that something is part of the spec when actually it’s not (e.g. char is 8 bits when spec says at least 8 bits, long int is 32 bits long when it’s at least 32 bits, and so on).

              3. 2

                Ya, JavaScript strings being specified as UCS-2 is extremely frustrating. But I think it’s disingenuous to call it UTF-16; if it were, this whole surrogate pair business would be transparent to the user.

                Thankfully, String.prototype.codePointAt together with String.fromCodePoint can be used to iterate over a string correctly, annoying as it is.
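
                For readers who want to see what code-point-aware iteration has to do under the hood, here is a rough sketch in Python (not JavaScript) that walks a list of UTF-16 code units and joins surrogate pairs. The function names are hypothetical, and the handling of lone surrogates (passed through unchanged) is a choice made for this sketch.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def code_points(units):
    """Yield code points from UTF-16 code units, joining surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the two 10-bit halves and add the astral-plane offset.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit, or a lone surrogate passed through unchanged.
            yield unit
            i += 1


units = utf16_code_units("a\U0001f600")
print([hex(u) for u in units])
print([hex(cp) for cp in code_points(units)])
```

                Indexing by code unit sees three elements here, while iterating by code point sees two.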

                1. 2

                  If you can, could you change the footnotes from simple * to links with back links?

                  1. 2

                    Good idea. I’ve got some clarifications and corrections to make following on from some of the interesting comments here so I’ll make the footnotes a bit more useable.