1. 5
    1. 5

      It’s a shame we went through that “65536 characters ought to be enough for anybody” era that gave us UTF-16, and that string APIs designed during that era are stuck with that model — not just JS but also Objective-C/Foundation’s NSString. IIRC Windows APIs also use it? It doubles the memory usage and still doesn’t save you from the complexities of multi-unit characters.

      (Do JS interpreters actually store strings as UTF-16, though? I know NSString will use 8-bit encodings when the string contains only ASCII. Of course that increases the time complexity of string operations.)
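
      For a concrete sense of the trade-off (a rough Python sketch, just to count units; in JS, .length would report the UTF-16 code unit count, 7, for the same string):

          # One non-BMP character is 1 code point, 2 UTF-16 code units, 4 UTF-8 bytes.
          s = "caf\u00e9 \U0001F600"              # "café 😀"
          print(len(s))                           # 6 code points
          print(len(s.encode("utf-16-le")) // 2)  # 7 UTF-16 code units (surrogate pair for the emoji)
          print(len(s.encode("utf-8")))           # 10 UTF-8 bytes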

      1. 4

        Not to mention this was the #1 motivation for, and cause of, the Python 2 -> 3 incompatibility, and now Python 3 is moving toward UTF-8 again!

        That was human-centuries of work!

        Lots of people didn’t understand why the reference implementation of https://www.oilshell.org/ is in Python 2. I ported it to Python 3, and then back to Python 2! Both ports were easy, but Python 2 is better for our UTF-8 centric model.

        (And to answer another question I get often: we’re moving away from reusing any part of Python code at all in the runtime. The dev tools and reference implementation are all Python, but the final product is C++.)


        So you could say that Windows OS encodings “infected” all of:

        1. Java and every JVM language
        2. JS, and every compile-to-JS language
        3. Python

        In contrast, newer languages like Go and Rust use a UTF-8 centric model, which is what Oil uses too.

        1. 2

          Not to mention this was the #1 motivation for, and cause of, the Python 2 -> 3 incompatibility

          Not exactly. Python went from byte strings with an unspecified encoding (and ASCII as the default for implicit conversions) toward separating bytes and human-readable strings into two types, and making the latter the default. There was a short period when that was represented as UCS-2 (which is UTF-16 without surrogates), but now it’s almost universally UCS-4, which is UTF-32 and doesn’t have the problem with surrogates.

          Also, they explicitly left the encoding of a string as an implementation detail. They can switch it to UTF-8 without affecting userland (theoretically). And I’m all for it, since UTF-8 everywhere has proved to be an easier and more efficient model over time.

          1. 3

            CPython since 3.3 uses the PEP 393 approach, though, where the internal representation of a string is the narrowest encoding, out of the set (latin-1, UCS-2, UCS-4), capable of handling the widest code point of that particular string. This ensures that operations at the C level are always fixed-width, and avoids issues like surrogates or other artifacts of variable-width encoding leaking up to the programmer (as used to happen in “narrow” builds of pre-3.3 Python).
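
            A rough way to observe this from Python (exact object-header sizes vary by CPython version; the per-character growth is the interesting part):

                import sys

                ascii_s  = "a" * 1000            # fits in Latin-1  -> 1 byte per character
                bmp_s    = "\u4e2d" * 1000       # BMP-only (CJK)   -> 2 bytes per character
                astral_s = "\U0001F600" * 1000   # non-BMP (emoji)  -> 4 bytes per character

                for s in (ascii_s, bmp_s, astral_s):
                    print(len(s), sys.getsizeof(s))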

          2. 1

            Yeah there were some ASCII defaults that were bad, but my point is that Python could be like Go or Rust with respect to Unicode, and nothing would be lost.

            The strongest argument for UTF-16 and the like was “Windows works that way”, but AFAICT Go and Rust have working solutions for that.

        2. 2

          I’d be curious to know what you mean by “Python 3 is moving toward UTF-8 again”. PEP 393 is the last major effort I’m aware of, and that’s not at all what it did. It’s true that Python now defaults to assuming the filesystem, standard streams and other locale-y bits use UTF-8 (PEP 540 and then PEP 686 made it an always-on mode), but that doesn’t affect how Python internally stores strings or the fact that the str abstraction is a sequence of code points.
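
          A small check of what the UTF-8 mode actually touches (run with -X utf8 or PYTHONUTF8=1 on versions where it isn’t yet the default):

              import sys

              # UTF-8 mode changes the default encodings at the boundaries...
              print(sys.flags.utf8_mode)          # 1 when the mode is on
              print(sys.getfilesystemencoding())  # 'utf-8'
              print(sys.stdout.encoding)          # 'utf-8'

              # ...but str itself remains a sequence of code points.
              s = "a\U0001F600b"
              print(len(s), s[1])                 # 3 😀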

          1. 1

            Basically I mean that with PEP 540 and 686 making the APIs UTF-8, they might as well have kept bytes as the internal representation and avoided a lot of complexity from PEP 393 (flexible representations for space optimization).

            The argument is: What Python applications need O(1) random code point access, as opposed to O(n)? UTF-16 and UTF-32 give you O(1) code point access, at the expense of space. UTF-8 gives you O(n) code point access.

            This is an honest question: I can’t think of any code that relies on it and is correct.

            That is, code points basically don’t mean anything to applications – they are for libraries like ICU, which are most naturally written in C or C++.

            I think every simple operation you can do on code points in Python has corner cases that are wrong, and if you want to do it correctly, you need Unicode tables for graphemes and combining code points and so forth.
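
            One such corner case, as a quick sketch:

                # "é" written as e + combining accent: two code points, one user-perceived character.
                s = "e\u0301"
                print(len(s))           # 2
                print(s[:1])            # slicing drops the accent
                print(s[::-1])          # reversing detaches the accent
                print(s == "\u00e9")    # False unless you normalize, e.g. unicodedata.normalize("NFC", s)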


            Saying it more simply via my other comment:

            My point is that Python could be like Go or Rust with respect to Unicode, and nothing would be lost.

            The strongest argument for UTF-16 and the like was “Windows works that way”, but AFAICT Go and Rust have working solutions for that.

            They could have used the Python 2 -> 3 switch to adopt that behavior, rather than the very complex current behavior, which is still not settled.

            1. 2

              I am of the opinion that if code points are considered too dangerous to be the atomic unit of a string type, then the solution can never be to introduce an even more dangerous lower-level atomic unit like code units or bytes; the only solution is to go higher and make a string a sequence of graphemes.

              But for all the times people in threads like this one have told me that code points are useless, I actually maintain a library which would be incorrect if it didn’t treat strings as sequences of code points and perform operations like checking what the code point is at a particular index in a string. These are actually really common and vital operations in a lot of domains of programming. They’re also not that hard to implement correctly; insisting otherwise because there are “corner cases that are wrong” is, to me, like saying that because there are corner cases to names and mailing addresses, it should be forbidden to work with them. Yes, there are corner cases. You should know what they are up-front and whether they’ll affect you. But very often you actually can manage quite well without needing to implement hundreds of pages of specs.

              And since there is no lower-level atomic unit that reduces the number of corner cases (going lower only introduces more ways to break things), and since code points are the atomic units of Unicode, I’m perfectly fine with Python’s approach. But for the record, I also believe it is and should be considered factually incorrect to refer to a sequence of UTF-8-encoded bytes as a “string”, and I believe UTF-8’s ASCII compatibility is a terrible design mistake which comes close to making the whole encoding fundamentally broken, so take that as you will.

              1. 1

                I looked at the library:

                https://github.com/ubernostrum/webcolors/blob/trunk/src/webcolors.py

                If the HTML is encoded in UTF-8, then all of this can be done easily with Go or Python 2’s string type. If you have to handle multiple encodings (and browsers obviously do), then I can see why the unicode type is more convenient.

                But it’s not like you couldn’t do it in Python 2! It had a unicode type.

                1. 1

                  It’s not about the encoding – it’s about the fact that the relevant web standards require the ability to work with the inputs as sequences of code points in order to correctly implement them. “Do this if the code point at this index is U+0023” simply doesn’t work unless you have a way to get code points by index. And a huge variety of data parsing and validation routines for web apps – which have to accept all their data as initially stringly-typed, regardless of the language running on the backend – run into stuff like this. So dismissing it as some sort of odd/rare use case that doesn’t need or deserve efficiency makes no sense to me.
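
                  Something in the spirit of this hypothetical sketch (not the actual webcolors code):

                      def looks_like_hex_color(value: str) -> bool:
                          # Spec-style check: "if the code point at index 0 is U+0023 (#) ..."
                          if len(value) not in (4, 7):      # "#rgb" or "#rrggbb"
                              return False
                          if value[0] != "\u0023":          # '#'
                              return False
                          return all(c in "0123456789abcdefABCDEF" for c in value[1:])

                      print(looks_like_hex_color("#c0ffee"))   # True
                      print(looks_like_hex_color("c0ffee"))    # False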

                  And forcing everything to lower-level abstractions, as I mentioned, just introduces even more ways to mess things up. With a code-point abstraction, you can slice in the middle of a grapheme cluster; with a code-unit or byte abstraction you can slice in the middle of a code point. So you’re not gaining any correctness from it. And for Python’s case you’re not gaining on storage – the PEP 393 model can store many strings in less space than UTF-8 would.
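
                  For example (rough numbers; the exact object header varies by CPython version):

                      import sys
                      s = "\u4e2d\u6587" * 500        # 1000 CJK code points
                      print(sys.getsizeof(s))         # ~2000 bytes of character data under PEP 393
                      print(len(s.encode("utf-8")))   # 3000 bytes as UTF-8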

                  But it still mostly comes back to the fact that I don’t believe bytes should be considered strings and thus that Python 3 made the correct choice.

      2. 2

        UTF-16 demonstrates the dangers of design-by-committee and the dangers of publishing bad designs.

        UTF-16 needed an experienced, respected industry veteran to say “No” often, frequently, and consistently. (The fact that sorting by UTF-16 code units doesn’t preserve Unicode code point order is mind-boggling. Had the engineers who signed off on that not had their coffee that day?)

        Unfortunately, the “No”s came too late, so UTF-8 was needed, and then quickly took over the web.

        I wish the JavaScript community and others had a stronger ethos of being careful with foundational design decisions, and of fixing foundational issues once they are acknowledged instead of living with them forever, which only results in more systems being built tightly coupled to bad designs.

      3. 2

        Yeah, I think Windows still uses screwed-up UTF-16: https://simonsapin.github.io/wtf-8/
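
        The lone surrogates that WTF-8 exists to carry are exactly what strict UTF-8 refuses; a quick Python illustration:

            s = "\ud800"                                     # a lone surrogate, representable in a JS/Windows-style string
            try:
                s.encode("utf-8")
            except UnicodeEncodeError as e:
                print("strict UTF-8 rejects it:", e.reason)  # 'surrogates not allowed'
            print(s.encode("utf-8", "surrogatepass"))        # b'\xed\xa0\x80', the WTF-8-style byte sequence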

        V8 used to use multiple string representations in different parts of the code, so I would guess they do not use UTF-16 internally for many things, but I cannot claim I know this.

        Here’s one reference, almost a decade old now, but I haven’t re-read it. https://blog.mozilla.org/javascript/2014/07/21/slimmer-and-faster-javascript-strings-in-firefox/

      4. 1

        also Objective-C/Foundation’s NSString

        This, in turn, was the inspiration for the API of Java’s String (many of the same people worked on both projects), which inspired C#’s String. It’s very rare to find a high-level language that doesn’t think that 16 bits is sufficient for a ‘character’ (whatever it thinks that means). I believe Swift does this right, with the advantage of learning from the pain of NSString.

        1. 2

          Go does a decent job too, in its typically-minimalist way: a string is basically a distinct type of byte array, but there are APIs to access “runes”, i.e. codepoints.

          I sometimes suspect emoji were secretly invented by the Unicode consortium as a plot to get Western users to start using characters outside the BMP, so programmers would finally have to make their code support them properly.

      5. 1

        Good point. I am wondering if there were other strong use cases for the so-called “astral” characters besides emoji? They do have several sets that aren’t emoji. However, many of the additional planes are completely or partially unassigned.

        As for storage: from my understanding, although the JavaScript standard describes string elements as “UTF-16 code units”, and that is how they are exposed to JavaScript developers, it does not mandate how those strings are implemented internally. I believe most engines do in fact store them differently.

        1. 2

          Even if there’s no existing use case today, we might need them in the distant future. I think it’s good that the design has left itself room to grow without imposing an efficiency cost.

          1. 2

            Great point too. Private use areas (U+E000 to U+F8FF, and planes 15/16) are also good ideas in that direction.

        2. 1

          I am wondering if there were other strong use cases for the so-called “astral” characters besides emoji?

          CJK scripts?

    2. 1

      ITYM “how long is a piece of string?”