1. 16

  2. 9

    Fixed size encoding lives on in the usually-hilariously-impractical UTF-32

    I know the prologue touches on this, but since a single codepoint is almost never the thing you want when manipulating Unicode text, even UTF-32 isn’t meaningfully fixed size.

    1. 7

      With full hindsight, it’s easy to say that the Unicode consortium’s original scope - only encoding characters in current use - was doomed from the beginning:

      • As the study of historical scripts becomes digitized, every historical character will become a character in current use.
      • And once Unicode succeeds in superseding every other character encoding, nobody will want to go back to creating special character sets and encodings for historical scripts.

      In a way, Unicode became a victim of its own success - if it hadn’t actually become the character set for every character, there wouldn’t be a push for it to encode every character…

      1. 4

        Not to mention the need to invent new codepoints for every piece of fruit discovered

        1. 6

          Do you mean emojis? A lot of early emojis actually originate not from Apple, but from Japanese carriers - see here. If Unicode rejected those, the Japanese carriers would continue to use their own encodings and never adopt Unicode, which is not something the Unicode consortium wished to happen.

          Now that Unicode has become the universal character set, they aren’t really in a good position to be picky. Apple doesn’t need to convince the Unicode consortium that a new emoji is justified; they just need to threaten to use their private encoding if Unicode doesn’t include it. The dynamic is somewhat comparable to web standards.

          1. 3

            A lot of early emojis actually originate not from Apple,

            Do any originate from Apple? I thought it was a bit more distributed than that

      2. 3

        The major JS engines also do the latin-1 optimization, partly for space, but also for performance.

        1. 2

          Python as of 3.3 does something similar: all strings in Python have fixed-width storage in memory, because the choice of how to encode a string in memory is done per-string-object, and can choose between latin-1, UCS-2, or UCS-4.
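
          A minimal sketch of how to observe this from Python itself (CPython-specific; this is the PEP 393 “flexible string representation”):

          ```python
          import sys

          # Bytes of storage per character, measured by growing the string.
          # (Relies on CPython's per-string fixed-width storage from PEP 393.)
          def bytes_per_char(ch):
              return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) // 1000

          print(bytes_per_char("a"))           # 1 (latin-1 storage)
          print(bytes_per_char("\u20ac"))      # 2 (UCS-2 storage)
          print(bytes_per_char("\U0001F600"))  # 4 (UCS-4 storage)
          ```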

          1. 7

            Before 3.3, Python had to be compiled for either UCS-2 or UCS-4, leading to hilarious “it works on my machine” bugs.

            And let’s not forget MySQL, which has a utf8 encoding that somehow only understands the basic multilingual plane, and utf8mb4, which is real utf-8.
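
            A quick way to see why the 3-byte cap matters, sketched in Python (the byte counts are plain UTF-8 facts, not MySQL-specific):

            ```python
            # MySQL's legacy "utf8" (utf8mb3) stores at most 3 bytes per character,
            # but UTF-8 needs 4 bytes for anything outside the Basic Multilingual Plane.
            print(len("\uFFFF".encode("utf-8")))      # 3 - last BMP codepoint, fits
            print(len("\U0001F600".encode("utf-8")))  # 4 - emoji, rejected by utf8mb3
            ```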

            1. 7

              And let’s not forget MySQL, which has a utf8 encoding that somehow only understands the basic multilingual plane, and utf8mb4, which is real utf-8.

              The more I hear about MySQL the more scared I get. Why is anyone using it still?

              1. 5

                Because once upon a time it was easier than PostgreSQL to get started with, and faster in its default, hilariously bad configuration (you could configure it not to be hilariously bad, but then its performance was worse).

                And then folks just continued using it, because it was the thing they used.

                I still cringe when I see a project which supports MySQL, or worse only MySQL, but it is a mostly decent database today, if you know what you are doing and how to avoid its pitfalls.

                1. 1

                  I still cringe when I see a project which supports MySQL, or worse only MySQL, but it is a mostly decent database today, if you know what you are doing and how to avoid its pitfalls.

                  I’ve probably only heard of MySQL’s warts and footguns, and little of its merits. On the other hand, I’ve self-hosted WordPress for a great number of years, so It Has Worked On My Machine(tm).

                2. 4

                  Because you’re only hearing about the warts: it’s legacy, now-deprecated stuff they didn’t change, for the sake of all the people who don’t want their systems broken. Otherwise it works perfectly fine.

                  Edit: You could probably ask the same about Windows, looking at WTF-8

                  1. 2

                    Legacy and/or confusion

                    1. 1

                      I’m no fan of MySQL, but Postgres also has some awful warts. Today I found a query that took 14s with the plan the planner chose, or 0.2s if I turned off nested loop joins. There’s no proper way to hint the planner for just that join; I have to turn off nested loops for the whole query.

                    2. 3

                      Another thing about pre-3.3 Python is that “narrow” (UCS-2) builds broke the abstraction of str being a sequence of code points; instead it was a sequence of code units, and exposed raw surrogates to the programmer (the same way Java, JavaScript, and other UTF-16-based languages still commonly do).
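
                      A sketch of what that surrogate-pair view looks like, using modern Python only to compute the UTF-16 code units:

                      ```python
                      import struct

                      # One astral code point becomes two UTF-16 code units (a surrogate pair);
                      # that pair is what narrow builds, Java, and JavaScript expose as "characters".
                      s = "\U0001F600"
                      units = struct.unpack("<2H", s.encode("utf-16-le"))
                      print([hex(u) for u in units])  # ['0xd83d', '0xde00']
                      print(len(s))  # 1: modern CPython counts code points, not code units
                      ```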

                      1. 2

                        basic multilingual

                        It still allows up to 3 bytes per character, making for even more fun: at first it looks like it’s working.

                      2. 2

                        Interesting. Why did they not choose UTF-8 instead of latin-1?

                        1. 2

                          The idea is to only use it for strings that can be represented with one-byte characters, so UTF-8 doesn’t gain you anything there. In fact, UTF-8 can only represent the first 128 characters with one byte, whereas latin-1 obviously represents the full 256 characters in one byte (though whether CPython in particular still uses latin-1 for \u0080-\u00FF, I’m not sure - it’s a little more complicated internally due to compat with C extensions and such).

                          1. 2

                            Like the other commenter said: efficiency. UTF-8 (really, ASCII at that point) in the one-byte range only uses 7 bits and so can only encode the first 128 code points as one byte, while latin-1 uses the full 8 bits and can encode 256 code points as one byte, giving you a bigger range (for Western European scripts) of code points that can be represented in one-byte encoding.
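
                            The difference is easy to check in Python:

                            ```python
                            # latin-1 encodes all of U+0000..U+00FF in one byte;
                            # UTF-8 only manages one byte up to U+007F.
                            ch = "\u00e9"  # 'é', codepoint 0xE9
                            print(ch.encode("latin-1"))  # b'\xe9'     - one byte
                            print(ch.encode("utf-8"))    # b'\xc3\xa9' - two bytes
                            ```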

                            1. 1

                              Because it’s easier to make UTF-16 from latin-1 than from UTF-8. Latin-1 maps 1:1 to the first 256 codepoints, so you just insert a zero every other byte. UTF-8 requires bit-twiddling.

                              And these engines can’t just use UTF-8 for everything, because constant-time indexing into UTF-16 code units (and surrogates) has been accidentally exposed in public APIs.
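
                              A sketch of that widening trick (a hypothetical helper for illustration, not any engine’s actual code):

                              ```python
                              # latin-1 bytes map 1:1 to codepoints U+0000..U+00FF, so widening to
                              # UTF-16LE code units is just interleaving a zero high byte.
                              def latin1_to_utf16le(data: bytes) -> bytes:
                                  out = bytearray()
                                  for b in data:
                                      out += bytes((b, 0))  # low byte, then a zero high byte
                                  return bytes(out)

                              s = "héllo"
                              assert latin1_to_utf16le(s.encode("latin-1")) == s.encode("utf-16-le")
                              ```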

                        2. 3

                          JavaScript inherits Java’s UTF-16 conundrum since it faithfully steal- er, borrows many things from Java.

                          Hm, this gives me pause. I am not a PL historian, but it seems to me that JavaScript is mostly independent from Java in all but the name. Can someone weigh in on the general topic of JS borrowing things from Java, and on UTF-16 specifically?

                          1. 4

                            I think JavaScript’s “Date” class is usually “blamed on” Java’s design… however, I don’t know all the detailed history.

                            I think there was more influence than just the name. The syntax was certainly influenced by Java.

                            I believe Brendan Eich said he was handed a bunch of directives after Netscape and Sun executives like Bill Joy strategized about the platform. Back in those days, executives very much cared about programming languages.

                            There are more details in this paper / video:

                            http://www.oilshell.org/blog/2021/12/backlog-project.html#three-analogies-dont-break-x

                            1. 3

                              As I remember it, the original idea was to use a Scheme-like language for scripting the web, but the higher-ups at Netscape wanted to ride the popularity/hype coattails of Java, both in terms of the name and the familiarity of the syntax. The likeness is obviously only skin-deep, as JS is a dynamic language, but you can also see the heritage in the naming of many of the original core builtin object types and their method names: famously Date, but also methods on strings, arrays, etc.

                              As for UTF-16, both JavaScript and Java predate it; as far as I’m aware, they were originally intended to use UCS-2 - just like pretty much any new platform development at the time - which I guess is why they don’t treat code points as the building blocks of strings but rather leave it to the programmer to deal with surrogate pairs.

                              1. 1

                                Not sure whether it took UTF-16 from Java or from Windows. Could be either, really.

                              2. 4

                                Not mentioned: Han unification. Take a gander.