1. 21
  1.  

  2. 6

    This is a pretty great write-up, though it’s important to note that “fixed-width character” is a nonsense idea in the Unicode world anyway. Even UCS-4 (UTF-32) is only fixed-width per codepoint, and a “character” is mostly a non-thing: a single codepoint certainly doesn’t always map to what a programmer or user might think they mean when they say “character”.
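
    For illustration, here’s the codepoint-vs-character gap in plain Python (standard library only; Python’s str is indexed per codepoint, which is essentially the UTF-32 view):

        import unicodedata

        precomposed = "\u00e9"    # é as one codepoint (LATIN SMALL LETTER E WITH ACUTE)
        combining   = "e\u0301"   # é as 'e' + COMBINING ACUTE ACCENT (two codepoints)

        print(precomposed == combining)          # False: different codepoint sequences
        print(len(precomposed), len(combining))  # 1 2: fixed width per codepoint, not per character

        # NFC normalization happens to merge this particular pair...
        print(unicodedata.normalize("NFC", combining) == precomposed)  # True
        # ...but plenty of sequences have no single-codepoint form at all.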

    1. 5

      Love it. Finally a write up that describes how we got to where we are now, along with a good explanation of Unicode jargon you might bump into.

      1. 4

        If you liked this you might like https://simonsapin.github.io/wtf-8/ - most recently discussed in https://lobste.rs/s/kcuhls/wtf_8_encoding

        1. 4

          One minor detail: the Unicode Consortium didn’t suddenly decide to encode all characters ever; they were approached by the ISO 10646 working group, which was also working on a universal character set, and they decided to join forces. That’s why everything gets published twice (online by the Consortium, on paper by ISO) and reviewed twice (technically by the Consortium, and politically by ISO). Also, there are occasionally differences between the two standards; I don’t know if it’s still the case, but at one point ISO UTF-8 was not capped at U+10FFFF, making it much more spacious than the Consortium version.

          1. 2

            Recently I’ve been thinking that a lot of these encoding problems could have been solved/prevented if there was some way of telling which character encoding a piece of text uses. For example, if the first byte (or two bytes) of every string was used to identify the encoding. e.g. 0x01 would be ASCII, 0x02 ISO-8859-1, etc.

            With this, there would be no need for encoding anything with backward-compatible hacks like UTF-8 or UTF-16: old character encodings would be marked as “ASCII” or “UCS-2” and keep working just fine. It would also make it easier to “use the right encoding for the job”.
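
            A rough sketch of what such a tag-prefixed string could look like (the tag numbers and helper names here are made up purely to illustrate the idea):

                ENCODING_TAGS = {      # hypothetical one-byte tags
                    0x01: "ascii",
                    0x02: "iso-8859-1",
                    0x03: "utf-16-le",
                }
                TAG_FOR = {name: tag for tag, name in ENCODING_TAGS.items()}

                def tag_encode(text: str, encoding: str) -> bytes:
                    """Prefix the encoded payload with a byte naming its encoding."""
                    return bytes([TAG_FOR[encoding]]) + text.encode(encoding)

                def tag_decode(data: bytes) -> str:
                    """Read the tag byte, then decode the rest accordingly."""
                    return data[1:].decode(ENCODING_TAGS[data[0]])

                blob = tag_encode("café", "iso-8859-1")
                print(blob[0], tag_decode(blob))   # 2 café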

            This would require updating a lot of tools and APIs; but not really more so than making everything “8 bit clean” or “unicode clean”.

            The currently popular solution is “UTF-8 everywhere”, but I’m left wondering if we’ll be reading “The Tragedy of UTF-8” in 20 years’ time…

            1. 3

              This breaks down as soon as you start concatenating strings: greek_text + cyrillic_text

              • If you insist on unifying both to a single encoding, there’s no codepage that handles both, so you have a half-garbage string.

              • If you preserve both codepage markers, now you have markers in the middle of a string. Processed text may end up with any number of markers anywhere, so you end up with a variable-width encoding that is more annoying than UTF-8 with none of its advantages.
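
              The “no codepage handles both” half of this is easy to demonstrate with nothing but the standard codecs (no tagging machinery needed):

                  greek = "Ελλάδα"      # fits in ISO-8859-7
                  cyrillic = "Россия"   # fits in ISO-8859-5

                  greek.encode("iso-8859-7")      # fine on its own
                  cyrillic.encode("iso-8859-5")   # fine on its own

                  mixed = greek + " / " + cyrillic
                  for codepage in ("iso-8859-7", "iso-8859-5"):
                      try:
                          mixed.encode(codepage)
                      except UnicodeEncodeError as exc:
                          # neither legacy codepage can hold the concatenated string
                          print(codepage, exc.reason)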

              1. 2

                This is exactly what some OSes do already. Conversions can be performed transparently when opening files, based on the encoding that’s requested versus the one the file is marked with. It does mean some complications when storing files, though the complications from munging strings together at runtime would probably be much the same as described above:

                                               Display Attributes                               
                                                                                                
                 Object . . . . . . :   /home/calvin/ccsid.c                                    
                                                                                                
                 Type . . . . . . . . . . . . . . . . . :   STMF                                
                                                                                                
                [...]
                                                                                                
                 Coded character set ID . . . . . . . . :   819                                 
                [...]                                                                   
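
                For illustration only, an application-level version of the same idea, assuming the file’s stored CCSID has already been looked up somehow (on IBM i the system does this for you; CCSID 819 is ISO 8859-1 and 1208 is UTF-8):

                    CCSID_TO_CODEC = {   # a tiny, illustrative slice of the real CCSID registry
                        819: "iso-8859-1",
                        1208: "utf-8",
                    }

                    def read_as_unicode(path: str, marked_ccsid: int) -> str:
                        """Decode a file using whatever encoding its stored CCSID names."""
                        with open(path, "rb") as f:
                            return f.read().decode(CCSID_TO_CODEC[marked_ccsid])

                    # e.g. read_as_unicode("/home/calvin/ccsid.c", 819)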
                
                1. 2

                  Windows already does this with a byte order mark; UNIX doesn’t, because it outright ruins a lot of utilities. Are those first bytes a byte order mark or just ordinary characters? etc. etc.

                  It’s been 20 years and UTF-8 has been the standard for those 20 years, except when it’s transcoded to UTF-16 for internal use. I don’t envision seeing the death of UTF-8 any time soon, even if I personally have some huge problems with certain aspects of it (like how emoji and flags are constructed, and how even transcoding to UTF-16 and normalizing it doesn’t save you from having to deal with multi-codepoint sequences that combine into a single visible character).
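
                  The ambiguity is easy to show with Python’s codecs: the same three leading bytes are a BOM, a codepoint, or plain content, depending on which decoder you assume:

                      data = b"\xef\xbb\xbfhello"   # a UTF-8 BOM followed by ASCII text

                      print(repr(data.decode("utf-8")))      # '\ufeffhello' -- BOM kept as a codepoint
                      print(repr(data.decode("utf-8-sig")))  # 'hello'       -- BOM stripped as metadata
                      print(repr(data.decode("latin-1")))    # 'ï»¿hello'    -- BOM misread as three characters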

                  1. 2

                    I mean… There are a lot of registries of encodings, and a lot of standards for attaching an identifier from those registries to a piece of text. There are enough, in fact, that we could reasonably benefit from a registry of registries… I’m not convinced this is a tractable problem.

                  2. 2

                    One theoretical Windows-based solution would be to add UTF-8 as a “legacy” multi-byte character set, since the Windows API is familiar with those and has functions to comprehend them. The problem is you’d have to use the non-wide APIs to do so, AFAIK.
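
                    Windows does in fact register UTF-8 as code page 65001, so the existing multi-byte conversion machinery already understands it; a Windows-only sketch via ctypes (kernel32’s MultiByteToWideChar, with no error handling):

                        import ctypes

                        CP_UTF8 = 65001                      # Windows code page number for UTF-8
                        kernel32 = ctypes.windll.kernel32    # Windows-only

                        raw = "naïve ☃".encode("utf-8")

                        # A first call with a NULL buffer asks how many UTF-16 code units are needed.
                        n = kernel32.MultiByteToWideChar(CP_UTF8, 0, raw, len(raw), None, 0)
                        buf = ctypes.create_unicode_buffer(n)
                        kernel32.MultiByteToWideChar(CP_UTF8, 0, raw, len(raw), buf, n)

                        print(buf[:n])   # back as a wide (UTF-16) string: 'naïve ☃'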

                    1. 2

                      It’s always bothered me that the original 16-bit Unicode was designed to be implementationally simple at the cost of inefficiency. At least from a Windows perspective, it first arrived in 1993 with an OS that needed 12 MB of RAM to function, which was a huge number at the time. What Windows did was export all APIs twice: an “A” version that takes an 8-bit ANSI string, and a “W” version that took a 16-bit Unicode string (later redefined as UCS-2). Internally the system uses UCS-2, in order to have fixed-size characters.

                      In that context, it would have been fairly straightforward to also have an “H” version that talks UCS-4, or even a “U” version that talks UTF-8, and to increase the internal representation to UCS-4. Given a choice, I’d far sooner write software with fixed-size characters, so UCS-4 is a fairly natural choice.

                      By being unwilling to accept 32-bit chars at the platform level, we’ve instead landed in a place where some parts are UCS-2 and some are UTF-16, while developers are trying to write in UTF-8, which they need to convert to a 16-bit char type to call any API (because the 8-bit API is not UTF-8), and then they spend a huge amount of time trying to find what didn’t encode correctly. There’s not much point using UCS-4 in an application, because that needs to be thunked down to UTF-16/UCS-2 to call any API, and the whole reason to use UCS-4 is to eliminate the ambiguity of not knowing whether a particular API (on a particular OS version/configuration) is really UTF-16 or UCS-2.

                      I’m still writing tons of UCS-2 code because it’s the thing that I know the platform will actually implement correctly. Anything past that is still YMMV.
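
                      A quick way to see the UCS-2/UTF-16 gap in question, in plain Python rather than any Windows API: any codepoint above U+FFFF needs a surrogate pair, so “16-bit character” stops meaning “one code unit”:

                          clef = "\N{MUSICAL SYMBOL G CLEF}"   # U+1D11E, outside the 16-bit BMP

                          utf16 = clef.encode("utf-16-le")
                          print(len(clef))         # 1 codepoint
                          print(len(utf16) // 2)   # 2 UTF-16 code units: a surrogate pair
                          print(utf16.hex())       # '34d81edd' -- d834 dd1e in little-endian byte order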

                      1. 2

                        We were back where we started; one character is no longer one character, as UCS-2 promised … What about the dream of a fixed character size?

                        OP seems to be suggesting that “one codepoint == one user-perceived character”, which is oversimplified.

                        https://begriffs.com/posts/2019-05-23-unicode-icu.html#what-is-a-character

                        Many graphemes are formed from a combining sequence of codepoints, so a fixed-width encoding doesn’t help much even when you manipulate strings as UTF-32. Any application which treats Unicode strings as more than an opaque byte stream must embrace the complexity and iterate over/manipulate them using functions from a mature library.
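
                        A concrete case, using only the standard library (a real grapheme-cluster iterator needs a library such as ICU or the third-party regex module, which is exactly the point about needing mature tooling):

                            import unicodedata

                            flag = "\U0001F1E8\U0001F1E6"   # regional indicators C + A, rendered as one Canadian flag
                            accented = "n\u0303"            # 'n' + COMBINING TILDE, rendered as ñ

                            for s in (flag, accented):
                                nfc = unicodedata.normalize("NFC", s)
                                print(len(s), len(nfc), len(s.encode("utf-32-le")) // 4)

                            # Prints "2 2 2" then "2 1 2": NFC collapses ñ but nothing collapses the
                            # flag, and even UTF-32's one-unit-per-codepoint view isn't one-per-grapheme.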

                        1. 2

                          Eh, the dream of a 16-bit character encoding was misguided from the start. You need something like 40,000 characters to represent all of Chinese script… but Japanese and (legacy) Korean use Chinese characters too, and theirs are not necessarily the same as the Chinese versions. Then you get into traditional vs. simplified Chinese. Unicode deals with this via the hack of Han unification, which afaik makes nobody happy anyway. So a 16-bit encoding would need to be variable-length anyway, and Unicode would have been better off realizing that from the beginning.