Windows, OpenStep, and Java all decided to adopt unicode back when it fitted into 16-bit integers. They then all needed to switch to UTF-16 to keep backwards compatibility. The Java char and the OpenStep (Cocoa) unichar are 16 bits and this can’t be fixed without an ABI break.
In OpenStep, it’s less bad because, although the core primitives on NSString expose UTF-16 code units at the lowest-level APIs, many of the higher-level functions operate on strings and so can be more efficient if the source and destination are the same encoding.
The wchar_t thing predates unicode. Various non-unicode character sets such as Big5 could all be represented in 16 bits (the problem for unicode is that they can’t all be represented in 16 bits). This included some awful functions such as mbtowc in C89 that allowed you to convert between fixed-length encodings (such as ASCII, Big5) and variable-length ones (such as Shift JIS).
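To make that concrete, here is a minimal sketch of the C89 flavour of these conversions (assuming the locale's multibyte encoding matches the source file's encoding; the value that ends up in the wchar_t is implementation-defined):

    /* Sketch: mbtowc() converts one locale-dependent multibyte sequence into a
     * single wchar_t. The numeric value stored in the wchar_t is entirely up to
     * the implementation unless __STDC_ISO_10646__ is defined. */
    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        setlocale(LC_ALL, "");                   /* use the environment's encoding */

        const char *mb = "あ";                   /* assumes source charset == locale charset */
        wchar_t wc;
        int len = mbtowc(&wc, mb, MB_CUR_MAX);   /* bytes consumed, or -1 on error */
        if (len > 0)
            printf("consumed %d byte(s), wide value 0x%lX\n", len, (unsigned long)wc);
        else
            printf("not a valid multibyte character in this locale\n");
        return 0;
    }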
Early attempts to retrofit unicode onto C used wchar_t and overloaded these functions, but then struggled when unicode needed more than 16 bits. The fun solution was to make the width of wchar_t implementation-defined and add the __STDC_ISO_10646__ macro to advertise that wchar_t can support everything in the unicode character set (if I remember correctly, this is defined as the date of the unicode character set, so if the emoji insanity continues we can bump it to being a 64-bit type in future systems). Unfortunately, that was an ABI break on Windows, and so Windows retains a 16-bit wchar_t. Most of the Windows system APIs take UTF-16, though increasingly there are variants that take UTF-8 instead.
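A quick probe of what a given toolchain actually did with wchar_t might look like this sketch:

    /* Sketch: report the width of wchar_t and whether the implementation
     * promises full ISO 10646 coverage via __STDC_ISO_10646__ (which expands
     * to a yyyymmL date when defined). */
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    #ifdef __STDC_ISO_10646__
        printf("__STDC_ISO_10646__ = %ld (wchar_t holds any ISO 10646 character)\n",
               (long)__STDC_ISO_10646__);
    #else
        printf("__STDC_ISO_10646__ not defined (e.g. the 16-bit wchar_t on Windows)\n");
    #endif
        return 0;
    }

On a typical glibc-based Linux this reports 4 bytes with the macro defined; MSVC reports 2 bytes and leaves it undefined.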
There are some benefits to UTF-16. Most notably, CJK characters usually fit in a single UTF-16 code unit, whereas they typically require three UTF-8 code units, so are 50% larger. This can have a big impact on cache usage if you’re processing a lot of text. If you’re mostly processing European languages, the converse applies (as I recall, even things like Greek, Hebrew, and Arabic typically require the same number of bytes for letters in UTF-8 and UTF-16, but win in UTF-8 because spaces and punctuation characters are smaller). And this is why it’s a good idea to make your string type independent of the storage representation. Unfortunately, this is not possible for C, because C doesn’t do abstractions over data types.
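A quick sketch of that size difference, assuming a C11 compiler (the particular string is just an example):

    /* Sketch: the same three-character CJK string as UTF-8 vs UTF-16 code units.
     * Sizes exclude the terminating null. */
    #include <stdio.h>
    #include <uchar.h>

    int main(void) {
        static const char     utf8[]  = u8"日本語";
        static const char16_t utf16[] = u"日本語";

        printf("UTF-8 : %zu bytes\n", sizeof utf8  - 1);                  /* 9 bytes */
        printf("UTF-16: %zu bytes\n", sizeof utf16 - sizeof(char16_t));   /* 6 bytes */
        return 0;
    }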
so if the emoji insanity continues we can bump it to being a 64-bit type in future systems
The reason Unicode has the very weird 21-bit width (well, 0x10FFFF) is that it's the most you can address with a UTF-16 surrogate pair.
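A minimal sketch of the arithmetic: the high surrogate carries 10 payload bits and the low surrogate another 10, so on top of the 0x10000 directly encodable code points you get exactly up to 0x10FFFF:

    /* Sketch: encode a supplementary-plane code point as a UTF-16 surrogate pair. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo) {
        assert(cp >= 0x10000 && cp <= 0x10FFFF);
        cp -= 0x10000;                            /* 20 bits remain */
        *hi = 0xD800 | (uint16_t)(cp >> 10);      /* top 10 bits */
        *lo = 0xDC00 | (uint16_t)(cp & 0x3FF);    /* bottom 10 bits */
    }

    int main(void) {
        uint16_t hi, lo;
        to_surrogates(0x1F600, &hi, &lo);                           /* 😀 */
        printf("U+1F600  -> 0x%04X 0x%04X\n", (unsigned)hi, (unsigned)lo);  /* 0xD83D 0xDE00 */
        to_surrogates(0x10FFFF, &hi, &lo);                          /* the last code point */
        printf("U+10FFFF -> 0x%04X 0x%04X\n", (unsigned)hi, (unsigned)lo);  /* 0xDBFF 0xDFFF */
        return 0;
    }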
Before ISO 10646 and Unicode merged to form Unicode 2.0 in 1996, ISO 10646 was a 31 bit character set, to fit in a positive signed 32 bit int. This serializes quite neatly in 6 bytes of UTF-8, 1111_110x then 10xx_xxxx five times.
64 bits would be uncomfortably large for UTF-8: the unary length counter has to spill into the second byte. 63 bits fits in 12 bytes, 1111_1111 1111_0xxx then 10xx_xxxx ten times.
I fear to contemplate an emoji picker with 1e18 glyphs 😱

I'm now wondering if someone's packing three 21-bit Unicode characters into a 64-bit word. You even have a spare bit for an additional signal!

PS. I dunno how to tell the difference between an 8-byte or a 9-byte UTF-8 character 🤔
There are some benefits to UTF-16. Most notably, CJK characters usually fit in a single UTF-16 code unit
This isn’t so simple. When the text is mixed with markup, the cost of ASCII markup is doubled. For things like Web pages and JSON APIs, this makes UTF-8 often the better choice, even for CJK languages.
Another thing to consider is that logographic scripts are more information-dense per character, so in absolute terms they're not that disadvantaged by UTF-8 compared to English. There are Latin-script-based languages that are much worse off, in both UTF-8 and UTF-16, due to using non-ASCII letters:

https://hsivonen.fi/string-length/#counts_wrapper
From an engineering perspective, I'm not convinced that supporting multiple encodings is beneficial, especially for a common general-purpose string primitive:
There’s a cost of extra run-time checks and branches for supporting a mix of encodings.
You double your string processing code size (or more if you implement specialized functions for mixes of encodings).
More complex string functions are less likely to be inlineable.
Mixing of encodings has a high cost where different types need to be converted to something common. Instead of a naive memcpy/memcmp, you may need to perform parallel codepoint-wise iteration (see the sketch after this list). This affects low-level sorting and equality checks (like hashmaps/btreemaps). There are places where such bytewise shortcuts are inappropriate, but these are typically higher-level problems like Unicode normalization, locale-dependent collation, and grapheme clusters that will need special handling regardless of the encoding.
For really large amounts of text or offline storage, you’d save more by using a real compression algorithm, and the size difference between Unicode encodings largely disappears after compression.
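As a rough sketch of the parallel iteration mentioned above (assuming well-formed input and omitting all error handling), comparing a UTF-8 buffer with a UTF-16 buffer looks something like this rather than a single memcmp:

    /* Sketch: equality check between a UTF-8 buffer and a UTF-16 buffer.
     * With a single encoding this would be one memcmp; with two, both sides
     * must be decoded in parallel. Assumes well-formed input throughout. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Decode one code point from well-formed UTF-8; returns bytes consumed. */
    static size_t decode_utf8(const uint8_t *s, uint32_t *cp) {
        if (s[0] < 0x80) { *cp = s[0]; return 1; }
        if ((s[0] & 0xE0) == 0xC0) {
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0) {
            *cp = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6)
                | (s[2] & 0x3F);
            return 3;
        }
        *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
            | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }

    /* Decode one code point from well-formed UTF-16; returns code units consumed. */
    static size_t decode_utf16(const uint16_t *s, uint32_t *cp) {
        if (s[0] >= 0xD800 && s[0] <= 0xDBFF) {
            *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10) | (s[1] - 0xDC00));
            return 2;
        }
        *cp = s[0];
        return 1;
    }

    static bool equal_utf8_utf16(const uint8_t *u8, size_t n8,
                                 const uint16_t *u16, size_t n16) {
        size_t i = 0, j = 0;
        while (i < n8 && j < n16) {
            uint32_t a, b;
            i += decode_utf8(u8 + i, &a);
            j += decode_utf16(u16 + j, &b);
            if (a != b) return false;
        }
        return i == n8 && j == n16;
    }

    int main(void) {
        static const uint8_t  a[] = { 'h', 'i', 0xE2, 0x82, 0xAC };  /* "hi€" as UTF-8  */
        static const uint16_t b[] = { 'h', 'i', 0x20AC };            /* "hi€" as UTF-16 */
        puts(equal_utf8_utf16(a, sizeof a, b, 3) ? "equal" : "different");
        return 0;
    }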
One thing you missed is endianness issues. There are two kinds of UTF-16, little-endian (UTF-16LE) and big-endian (UTF-16BE), and a plain UTF-16 stream technically needs a BOM (byte-order mark) so a reader can tell which of the two it is.

UTF-8 should be the standard way to go; it doesn't need a BOM. UTF-16 needs to die, for the reasons you already specified.
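A sketch of the sniffing this forces on consumers (with the caveat that a stream explicitly labelled UTF-16LE or UTF-16BE is supposed to carry no BOM at all):

    /* Sketch: guess an encoding from a leading byte-order mark, if any. */
    #include <stddef.h>
    #include <stdio.h>

    static const char *sniff_bom(const unsigned char *buf, size_t len) {
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return "UTF-8 with BOM (legal, but pointless)";
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            return "UTF-16, little-endian";
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            return "UTF-16, big-endian";
        return "no BOM (probably UTF-8)";
    }

    int main(void) {
        static const unsigned char sample[] = { 0xFF, 0xFE, 'h', 0, 'i', 0 };  /* "hi" as UTF-16LE */
        puts(sniff_bom(sample, sizeof sample));
        return 0;
    }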
Windows, OpenStep, and Java all decided to adopt unicode back when it fitted into 16-bit integers.
But also, it was engineers from American tech companies who decided that 16 bits ought to be enough. So they had the “foresight” to be early adopters, but it was a problem because they didn’t have the common sense to realize 16 bits wasn’t going to be enough.

It’s more complex than just lack of common sense: https://www.unicode.org/history/summary.html
The 1989 plan for Unicode was to include only contemporary character sets and to use unified Han characters. Even Han Unification wasn’t controversial in that setup, because it was normal back then for fonts to be language-specific.
And today, with unification, there are more Han characters than can fit in 16 bits! Yes, there are many rare ones, but people might’ve been miffed if they couldn’t use the rare character that’s in their name when using digital systems.
because it was normal back then for fonts to be language-specific.
CJK fonts are still language-specific, and if your text is tagged as a particular language, it’ll use the right font. Easy enough to do in document formats, harder in plain text.
Right, but that’s a side effect of that: because language-specific fonts were normal, CJK unification keeping that requirement was not considered a massive issue.

And in a way it still isn’t, as advanced text engines can mix and match fonts, but in a world where multilingual documents have become routine the unification itself is a problem.
I feel like that’s an understandable mistake. Big5 was 16 bits and you’d be forgiven for assuming that, to a first approximation, everything that isn’t Chinese characters is a rounding error. Latin, Greek, Arabic and Hebrew use a small corner of the encoding space. But when you put in every character set that’s ever been used and then make up a load of new shiny ones because you let in a few for compatibility with a Japanese dumbphone character set, you realise that 16 bits isn’t quite enough for anyone.
Emoji and historic scripts are also a rounding error compared to Chinese characters… There’s like 100k Han codepoints in Unicode, and 11k Hangul codepoints.
Windows, Java, JS, Apple (though they recently moved away from this mistake), C#/.NET, the Language Server Protocol (which was originally UTF-16 based)… surprisingly, a lot of stuff still uses that abomination.
I was gonna say the same thing! wide characters in 2024? When we have UTF-8? What in the EBCDIC?
But this is probably still needed if you want to interface with the Windows API. Afaik some but not all of their million syscalls have support for UTF-8 for all of their million parameters.

Or the JVM, one of the most widely used platforms in the world.
Between the JVM and JavaScript, it seems we’re going to be stuck with UTF-16 for a while. (Not that I don’t support standardizing on UTF-8 where possible!)
The real mistake is using wide chars (UTF-16). Who does this? Is it a Windows thing?

It is. Also Java.