The sooner we all move to UTF-8 the better. There’s simply no reason to use anything else (except maybe UTF-32 for internal representations), and I dare you to give me an argument for any of the other encodings.
The only argument given in favour of UTF-16 over UTF-8 that isn’t immediately invalidated is that text written mostly in some Asian scripts is smaller in UTF-16 than in UTF-8. Even that is not a good argument, though, because such text usually lives inside some form of markup language (like HTML) that is mostly ASCII, which gives UTF-8 the overall edge for a given document in said language.
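To make the markup point concrete, here is a rough Python sketch (the HTML fragment is just a made-up example) comparing the two encodings for a tiny snippet of Chinese wrapped in markup:

```python
# A small illustration of the markup argument: the ASCII-heavy tags dominate,
# so UTF-8 ends up smaller overall even though each CJK glyph costs 3 bytes.
html = "<p class='intro'>你好，世界</p>"   # 21 ASCII characters + 5 CJK characters
print(len(html.encode("utf-8")))      # 36 bytes: 21 ASCII + 5 * 3
print(len(html.encode("utf-16-le")))  # 52 bytes: 26 code points * 2
```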
The aspect that really crushes UTF-16 et al. is the necessity of BOMs (byte order marks), and many, many implementations omit them or handle them incorrectly. Don’t even get me started on surrogates (which ruined parts of the Unicode spec because code point ranges had to be reserved for them!).
Seriously, wherever you can, please use UTF-8 everywhere. And god bless Rob Pike and Ken Thompson for their stroke of genius while designing UTF-8.
Even the argument for UTF-32 as an internal representation is very suspect in my opinion. It allows you to treat all code points as the same size, but I’m not sure of any use case where that’s actually an advantage.
You still can’t treat every code point as its own atomic unit of text. You can’t delete individual code points. You can’t reverse a string based on its code points. One glyph can be made of multiple code points; if you reverse the string “Hello 👋🏿” (“Hello <waving hand with dark skin tone>”), you end up with the string “🏿👋 olleH” (“<dark brown><waving hand with default yellow skin> olleH”). So UTF-32 kind of just seems like an extremely space-inefficient variable-width text encoding.
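The reversal example above is easy to reproduce; here is a minimal Python sketch (Python strings are sequences of code points, so slicing reverses code points, not glyphs):

```python
# Reversing by code points detaches the skin-tone modifier from the hand,
# exactly as described above.
s = "Hello 👋🏿"   # U+1F44B waving hand + U+1F3FF dark skin tone modifier
print(s[::-1])    # '🏿👋 olleH' – the modifier now stands on its own
```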
I 100% support UTF-8 everywhere - even as an in-memory string representation.
I agree with you and this is why I made this the default in my grapheme cluster detection library (but you can also check for grapheme boundaries between two CPs “manually”).
In the end, if you really want to support grapheme clusters, you will have to deal with variable-length characters anyway. Many people still treat code points and drawn characters as if they were the same thing.
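For comparison, a hedged sketch using the third-party Python regex package (an assumption on my part, not the C library mentioned above), whose \X pattern matches extended grapheme clusters:

```python
# Segmenting by grapheme cluster keeps the hand and its modifier together,
# so operations like reversal no longer tear glyphs apart.
import regex  # third-party package: pip install regex

clusters = regex.findall(r"\X", "Hello 👋🏿")
print(clusters)                     # ['H', 'e', 'l', 'l', 'o', ' ', '👋🏿']
print("".join(reversed(clusters)))  # '👋🏿 olleH' – the glyph stays intact
```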
UTF-32 is pretty bad for internal representations, too.
The CJK argument (when the argument is made, “Asian” really means CJK) for UTF-16 is not a very convincing one even without markup. You want to pick one encoding globally instead of choosing contextually. When measuring bytes used for the same amount of human-perceivable information content, UTF-8 isn’t particularly unfair to CJK. See the table at the end of https://hsivonen.fi/string-length/
If I am reading that table correctly, for any of the Chinese variants the UTF-16 encoding is around two thirds the size of the UTF-8 encoding (a third the number of code units). The article is arguing something different: that a quota measured in UTF-8 code units isn’t unfair to CJK languages, because they encode more information per Unicode code point, which makes up for requiring more bytes per code point than other encodings.
I don’t entirely buy @FRIGN’s argument for in-memory strings (though I’m willing to accept it for interchange), because when I process rich text, I don’t process it as HTML or similar; I process it as a string with metadata that is not stored as inline control characters, and only serialise to HTML (or RTF, or whatever) at the edge of the program. When I’m doing any processing on the text that doesn’t care about the metadata, being able to fit larger strings in my L1 cache is a perf win (especially if I need to keep some tables for the extended grapheme cluster range calculation in the L1 as well). There’s also a big win from using ASCII as an in-memory representation for things that can be losslessly stored as ASCII, because knowing that up front guarantees that one byte = one Unicode code unit = one grapheme cluster, which makes a lot of processing simpler.
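As a rough illustration of that ASCII fast path (a sketch only; the function name and the upper-casing task are made up for the example):

```python
# If an up-front check proves the text is pure ASCII, one byte == one code
# unit == one grapheme cluster, so the hot loop can work on raw bytes.
def upper_ascii_fast(text: str) -> str:
    if text.isascii():                        # str.isascii() is Python 3.7+
        data = bytearray(text.encode("ascii"))
        for i, b in enumerate(data):          # every byte is a whole character
            if 0x61 <= b <= 0x7A:             # 'a'..'z'
                data[i] = b - 0x20
        return data.decode("ascii")
    return text.upper()                       # general path for non-ASCII input

print(upper_ascii_fast("hello, world"))       # HELLO, WORLD
```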
For network bandwidth and persistent storage, UTF-8 is fine for two reasons:
Text is tiny in comparison to most other kinds of media. A picture is worth a thousand words. A video is worth a few million. Unless you’re storing huge amounts of text (e.g. all of Wikipedia), the difference between encodings is lost in the noise.
None of the UTF-* variants is a compression algorithm. If you want to store a very large amount of text, use a compression algorithm. Even a fairly simple dictionary gives a huge win: a typical English vocabulary is around 20,000 words, with around 2,000 in common use. Without doing anything clever, you should be able to store most English sentences with 16 bits per word, which makes even UTF-8 for English look incredibly bloated.
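A back-of-the-envelope sketch of that dictionary idea (the word list and helper are purely hypothetical; a real system would handle punctuation, casing, and out-of-vocabulary words):

```python
# Pack each known word into a 16-bit dictionary index: ~2 bytes per word
# versus ~6 bytes per word (5 letters + space) for plain UTF-8 English text.
import struct

dictionary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
index = {word: i for i, word in enumerate(dictionary)}

def encode(sentence: str) -> bytes:
    return b"".join(struct.pack("<H", index[w]) for w in sentence.split())

text = "the quick brown fox jumps over the lazy dog"
print(len(text.encode("utf-8")), "bytes as UTF-8")   # 43 bytes
print(len(encode(text)), "bytes as 16-bit indices")  # 18 bytes
```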
When measuring bytes used for the same amount of human-perceivable information content, UTF-8 isn’t particularly unfair to CJK

I think you missed an important part: UTF-8 is unfair to CJK glyph for glyph, while UTF-16 makes “western” glyphs the same size as CJK glyphs. However, CJK glyphs usually contain more information than “western” glyphs, so it’s “fair” to encode ASCII using one byte per glyph at the cost of using more bytes per glyph for CJK.
Analyzing the exact information content becomes difficult, but as a rough approximation, we can say that the average word length in English documents is around 5 ASCII characters, while 2-glyph Chinese words are extremely common. Therefore, it’s “fair” for Chinese glyphs to be encoded using around 2.5x as many bytes on average as English glyphs, because each Chinese glyph contains 2.5x the amount of information.
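Putting rough numbers on that (using only the averages assumed above):

```python
# Per *word* rather than per glyph, UTF-8 treats the two languages similarly.
english_word_utf8 = 5 + 1      # ~5 letters plus a space, 1 byte each -> 6 bytes
chinese_word_utf8 = 2 * 3      # ~2 glyphs, 3 bytes each in UTF-8     -> 6 bytes
chinese_word_utf16 = 2 * 2     # the same word in UTF-16              -> 4 bytes
print(english_word_utf8, chinese_word_utf8, chinese_word_utf16)  # 6 6 4
# UTF-16 would shave roughly a third off Chinese text, but measured per word
# of content, UTF-8 charges English and Chinese about the same.
```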
Again, this gets complicated and I don’t have the expertise necessary to do a real analysis, but this should give an idea of why hsivonen claims UTF-8 isn’t particularly unfair to CJK.
I didn’t miss that; it’s exactly my point. If you are processing a lot of Chinese text, UTF-16 as an in-memory representation will have better memory and cache performance by a fairly significant margin. If you’re imposing per-user (per-message, or whatever) memory quotas, then using a UTF-8 encoding won’t particularly penalise users who write in CJK languages relative to English. That’s an odd thing to focus on, because it matters for protocols with maximum-length messages but doesn’t impact performance at all in most cases.
I agree, and https://www.oilshell.org/ is UTF-8 only, except where it calls libc, for say glob() or regexec().
There it inherits libc locales, which are messy and incoherent. They are unfortunately part of C and POSIX so I don’t think they’re ever going away.
The whole idea of the encoding being a global variable in a PROGRAM makes no sense, let alone in a library inside a program. The encoding is a property of the DATA, not of the program that’s processing it!
In a non-networked world, you could imagine that, say, all the manuals on an entire Unix system are in a single encoding. But we passed that point 30 years ago, so obviously you can have one file that’s UTF-8 and another that’s UTF-16, and a shell has to look at them both.
Unlike HTTP, a Unix file system has no place for metadata. The only coherent solution is to use UTF-8, because you can perform almost all useful operations on it by treating it as a blob of bytes – in particular substring searching, like grep does, or like a shell parser does for keywords and operators (for, |, etc.).
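A quick Python sketch of why byte-level search is safe on UTF-8 (treating both strings as raw bytes; the self-synchronising design means a valid needle can never match starting in the middle of another character):

```python
# grep-style substring search over UTF-8 as plain bytes: lead bytes and
# continuation bytes are disjoint, so matches always land on character
# boundaries – no decoding or locale handling required.
haystack = "for f in *.txt; do grep naïve \"$f\"; done".encode("utf-8")
needle = "naïve".encode("utf-8")
print(haystack.find(needle))   # byte offset of the match, -1 if absent
```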
grep and sort are also slowed down by an order of magnitude due to the locale, which annoys me. Compare LC_ALL=C sort to sort on most Linux systems.
The Good Thing about ye olde ASCII was that if you had a vast steaming pile of files created by a rambling, ever-changing herd of cats, you could read it, tweak it, write it back… and the only thing that changed was the bits you tweaked.
Try that while believing that the herd of cats have all set their editors to UTF-8… Ha!
And you are bound to get slapped with an invalid code point exception…
Ok, so you then do some tedious magic to squash all invalid code points to a magic value, do your tweak…. and then you have unexpected deltas all over the place.
Sigh.
With Vim at least, you can set ‘binary’ and it’ll leave arbitrary weirdness alone.
Conversely, that setting no doubt allows you to create invalid UTF-8 weirdness.
Yep! Which is what you want when editing arbitrary buffers of bytes.
Sadly, the vast steaming pile needs to be linked into a cohesive product, so that answer didn’t work in the long run.
The solution is uchardet and iconv: uchardet to guess which encoding the cat had their editor set to, iconv to convert it to UTF-8.
Fix them all up.
Then set build tools to die noisily on invalid code point…
When cat gets unhappy build tools aren’t working… tell them about uchardet and iconv and remind them about the required encoding.
Tedious, but works.
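For what it’s worth, that workflow is easy to script; a hedged Python sketch (assuming uchardet and iconv are installed and on PATH):

```python
# Guess a file's encoding with uchardet, then re-encode it as UTF-8 via iconv.
import subprocess, sys

path = sys.argv[1]
detected = subprocess.run(["uchardet", path], capture_output=True,
                          text=True, check=True).stdout.strip()
converted = subprocess.run(["iconv", "-f", detected, "-t", "UTF-8", path],
                           capture_output=True, check=True).stdout
sys.stdout.buffer.write(converted)   # the UTF-8 version goes to stdout
```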
This video is also great (more or less the same story, a bit less technical):
‘Characters, Symbols and the Unicode Miracle - Computerphile’ https://www.youtube.com/watch?v=MijmeoH9LT4
I think that site is having some issues right now.
https://web.archive.org/web/20210324065459/http://doc.cat-v.org/bell_labs/utf-8_history