Worth noting that up until the conclusion, the article uses “Unicode” as a synonym for UCS-2, which was not unreasonable at the time, and the arguments mostly hold up; UCS-2 was replaced by UTF-16 and then (mostly) by UTF-8, because UCS-2 was fatally flawed.
He argues that simultaneously supporting UTF-8, UTF-16, and UTF-32 is infeasible, and that has been borne out. What he missed is that we ended up picking just one of them (UTF-8, for the most part), and that’s worked out great.
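To make the comparison concrete, here’s a quick Python sketch (my own illustration, not from the article) of the same three-character string in all three encoding forms. Only the astral-plane hanzi needs a surrogate pair in UTF-16 — which is exactly the case UCS-2 couldn’t represent:

```python
# The same three-character string in the three Unicode encoding forms.
# "€" (U+20AC) and "漢" (U+6F22) are in the BMP; "𠜎" (U+2070E) is astral.
s = "\u20ac\u6f22\U0002070e"

utf8  = s.encode("utf-8")      # 3 + 3 + 4 = 10 bytes, variable length
utf16 = s.encode("utf-16-be")  # 2 + 2 + 4 =  8 bytes; surrogate pair for 𠜎
utf32 = s.encode("utf-32-be")  # 4 + 4 + 4 = 12 bytes, fixed length

print(len(utf8), len(utf16), len(utf32))  # → 10 8 12
```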
Verisign recently opened a Pandora’s Box when the company stated that it was taking orders for URLs in the language particular to those countries which either desire or demand to work in a written set other than Latin1.
This statement betrays a fundamental confusion about domain names that makes me a little suspicious of the author’s technical grasp of the subject, though his historical and cultural insights seem sound.
Yeah, I was confused while reading this, because the crux of his argument was that “Unicode” only supported about 65k distinct characters, which isn’t enough for every single possible hanzi in some dictionary. But I know that today Unicode supports some 1.1 million possible code points (1,114,112, to be exact), which is actually enough for every single possible hanzi, even uncommon ones. That change must have happened after he wrote this article (maybe precisely in order to solve the problem of supporting enough rare hanzi?)
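The current numbers are easy to check from Python (my own illustration; the example hanzi is one I picked, not one from the article):

```python
import sys

# Unicode's code space today: 17 planes of 65,536 code points each.
assert 17 * 65_536 == 1_114_112
assert sys.maxunicode == 0x10FFFF  # Python strings cover the whole range

# A hanzi from CJK Extension B, outside the old 65k (UCS-2) limit:
rare = "\U00020B9F"  # 𠮟
print(hex(ord(rare)))  # → 0x20b9f — doesn't fit in 16 bits
```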
Unicode’s original plan was to unify the world’s digital character sets; while the Chinese language might have fifty thousand characters, the Chinese government’s GB2312 character encoding contains only 6,763, a much more practical number.
After Unicode 1.x was released, ISO showed up and said “hey, that’s a nice universal character set, we were planning on making one too, why don’t we join forces? Of course, encoding digital character sets isn’t enough, we want to encode EVERY CHARACTER EVER!” And so Unicode 2.0 came out, introducing UTF-16, surrogate pairs, the Basic Multilingual Plane and all the astral planes so as to have enough room for everything.
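The surrogate-pair mechanism Unicode 2.0 introduced is just a bit of arithmetic: code points above U+FFFF are offset by 0x10000 and the remaining 20 bits are split across a high and a low surrogate. A minimal sketch (my own, cross-checked against Python’s encoder):

```python
# UTF-16 surrogate-pair arithmetic, as specified since Unicode 2.0.
def to_surrogates(cp: int) -> tuple[int, int]:
    assert cp > 0xFFFF                    # only astral code points need this
    v = cp - 0x10000                      # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

hi, lo = to_surrogates(0x2070E)           # 𠜎, an astral-plane hanzi
# Cross-check against Python's UTF-16 encoder (big-endian, no BOM):
assert "\U0002070E".encode("utf-16-be") == hi.to_bytes(2, "big") + lo.to_bytes(2, "big")
```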
A consequence that hadn’t soaked in for me is that you need to know the language of a piece of text to reliably display it correctly. I’m guessing Twitter and Facebook and such are assigning languages to comments, etc. based on content, the poster’s language preferences/location/Accept-Language, and who knows what else.
(And then if a user switches into not-their-native-language for a post, or mixes languages by quoting text in another language or whatever, there’s a whole other level of difficulty.)
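A toy sketch of the kind of script-based heuristic such a system might start from (the function name and ranges are mine; real systems layer statistical models and user metadata on top of this):

```python
# Guess the language of CJK text from script membership alone.
# Kana and Hangul are unambiguous; Han-unified ideographs are not,
# which is exactly why display needs a language tag.
def guess_cjk_language(text: str) -> str:
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:        # Hiragana / Katakana
            scripts.add("ja")
        elif 0xAC00 <= cp <= 0xD7A3:      # Hangul syllables
            scripts.add("ko")
        elif 0x4E00 <= cp <= 0x9FFF:      # Unified CJK ideographs (ambiguous!)
            scripts.add("han")
    if "ja" in scripts:
        return "ja"
    if "ko" in scripts:
        return "ko"
    if "han" in scripts:
        return "zh?"   # could be Chinese, Japanese, or Korean — can't tell
    return "unknown"

print(guess_cjk_language("こんにちは"))  # → ja
print(guess_cjk_language("你好"))        # → zh?
```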
I remember the big hubbub over multilingual domain names in 2000-02.
Multilingual domain names under fire (2000)
Reuters Business: VeriSign’s Multilingual Domains (2001)
Wikipedia Timeline
The main topic discussed here is now known as the Han Unification, for those curious to catch up with what’s happened since this was written.