Always a Raymond Chen fan, but if you don’t read the comments after the post, you’re missing out: there’s an interesting discussion there too about what might be technically correct versus what the users actually want.
To amplify, I agree - that is a good post-article discussion well worth reading.
The more I learn about Unicode, the more problems I see with the design.
For example, the Hungarian alphabet considers “d”, “z” and “dz” to be separate letters. That’s fine; it’s a convention used for sorting Hungarian words in alphabetical order. But it doesn’t logically follow that Unicode should encode the Hungarian letter “dz” differently from the Hungarian two-character string “dz”. That seems wrong, because “dz” and “dz” are not considered to be different things in Hungarian orthography. I am not Hungarian, but I looked at a bunch of Hungarian keyboards online, and I don’t see a “dz” key separate from the “d” key and the “z” key. I also don’t see an extra “title case” shift key that you hold down while pressing the non-existent dz key to get the letter Dz. So at this point I have no reason to think that anybody actually uses the Unicode “dz” character when inputting Hungarian. It’s obviously easier to just type “d” and then “z” than to use some alternative, such as a Linux compose-key sequence or a dead-key sequence. I checked the X11 UTF-8 compose-key table, and I can’t find compose sequences for dz, Dz, or DZ. I checked the X11 Hungarian keyboard layouts and didn’t find dz there either. That’s more evidence that nobody uses these characters.
Aside from adding unnecessary complexity to solve a non-existent problem, this decision by the Unicode committee also introduces bugs into every piece of software that assumes Latin characters have exactly two cases, upper and lower. It’s also a security hole: an adversary can replace the two-character string “dz” with the Unicode “dz” character, which is visually indistinguishable from it, and use it to trick people and software in various well-known ways.
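The confusable-string risk described above is easy to demonstrate in Python. A minimal sketch, assuming only that U+01F3 is the single-code-point dz digraph:

```python
import unicodedata

single = "\u01F3"   # LATIN SMALL LETTER DZ (one code point)
pair = "dz"         # LATIN SMALL LETTER D + LATIN SMALL LETTER Z

print(single == pair)          # False: the strings differ at the code point level
print(len(single), len(pair))  # 1 2

# One common defence: NFKC compatibility normalization folds the
# digraph code point back into the plain two-letter sequence.
folded = unicodedata.normalize("NFKC", single)
print(folded == pair)          # True
```

This is why identifier- or domain-checking code generally normalizes (or outright rejects) compatibility characters before comparing strings.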
This is just my initial impression. I would love to hear from an expert explaining why this feature of Unicode is actually a good idea.
The comments in the submitted post seem to agree with you. The “dz” glyph does not seem to be used in Hungarian script.
Another comment explains this: the digraphs originate in the Yugoslav YUSCII 8-bit character set. Unicode has a policy of ensuring lossless round trips by including code points for anything that is representable as a single code point in any existing character set (which is how emoji were let in).
This is a laudable goal, because without it you’d have a lot of transcription errors. If you can’t losslessly convert YUSCII to a Unicode encoding (such as UTF-8) and back again and have an identical bit pattern at the end, that makes moving from YUSCII to Unicode risky, to a degree that would have prevented adoption.
You can, of course, do things like Unicode normalisation after that conversion if you don’t want to round trip but do want string comparisons to behave sensibly.
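To make that concrete, here is a small Python sketch showing both behaviours: canonical normalization (NFC) leaves the digraph code point alone, which is what keeps round trips lossless, while compatibility normalization (NFKC) folds it to plain “d” + “z” for sensible comparisons.

```python
import unicodedata

dz = "\u01F3"  # LATIN SMALL LETTER DZ

# Canonical normalization preserves the code point: round-trip safe.
assert unicodedata.normalize("NFC", dz) == dz

# Compatibility normalization folds it: comparison-friendly, but lossy.
assert unicodedata.normalize("NFKC", dz) == "dz"
```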
I can also confirm this as a native speaker. I didn’t even know it existed in Unicode, and no other encoding that supported Hungarian letters (fortunately mostly superseded now), such as Latin-2, contained anything like it. Also, it would be greatly unexpected if “madzag” weren’t found when searching for “mad”. As another commenter mentioned, that may be expected for Czech, but we don’t consider it that way. It’s only a unit in pronunciation and hyphenation, nowhere else.
Edit: thinking a bit more about it, I think having this character is technically correct from a linguistic point of view, but still a bit blind to the way the language evolved alongside typewriting and digitalization. Hungarian is close enough to ASCII that early attempts were made to incorporate the truly distinct characters like ő and ű into some encoding, and content was already being produced/read/digitalized in those encodings. This Unicode change would have been breaking, as no naive conversion of existing content could take place: a d next to a z may or may not be this specific letter. Also, the original reason these letters came to be was the Latin alphabet not covering every need of the language (basically no “font coverage” on typewriters, I guess), so we simply made use of humans’ good fuzzy pattern matching.
This sounds very much like Polish rz / sz / cz / ch. They exist as separate sounds in their own right, and you wouldn’t hyphenate between the letters. But they’re definitely not letters/glyphs, and adding them as such to Unicode would be madness.
I wonder if this is an artifact of the Cold War - maybe the designers of the predecessors to Unicode (ISO and codepages) were influenced by linguists outside Hungary before the 90s and didn’t take the “real” conditions into account.
Unicode inherits a lot from old code pages. Indeed, that was a design goal: every single code page should be able to move to Unicode. Straying from that has caused problems too, like Asian texts that need language metadata in order to render the correct glyphs properly.
Yes, that’s right. Unicode is the way it is for a lot of historical reasons, including the politics of working with many other standards organizations across the world, and creating a huge technical artifact with seemingly inadequate resources. It has been remarkably successful, given the challenges. The International Organization for Standardization (ISO) independently started its own Universal Character Set standard (ISO/IEC 10646), but the two efforts were eventually synchronized, and Unicode is what’s supported almost everywhere.
The inclusion of the dz, DZ, Dz characters seems more political than requirements-based, since apparently you don’t use them when converting Hungarian text from old encodings to Unicode.
I can tell the article is by Raymond Chen just by the title. This is what genuine talent looks like.
As always, an interesting discussion. I think I’ll steal the “lav”/“law” comparison; it really makes it easier to understand why diacritics and digraphs are not the same as their Latin counterparts or as separate letters.
Had the exact same thought, saw the title, saw devblogs on Microsoft and opened it immediately, never disappointed!
There’s something similar in Norwegian collation on Linux (possibly other systems too):
Until 1917 “å” was written “aa”, but that was deemed impractical since it could appear next to other a’s or å’s, giving words like “flaaar” (flayer) or even worse “pigghaaaatak” (attack of the spiny dogfish – as this representative sample of vocabulary indicates, Norwegians were primarily living off fishing at the time). Fortunately we only have salt-water sharks or we could get a haaaaaaatak (attack at Shark River, which would be different from a haaaaatak, an attack on/by the village of Håa).
Anyhoo, export LC_ALL=C.UTF-8 to side-step this weirdness.
Croat here: in practice we write them as separate characters, dž instead of dž, etc. This makes correct alphabetical sorting like d -> dž -> đ -> e difficult to implement (and usually we don’t bother).
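For what it’s worth, the digraph-aware ordering can be sketched with a custom sort key in Python. This is a rough illustration, not production collation: the alphabet list is the standard Gajica order, and the greedy tokenization is an assumption that mis-handles compounds where d and ž happen to meet at a morpheme boundary (the same ambiguity the Hungarian commenters describe).

```python
# Croatian alphabet in collation order; "dž", "lj", "nj" are single letters.
CRO = ["a", "b", "c", "č", "ć", "d", "dž", "đ", "e", "f", "g", "h", "i",
       "j", "k", "l", "lj", "m", "n", "nj", "o", "p", "r", "s", "š", "t",
       "u", "v", "z", "ž"]
RANK = {ch: i for i, ch in enumerate(CRO)}
DIGRAPHS = {"dž", "lj", "nj"}

def cro_key(word):
    """Tokenize a word into alphabet units and map each to its rank."""
    units, i, w = [], 0, word.lower()
    while i < len(w):
        if w[i:i + 2] in DIGRAPHS:      # greedily consume a digraph
            units.append(RANK[w[i:i + 2]])
            i += 2
        else:                            # single letter (unknowns sort last)
            units.append(RANK.get(w[i], len(CRO)))
            i += 1
    return units

words = ["đak", "dan", "eto", "džep"]
print(sorted(words, key=cro_key))   # ['dan', 'džep', 'đak', 'eto']
```

A naive code-point sort would instead put “đak” and “džep” after “eto”, since đ and ž sit far above ASCII in the code space.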
A curious aside: this browser (Firefox on Android) is rendering the digraph with tighter kerning than the separated characters too.
I wonder if there are Hungarian words where a d is followed by a z but it’s actually not the letter dz, just two separate letters one after another. How do you handle search then? Do you need a dictionary-aware search, one based on knowing the language’s vocabulary at search time?
There are words like that; we can combine words quite freely, like Germans do. For example, “bridge blockade” could be translated literally as “hídzár”, from híd = bridge and zár = blockade. This is pronounced differently than the letter dz. But the article is kinda wrong there: no Hungarian has ever used that Unicode character; we simply write the two ASCII characters and decide which it is based on the word, just as we did in the days of typewritten books. So proper automatic handling simply can’t be done naively.
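A toy Python sketch of why a naive “dz” → U+01F3 conversion can’t work, using “hídzár” (the compound from the comment above):

```python
DZ = "\u01F3"  # LATIN SMALL LETTER DZ

def naive_convert(text):
    """Blindly replace every d+z pair with the single digraph code point."""
    return text.replace("dz", DZ)

# In "madzag" the dz really is the letter dz, so this happens to be right.
print(naive_convert("madzag"))

# In "hídzár" the d and z belong to different words (híd + zár), so the
# same replacement silently corrupts the text.
print(naive_convert("hídzár"))
```

Telling the two cases apart requires knowing the word, which is exactly the dictionary-aware handling the question above asks about.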
Follow up reading on the subtleties of the other set of title case characters in Unicode: https://opoudjis.net/unicode/unicode_adscript.html
Titlecase is unreasonably complicated. This Unicode issue is one reason. Languages having vague or conflicting style guides for which words to title-case and when is another; Turkish and Azeri having reused code points from the Latin alphabet while capitalizing them differently is yet another. For the latter two issues I started working on decasify about a year ago, but I haven’t gotten to any languages with these digraphs yet. More fun to come. Yikes.
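The three case forms of the digraph can be checked directly in Python, which implements Unicode’s case mappings: U+01F1, U+01F2 and U+01F3 are the upper, title and lower variants of the same letter.

```python
dz = "\u01F3"          # dz  LATIN SMALL LETTER DZ

print(dz.upper())      # "\u01F1"  DZ (uppercase form)
print(dz.title())      # "\u01F2"  Dz (titlecase form, a third case!)
print("dz".title())    # "Dz" -- the plain two-letter ASCII string
```

The single code point is the only place in Latin script where a “letter” has a titlecase form distinct from its uppercase form, which is precisely what breaks software that assumes exactly two cases.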