1. 14
  1.  

  2. 4

    What would you, @FrostKiwi, suggest as an alternative to the Han Unification? Let’s imagine, for a moment, that Unicode doesn’t need to be backwards compatible.

    1. 6

      I’m not OP but I’ve been driven to distraction by the same issue. It’s frustrating that Unicode has 「A」「А」「Α」「𝝖」「𝞐」 but they forced unification of far more visually distinct CJK characters.
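
      A minimal Python sketch of this point, assuming the five characters quoted above are the Latin, Cyrillic, Greek, and two mathematical capital alphas they appear to be. It uses only the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

# The five look-alike letters quoted in the comment above.
LOOKALIKES = ["A", "А", "Α", "𝝖", "𝞐"]

def describe(chars):
    """Return (code point, Unicode name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in chars]

if __name__ == "__main__":
    for code_point, name in describe(LOOKALIKES):
        print(code_point, name)
```

      Each entry should come back with its own code point and name, even though most fonts draw all five almost identically.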

      IMO the Unicode consortium ought to introduce unambiguous variants of all CJK hanzi/kanji that got over-unified, and deprecate the ambiguously unified codepoints. Clients can continue to guess when they need to render an ambiguous codepoint (no better option), and new inputs wouldn’t have the same problem.

      Aside: My solution for Anki was to add a Japanese font (EPSON-KYOKASHO) to the media catalog and reference it in CSS. This works cross-platform, so I get consistent rendering on Android, macOS, and Ubuntu. The relevant card style is:

      @font-face {
       font-family: "EpsonKyoukasho";
       src: url("_EpsonKyoukasho.woff2") format("woff2");
      }
      
      1. 5

        I disagree. You can go into any used bookstore in Japan, and pull a book off the shelf with “Chinese” style pre-war kanji in it. It would be silly to say that those books should be encoded with completely different codepoints than post-war books. Yes, there were a few overly aggressive unifications in the first round, but those have all been disambiguated by now, and what’s left really are just font differences. If you were to approve splitting out all those font differences into new characters, then you’re on the road to separating out the various levels of cursive handwriting, which multiplies the number of characters by at least 4x, maybe more. The point of Unicode is not to preserve a single way that characters look. For that, we have PDF, SVG, etc. It’s just trying to preserve the semantics and handle pre-existing encodings. It’s a blurry line, for sure, but I think they did as well as they could have under the circumstances.

        Now emoji, on the other hand, that’s an ongoing disaster. :-)

        1. 3

          Splitting all the font differences would already cause up to a 4x multiplication just for characters used in mainland China, which has two font standards that apply to both traditional and simplified characters: https://en.wikipedia.org/wiki/Xin_zixing

          The size of the Unicode code space is 1,114,112 [1], and Han characters occupy 98,408 codepoints [2] right now (8.8%). A lot of those characters are rarely used, but unfortunately that also means they tend to be quite complex, made up of a lot of components, which increases their chances of having (at least theoretical) regional variants. If Unicode tries to encode every regional variant of every character as a separate codepoint, there might be a real danger of Han characters exhausting the Unicode code space.

          [1] https://en.wikipedia.org/wiki/Code_point
          [2] https://en.wikipedia.org/wiki/Script_(Unicode)
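
          The arithmetic above can be checked with a short Python sketch. The Han count is taken as quoted in this comment and is not re-verified here; the function names are hypothetical.

```python
# Unicode code points run from U+0000 to U+10FFFF inclusive.
CODE_SPACE = 0x10FFFF + 1  # 1,114,112

# Han code point count as quoted in the comment above (an input, not verified).
HAN_CODEPOINTS = 98_408

def han_share(han=HAN_CODEPOINTS, total=CODE_SPACE):
    """Fraction of the whole code space occupied by Han characters."""
    return han / total

def exhausting_factor(han=HAN_CODEPOINTS, total=CODE_SPACE):
    """Smallest whole multiplier at which Han alone would exceed the code space."""
    return total // han + 1

if __name__ == "__main__":
    print(f"Han share of code space: {han_share():.1%}")
    print(f"Multiplier that overflows the code space: {exhausting_factor()}")
```

          This is an upper bound on the headroom: it ignores code points already assigned to other scripts, surrogates, and private-use areas, so the real limit would be reached sooner.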

          1. 1

            Correction: it would cause a 2x multiplication, not 4x, for characters used in mainland China since simplified and traditional characters are already encoded separately.

          2. 2

            > You can go into any used bookstore in Japan, and pull a book off the shelf with “Chinese” style pre-war kanji in it.

            Readers with a western background might be tempted to reason by analogy to 「a」 having two forms, or how 「S」 looks very different in print vs handwriting, and there are characters (e.g.「令」) for which there are multiple valid shapes, but CJK unification is more like combining Greek, Latin, and Cyrillic.

            I can walk into a bookstore in the USA and find a book written in Greek, but that doesn’t mean 「m」 and 「μ」 should be the same codepoint and treated like a font issue. Iμaginε ρεaδing τεxτ wρiττεn likε τhiς! Sure it’s readable with a bit of effort, but it’s clearly not correct orthography.

            > It would be silly to say that those books should be encoded with completely different codepoints than post-war books. […] If you were to approve splitting out all those font differences into new characters, then you’re on the road to separating out the various levels of cursive handwriting,

            I disagree with your position that character variants between Chinese and Japanese can be universally categorized as “font differences”.

            There are some characters like 「骨」 or 「薩」 for which that’s true, but once the stroke count or radicals become different then I think it’s reasonable to separate the characters into distinct codepoints. It’s especially important if you ever want to have both characters in the same sentence, since per-word language tagging is impractical in most authoring tools.
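
            To make the tagging point concrete, here is a small Python sketch that wraps each run of text in an HTML `span` with a `lang` attribute, which is the kind of hint a renderer can use to pick a regional font for a unified code point. The helper name `tag_runs` and the sample runs are hypothetical.

```python
from html import escape

def tag_runs(runs):
    """Wrap (language tag, text) pairs in <span lang="..."> elements."""
    return "".join(
        f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'
        for lang, text in runs
    )

if __name__ == "__main__":
    # The same code point tagged once as Japanese and once as Simplified Chinese.
    print(tag_runs([("ja", "骨"), ("zh-Hans", "骨")]))
```

            Every run that should render with a different regional glyph needs its own tag, which is the per-word bookkeeping that most authoring tools do not expose.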

            Consider the case of mainland China’s post-war simplification. The simplified characters were assigned separate codepoints, even though following the “CJK unification logic” would have 「魚」 and 「鱼」 be assigned the same codepoint. The fact that I’m able to write that sentence in text (instead of linking pictures) is IMO evidence that separate codepoints are useful.

            Additionally, I think it’s worth pointing out that the Unicode approach to CJK has never been shy about redundant codepoints to represent semantic differences. If there’s separate codepoints for 「骨」「⾻」「⻣」, why is it that 「餅」 doesn’t have variants for both 「𩙿」 and 「飠」?
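
            As a rough illustration of the three “bone” code points, the Python sketch below prints their names and what NFKC normalization does to them. The code points U+9AA8, U+2FBB, and U+2EE3 are an assumption about which characters are quoted above.

```python
import unicodedata

# Assumed code points for the three "bone" characters quoted above.
BONE = [0x9AA8, 0x2FBB, 0x2EE3]

def describe(code_points):
    """Return (code point, name, NFKC-normalized character) for each entry."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append(
            (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.normalize("NFKC", ch))
        )
    return rows

if __name__ == "__main__":
    for code_point, name, folded in describe(BONE):
        print(code_point, name, "-> NFKC:", f"U+{ord(folded):04X}")
```

            The Kangxi radical carries a compatibility mapping back to the unified ideograph, so Unicode itself treats it as a redundant encoding of the same shape.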

            1. 3

              I don’t know the people who decided on Han unification, but I imagine its starting point was the perception of the users of Han characters, rather than abstract principles. Han characters are perceived by their users as the same writing system with some regional differences: indeed, Hanzi, Kanji, Hanja and Chữ Hán all mean “Chinese characters”.

              At the same time, the Greek and Latin alphabets are perceived by their users to be different writing systems, and regional variants of the Latin alphabet are again perceived to be essentially the same writing system.

              You can argue how rational those perceptions are and where the line really “should” be drawn, but that doesn’t change the fact that there are almost universal perceptions in these cases.

              > Consider the case of mainland China’s post-war simplification. The simplified characters were assigned separate codepoints, even though following the “CJK unification logic” would have 「魚」 and 「鱼」 be assigned the same codepoint. The fact that I’m able to write that sentence in text (instead of linking pictures) is IMO evidence that separate codepoints are useful.

              I am again going to point to perceptions, and in this case also China’s post-WW2 simplification. The character reform effort produced not just a traditional vs simplified table, but also an “old form” vs “new form” table to cover stylistic differences [1]. So again a perception exists that the simplified/traditional axis and the “old form”/“new form” axis are distinct. In fact if you look at the “old form”/“new form” table, most of those variants are indeed unified in Unicode.

              > If there’s separate codepoints for 「骨」「⾻」「⻣」

              One of the concessions of Han unification was “round trip integrity”; converting from a pre-existing Han encoding must not lose information. So if a source character set encoded two variants separately, Unicode also had to. This may be one of those cases.

              Or it was just a mistake. There are a lot of mistakes when it comes to encoding Han characters in Unicode.
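
              The round-trip requirement mentioned above can be sketched in Python with the built-in legacy codecs. This only checks that bytes survive a legacy → Unicode → legacy conversion for a couple of sample characters; it does not show which specific duplicates the rule forced. The function name `round_trips` is made up.

```python
# Legacy encodings that predate Unicode, using Python's built-in codec names.
LEGACY_CODECS = ["shift_jis", "gb2312", "big5"]

def round_trips(text, codec):
    """True if legacy bytes survive legacy -> Unicode -> legacy unchanged."""
    legacy_bytes = text.encode(codec)        # stand-in for pre-existing legacy data
    as_unicode = legacy_bytes.decode(codec)  # convert into Unicode
    return as_unicode.encode(codec) == legacy_bytes

if __name__ == "__main__":
    for codec in LEGACY_CODECS:
        print(codec, round_trips("骨冷", codec))
```

              If a source standard had two byte sequences for two variants and Unicode mapped both to one code point, the final comparison would fail for one of them, which is why such pairs had to stay separate.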

              [1] See https://zh.wikipedia.org/wiki/新字形; I linked to the English version of the article in a sibling comment, but unfortunately the English page leaves out the most interesting part, the table itself.

              1. 2

                Yeah. A thought experiment is, should we undo “Roman unification” and have separate codepoints for English, French, and German? English orthography and French orthography are different! The French have ç and « and different spacing around punctuation. Uppercase and lowercase Roman letters are separate, but shouldn’t we also always use different codepoints for italics, since historically they evolved separately and only got added in later? A complication is that in practice, because Unicode incorporates different encodings, it ends up having a bunch of unsystematic repeats of the Roman alphabet, including italics, but they’re only used in places like Twitter because you can’t use rich text. Users prefer to think about these things as “the same letter” presented different ways.

                Specifically on the issue of stroke counts, stroke counts are not stable. People can count the same character differently, even when it is visually identical. There are lots of posts about this on Language Log. Here is a typical one: https://languagelog.ldc.upenn.edu/nll/?p=39875

          3. 1

            Thank you for your elaborate response! Why don’t you write a Unicode proposal? It’s not as unreachable as it might sound and they are usually open to suggestions. If you take your time, you might be able to “rise” in the ranks and make such a difference. These things can change the world.

            1. 3

              > Thank you for your elaborate response! Why don’t you write a Unicode proposal?

              More knowledgeable and motivated people than I have spent person-centuries trying to convince the Unicode consortium to change their approach to CJK, so far largely without success.

              > If you take your time, you might be able to “rise” in the ranks and make such a difference.

              It’s difficult to express how little interest I have in arguing with a stifling bureaucracy to solve a problem caused by its own stubbornness. If I ever find myself with that much free time I’ll sign up for an MMORPG.

              1. 1

                This is very sad indeed…

          4. 3

            For me it’s rather straightforward. Unicode even defines blocks of characters that were found on an obscure ancient rock [1], and some codepoint inclusions are so rare as to exist only once in script, and even then it was written wrong and had to be corrected [2].

            And yet it can’t properly capture a language, making Unicode essentially “implementation defined”. Not much of a standard, then. If I look into the source code of some webapp repos I have access to, there are over 100 MB of just font files: Japanese - Bold, Japanese - Normal, Japanese - Italics, Japanese - Italics Bold, Traditional Chinese - Bold, Traditional Chinese - Normal … you get the idea.

            If we have space for obscure stuff, then we have the space to properly define a language and add the important sinographs whose absence forces multiple font file copies to be distributed just to support multiple Asian languages. And all of that is just one new codepoint block away, which would even be backwards compatible. (Of course there are more modern open font formats that include all variants, storing only the regionally different glyphs as a difference and zip-compressing it all, but I think fixing this at the Unicode level is still in the realm of possibility.)

            [1] YouTube: ᚛ᚈᚑᚋ ᚄᚉᚑᚈᚈ᚜ and ᚛ᚑᚌᚐᚋ᚜

            [2] Wikipedia

            1. 2

              What an unsatisfying situation we are in! Thanks for your elaborate response; this gave me great context!

              1. 1

                I’m not super confident about this, but the Unicode code space might not be big enough to contain all regional variants of all Han characters (https://lobste.rs/s/krune2/font_regional_variants_are_hard#c_1ccl2m).

                Also, if you open the floodgates of encoding regional variants, you probably also want to encode all historical variants, and that I’m sure will use up the Unicode code space…

                > Unicode even defines blocks of characters that were found on an obscure ancient rock.

                The point is that they are distinct characters, not glyphs of known characters. Unicode won’t add codepoints when a new medieval manuscript has a new way to write the Latin letter a, for example.

                1. 1

                  > The point is that they are distinct characters, not glyphs of known characters.

                  Ohh, now I get your point. There indeed is a difference.

                  I guess this boils down to certain sets of sinographs teetering on the edge of being a distinct character vs a different glyph of the same one. One of the more egregious examples changes the number of strokes in the radicals 𩙿 vs 飠, as mentioned above by @jmillikin. My personal frustration is the radical 令 (order) being flipped in Chinese vs Japanese to what looks almost like a handwritten version of 今 (now), like in the word 冷房. *valid, but bad example by me, as explained here

                  The job of the consortium is indeed not easy here. I’m just sad to see there being not enough exceptions, even though things like 𠮟る vs 叱る received new code points, thus amending Unicode after-the-fact, for crossing that line from glyph to distinct meaning, yet the more deceptively different looking examples did not.
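
                  For what it’s worth, the 𠮟/叱 split is easy to inspect in Python. The sketch below (helper name `describe` is hypothetical) reports each character’s code point and how many bytes or UTF-16 code units it takes.

```python
def describe(ch):
    """Return the code point and encoded sizes of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }

if __name__ == "__main__":
    for ch in ("叱", "𠮟"):
        print(ch, describe(ch))
```

                  The later addition lives outside the Basic Multilingual Plane, so it needs a surrogate pair in UTF-16, which is one practical cost of disambiguating after the fact.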

                  1. 1

                    > I guess this boils down to certain sets of sinographs teetering on the edge of being a distinct character vs a different glyph of the same one.

                    It’s true that a lot of cases straddle the line between different fonts and different characters, but for most font variations (like the 冷房 example you gave) there is no question that they are the same characters.

                    Cross-language font differences really only appear significant when you consider the different standard typeset fonts. They are usually smaller than the difference between typeset and handwritten fonts, and often smaller than the difference between fonts from different eras in the same language.

                    1. 1

                      > With my personal frustration being the radical 令 (order) being flipped in Chinese vs Japanese to what looks almost like a hand written version of 今 (now) like in the word 冷房.

                      If it’s any consolation, the right-hand version of 「冷」 is totally normal in typeset Japanese. I live in Tokyo and see it all the time (on advertisements, etc). You may want to practice recognizing both typeset and handwritten forms of some kanji, though when you practice writing it’s best to practice only the handwritten form.

                      If you’re learning Japanese and live overseas, Google Images can be a useful resource to find examples of kanji used in context:

                      https://www.google.com/search?q=冷房&source=lnms&tbm=isch

                      Example search result: https://item.rakuten.co.jp/arimas/518125/

                      You can also look for words containing similar kanji. In this case, there’s a popular genre of young-adult fiction involving “villainess” (悪役令嬢, “antagonist young lady”) characters, so you can see a lot of examples of book covers:

                      https://www.google.com/search?q=悪役令嬢&source=lnms&tbm=isch

                      1. 1

                        Ohh, that’s a misunderstanding; I should have worded my comment better. I am fully aware of that and have no issue reading it. It’s just that the Chinese version of the code point has a glyph that has a totally different appearance: https://raw.githubusercontent.com/FrostKiwi/Nuklear-wiki/main/sc-vs-jp.png

                        1. 2

                          I’m sorry, I’m having trouble understanding what you mean. In your linked image, both versions of 「冷」 look normal to me. The left hand side is handwriting, the right hand is typeset. It’s similar to how “a” or “4” have two commonly used forms.

                          There’s some Japanese-language coverage of the topic at the following links, if you’re interested:

                          The important thing is to not accidentally use the typeset version when you are practicing handwriting.

                          1. 1

                            Oooooooooooh. Today I learned^^ When I handwrite I am fully aware of radicals like 言、心、必 and friends being different when compared to the printed version. I was not aware the same is true for 令. Many thanks for explaining! The misunderstanding was on my part, thx for your patience.

              2. 4

                We can see similar issues in Europe, too. For example, Serbian and Bulgarian Cyrillic have some subtle differences, and readability depends on the font used. Most fonts are optimized for, or only include, the Russian/Bulgarian variants, but not the Serbian/Macedonian ones. For example, b, p, and t are written differently, while other letters are pretty similar.

                1. 2

                  Ouch, that is a gotcha. Although I can understand the Unicode Consortium not wanting duplicate varieties for every variant of every kanji character out there.