1. 38

  2. 13

    Ordinarily I really enjoy Raymond Chen’s blog and learn a lot from it. In this case, I’m pretty sure he’s just spouting bullshit.

    I tried to track down the sources of information, and it seems this blog is a rehash of a Stack Overflow post. But that post seems to be wrong to me. It ascribes to KS X 1026 the idea that jamos are “distinct from” their combined form, which doesn’t make sense to me. It strongly implies that the Korean standard calls for grapheme cluster boundaries to be placed between jamo, for example between an L and a V, but that’s just not the case.

    Now one thing that is true is that in KS X 1026, encoding a syllable as a sequence of jamo is “not recommended” if the same syllable can be encoded as a precomposed syllable. But I don’t see anywhere in the standard permission for software to go around treating NFD as separate graphemes.

    I have a very good guess what is going on in that spec. Modern Korean is not a problem, but there are archaic letters such as ㅿ (Hangul Letter Pansios). If a syllable (가ᇫ for example) contains one of these, there is no precomposed syllable, so that jamo needs to be encoded separately. The Unicode spec gives two choices: NFD, where each of the three jamo is encoded separately, or NFC, which composes as much as possible. In modern Korean, NFC gives you the precomposed syllable blocks, which is all good. But NFC takes the archaic example to U+AC00 U+11EB, which some software has problems with. And it’s not hard for me to see why text layout algorithms would have trouble with it; it’s a mix of a precomposed syllable (the first two jamo) and an individual jamo (the last one, which is archaic). It’s easier to deal with just one or the other.
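
    To make the two normalization forms concrete, here is a small Python sketch using only the standard `unicodedata` module. The helper name `codepoints` is made up for illustration; the input is the 가ᇫ example spelled out as three conjoining jamo.

```python
import unicodedata

def codepoints(s):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Choseong kiyeok + jungseong a + jongseong pansios (archaic trailing consonant).
decomposed = "\u1100\u1161\u11eb"

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", nfc)

print(codepoints(nfc))  # U+AC00 U+11EB: precomposed LV plus a loose trailing jamo
print(codepoints(nfd))  # U+1100 U+1161 U+11EB: three separate jamo

# A modern syllable composes fully, because its trailing consonant
# falls inside the range that has precomposed forms.
modern = unicodedata.normalize("NFC", "\u1100\u1161\u11a8")
print(codepoints(modern))  # U+AC01
```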

    So what the standard attempts to do (misguidedly, in my opinion), is to recommend an encoding which is neither purely NFC nor NFD, but kinda code switches between them. Modern Korean is NFC, but as soon as you get one of these archaic letters, the syllable goes to NFD. The reason I think it’s misguided is that you’re putting a burden on general software to use these funky normalization rules just to make life easier for text layout that’s not smart enough to deal with the generality of Unicode.
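
    A rough sketch of what such a mixed form could look like in Python. This only illustrates the idea as I describe it above and is not the algorithm from KS X 1026; the function name `to_mixed_form` and the rule "decompose a precomposed syllable when a conjoining vowel or trailing jamo follows it" are my own assumptions.

```python
import unicodedata

def is_precomposed_syllable(ch):
    # Hangul Syllables block.
    return 0xAC00 <= ord(ch) <= 0xD7A3

def is_vowel_or_trailing_jamo(ch):
    # Conjoining vowels and trailing consonants, including Extended-B.
    cp = ord(ch)
    return (0x1160 <= cp <= 0x11FF
            or 0xD7B0 <= cp <= 0xD7C6
            or 0xD7CB <= cp <= 0xD7FB)

def to_mixed_form(text):
    """Hypothetical: NFC everywhere, except syllables that NFC leaves
    half-composed, which are spelled out entirely in jamo instead."""
    nfc = unicodedata.normalize("NFC", text)
    out = []
    for i, ch in enumerate(nfc):
        following = nfc[i + 1] if i + 1 < len(nfc) else ""
        if (is_precomposed_syllable(ch) and following
                and is_vowel_or_trailing_jamo(following)):
            # A loose jamo continues this syllable, so decompose the block.
            out.append(unicodedata.normalize("NFD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

    With this sketch, modern text comes out the same as plain NFC, while U+AC00 U+11EB comes out as U+1100 U+1161 U+11EB.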

    Now Postel’s robustness principle - “Be liberal in what you accept, and conservative in what you send” - is debated, but it’s a pretty good rule for text layout. And it’s a fairly firm principle of Unicode that NFD and NFC should generally be treated the same. What I think happened here is that Microsoft misinterpreted KS X 1026 as saying they could expect all text to be encoded according to the funky normalization rules, so their software wouldn’t have to display pure NFC or pure NFD correctly if that’s what it received.

    I could be wrong here, but I also don’t think the blog is right.

    1. 8

      I don’t speak Korean but I’ve enjoyed my travels there. Hangul is one of the most elegant and beautiful writing systems out there.

      1. 6

        It is. It’s one of those times when you read about something on Wikipedia, then read about the person who created it, and go “oh, so THAT’S what genius looks like in distant history”.

        1. 3

          Sequoyah probably comes the closest in terms of creating an efficient and easy to learn script.

      2. 6

        Happy 한글 day! Korean characters decompose to individual jamo under Unicode normalisation NFD and compose back up to whole syllable characters with normalisation NFC :)

        As a Korean learner, it was really interesting seeing how the (seemingly simple!) alphabet is implemented under Unicode. I wrote a Minecraft mod for Korean input too!

        1. 2

          Another Windows + Korean fun fact! The default font on a Korean version of Windows has the backslash (\) symbol made to look like the Korean Won currency symbol (₩, often 원), so you get funny looking paths in cmd.exe, with path elements delimited by Won.

          1. 3

            Same with ¥ on Japanese locales.

          2. 1

            Maybe this is in the post and I missed it, but I still don’t understand why all other implementations, including ICU, ignore the supposed standard.

            1. 3

              Second comment:

              Are you sure this isn’t a case of competing standards rather than ICU ignoring or following a mere de facto standard? If I look at UAX #29, which contains Unicode’s rules for grapheme clustering, it appears to state pretty clearly: “Do not break Hangul syllable sequences.”
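
              For reference, that heading covers rules GB6 to GB8 in UAX #29. Below is a simplified Python sketch of just those three rules. It ignores every other grapheme cluster rule (combining marks, ZWJ, control characters and so on), and the names `hangul_type`, `NO_BREAK` and `hangul_clusters` are made up for illustration.

```python
def hangul_type(ch):
    """Hangul_Syllable_Type of a character, or None for non-Hangul."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"    # leading consonant jamo
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"    # vowel jamo
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"    # trailing consonant jamo
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

# For each type on the left, the types that may follow without a break.
NO_BREAK = {
    "L": {"L", "V", "LV", "LVT"},   # GB6
    "LV": {"V", "T"},               # GB7
    "V": {"V", "T"},                # GB7
    "LVT": {"T"},                   # GB8
    "T": {"T"},                     # GB8
}

def hangul_clusters(text):
    """Split text into clusters using only the Hangul rules above."""
    clusters = []
    for ch in text:
        prev = hangul_type(clusters[-1][-1]) if clusters else None
        if prev is not None and hangul_type(ch) in NO_BREAK[prev]:
            clusters[-1] += ch      # same syllable, keep it together
        else:
            clusters.append(ch)     # boundary before this character
    return clusters
```

              Under these rules the NFD and NFC spellings of a syllable both stay in a single cluster, including the half-composed U+AC00 U+11EB case.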

            2. 1

              On Mac, shell script arguments are passed under NFD normalization (individual jamo) when you usually want NFC (syllable blocks).

              Is there a way to change that default? I’ve been using this to convert to NFC, which doesn’t seem elegant:

              echo "$1" | node -e 'process.stdout.write(require("fs").readFileSync(0, "utf-8").normalize("NFC"))' | ...

              1. 3

                They are likely passed however they were typed, so it would be an input system setting more than a shell setting.