1. 18

  2. 3

    the Latin Kelvin sign K which is code point U+212A. It has a canonical decomposition to the Latin capital letter K which has the code point U+004B

    Hmm. I guess there’s a reason for this but it’s… interesting that any kind of normalization would just intentionally lose semantics.
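A minimal sketch of the behavior quoted above, using Python's standard `unicodedata` module (the helper name `normalized_forms` is only for illustration). Because the Kelvin sign decomposes canonically to a single code point, all four normalization forms should map it to the plain capital K, so the distinction is lost after normalization:

```python
import unicodedata

KELVIN = "\u212a"   # KELVIN SIGN
LATIN_K = "\u004b"  # LATIN CAPITAL LETTER K


def normalized_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


forms = normalized_forms(KELVIN)
for form, result in forms.items():
    # Show the resulting code points rather than the look-alike glyphs.
    print(form, [f"U+{ord(ch):04X}" for ch in result])

# The raw strings differ, but they compare equal once normalized.
print(KELVIN == LATIN_K)
print(forms["NFC"] == LATIN_K)
```

Printing code points instead of glyphs is deliberate, since the two characters are usually rendered identically.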

    1. 1

      For those interested in Unicode, I would recommend taking a look at the work @FRIGN has been doing for years at suckless. Search for Unicode on his personal website to find two of his talks on the subject (the best talks on the subject I’ve seen so far!), as well as his libgrapheme.

      In the first talk, he goes into a little more technical detail than the article on why to choose UTF-8 (compared to UTF-16 and UTF-32), as well as the reasons to compare strings in canonical form. If I remember correctly, the second talk addresses string length problems with composed characters.
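To make those three points concrete, here is a rough sketch in Python (not taken from the talks; the helper names `encoded_sizes` and `canonically_equal` are made up for illustration). It takes "é" in its precomposed and decomposed spellings, then looks at the encoded size, the naive length, and comparison after normalization:

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def encoded_sizes(text):
    """Byte length of the text in each encoding (no byte order mark)."""
    return {
        enc: len(text.encode(enc))
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }


def canonically_equal(a, b):
    """Compare two strings after bringing both to the same canonical form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(encoded_sizes(composed))
print(encoded_sizes(decomposed))

# len() counts code points, not user-perceived characters.
print(len(composed), len(decomposed))

# A plain comparison sees different code points; the normalized one does not.
print(composed == decomposed)
print(canonically_equal(composed, decomposed))
```

Note that NFC only fixes the length mismatch when a precomposed form exists; counting user-perceived characters in general needs grapheme cluster segmentation, which is the problem libgrapheme targets.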

      1. 2

        Thank you for the compliments! I’m glad you liked my talks on the topic and recommend libgrapheme.

        I had the chance to continue work on libgrapheme this week (not committed yet), working out the long-overdue API stabilization. Even though Unicode remains relatively complicated, I’m glad to know now that the “relevant” algorithms from the specification can be implemented behind a simple C API, which is the goal after all. :)