
    This is neat! But it takes as input a UTF-16 character (wchar_t), when in my experience nearly all Unicode text is encoded as UTF-8*. So that would require an additional step to decode 1-4 UTF-8 bytes to a Unicode codepoint (sketched below). (Not sure what one does with codepoints above U+FFFF, where stuff like emoji lives — maybe there are no alphabets up there?)

    It’d be cool to have an algorithm that directly read and wrote UTF-8. I imagine there’s one out there already somewhere.

    * Obj-C’s NSString uses UTF-16 in its API, but I know it internally stores strings in several encodings, including ASCII, to save space; not sure if it uses UTF-8. (Swift moved to an internal UTF-8 representation.) I hear Windows APIs use UTF-16? Mostly, UTF-16 seems like a relic from the period before the Unicode architects realized that 65,536 characters weren’t enough for everything.
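
    A minimal sketch of that extra step, assuming a hypothetical codepoint-level routine (`fold_codepoint` below, a trivial stand-in for the table lookup under discussion): decode one UTF-8 sequence to a codepoint, transform it, re-encode. Validation is deliberately loose; overlong encodings and surrogates aren't rejected.

    ```c
    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Trivial ASCII-only stand-in for the real table-driven routine. */
    static uint32_t fold_codepoint(uint32_t cp)
    {
        return (cp >= 'a' && cp <= 'z') ? cp - 32 : cp;
    }

    /* Decode one UTF-8 sequence; returns bytes consumed, 0 on malformed input. */
    static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
    {
        if (len == 0) return 0;
        if (s[0] < 0x80) { *cp = s[0]; return 1; }            /* ASCII fast path */
        size_t n = s[0] >= 0xF0 ? 4 : s[0] >= 0xE0 ? 3 : s[0] >= 0xC0 ? 2 : 0;
        if (n == 0 || n > len) return 0;
        uint32_t v = s[0] & (0x7Fu >> n);                     /* lead-byte bits */
        for (size_t i = 1; i < n; i++) {
            if ((s[i] & 0xC0) != 0x80) return 0;              /* bad continuation */
            v = (v << 6) | (s[i] & 0x3Fu);
        }
        *cp = v;
        return n;
    }

    /* Encode one codepoint; buf must have room for 4 bytes. */
    static size_t utf8_encode(uint32_t cp, unsigned char *buf)
    {
        if (cp < 0x80)    { buf[0] = (unsigned char)cp; return 1; }
        if (cp < 0x800)   { buf[0] = (unsigned char)(0xC0 | (cp >> 6));
                            buf[1] = (unsigned char)(0x80 | (cp & 0x3F)); return 2; }
        if (cp < 0x10000) { buf[0] = (unsigned char)(0xE0 | (cp >> 12));
                            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                            buf[2] = (unsigned char)(0x80 | (cp & 0x3F)); return 3; }
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }

    int main(void)
    {
        const unsigned char in[] = "héllo";   /* 'é' is two bytes in UTF-8 */
        unsigned char out[sizeof in * 4];
        size_t i = 0, o = 0, n;
        uint32_t cp;
        while (in[i] && (n = utf8_decode(in + i, sizeof in - 1 - i, &cp)) != 0) {
            i += n;
            o += utf8_encode(fold_codepoint(cp), out + o);
        }
        out[o] = 0;
        puts((const char *)out);              /* prints "HéLLO" */
        return 0;
    }
    ```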


      > Not sure what one does with codepoints above U+FFFF, where stuff like emoji lives — maybe there are no alphabets up there?

      There are (Deseret, Osage, and a number of historic scripts live in the supplementary planes), so this is not a “Unicode-complete” solution, but probably good enough for many use cases.


        Additionally, there are a bunch of late additions and corrections in the CJK blocks; 叱 (U+53F1) vs. 𠮟 (U+20B9F) is a notable example.
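
        And because U+20B9F sits outside the BMP, it reaches a UTF-16 API as a surrogate pair, so a single-wchar_t table lookup never sees the real codepoint. A quick illustration of the math:

        ```c
        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint32_t cp = 0x20B9F;                           /* 𠮟, beyond the BMP */
            uint32_t v  = cp - 0x10000;
            uint16_t hi = (uint16_t)(0xD800 | (v >> 10));    /* lead surrogate  */
            uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));  /* trail surrogate */
            printf("U+%04X -> 0x%04X 0x%04X\n",
                   (unsigned)cp, (unsigned)hi, (unsigned)lo);
            return 0;                  /* prints: U+20B9F -> 0xD842 0xDF9F */
        }
        ```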


        UTF-16 is a good choice given where the code comes from: Wine, which has to deal with the Windows API, which is mostly UTF-16.


          This was for a Windows project; the platform API is entirely UTF-16. A UTF-8 version would probably require multi-level lookup tables, but those should be compressible along the same lines (see the sketch below).
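
          A sketch of the usual two-stage layout, with hypothetical stage1/stage2 tables (the real ones would be generated from the Unicode data): the codepoint splits into a block index and an offset, so identical 256-entry blocks are stored once and shared, which is what makes the full 0x110000-codepoint range compress well.

          ```c
          #include <stdint.h>

          #define BLOCK_SHIFT 8
          #define BLOCK_SIZE  (1u << BLOCK_SHIFT)

          /* Hypothetical generated tables: stage1 maps a codepoint's high bits
           * to a block id; stage2 holds the deduplicated 256-entry blocks. */
          extern const uint16_t stage1[0x110000 >> BLOCK_SHIFT];
          extern const uint16_t stage2[][BLOCK_SIZE];

          static uint16_t lookup(uint32_t cp)
          {
              return stage2[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1u)];
          }
          ```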