1. 3
  1. 1

    The article is a gentle intro to the perils of various Unicode encodings.

    Borderline too gentle. Because of combining characters, emoji are not the only case where a single character can consist of multiple code points. Depending on Unicode normalization, that assumption already breaks down in the first French example: those 46 bytes will become 50 bytes on a Mac.

    French combining character example:

    printf 'e\xcc\x81cole\n'
    1. 3

      Yes, assuming that one code point equals one user-perceived character for everything except emoji is a very odd choice. Very long sequences of combining diacritics are possible, and I don’t think their approach works for them. It feels very much like an English-only speaker’s view of Unicode.
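
To make the point above concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It compares the composed (NFC) and decomposed (NFD) spellings of "école" from the `printf` example, then stacks several combining accents on one base letter. The variable names and the choice of `q` as the base letter are illustrative assumptions, not anything from the article.

```python
import unicodedata

# "école" in decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# the same bytes as the printf example (e, 0xCC 0x81, c, o, l, e).
decomposed = "e\u0301cole"
composed = unicodedata.normalize("NFC", decomposed)

# Both strings render as the same five user-perceived characters,
# but they differ in code point count and in UTF-8 byte count.
nfc_points, nfc_bytes = len(composed), len(composed.encode("utf-8"))
nfd_points, nfd_bytes = len(decomposed), len(decomposed.encode("utf-8"))
print("NFC:", nfc_points, "code points,", nfc_bytes, "bytes")  # 5 code points, 6 bytes
print("NFD:", nfd_points, "code points,", nfd_bytes, "bytes")  # 6 code points, 7 bytes

# Normalization round-trips between the two spellings.
roundtrip_ok = unicodedata.normalize("NFD", composed) == decomposed

# One base letter with three stacked combining accents (illustrative choice).
# It is displayed as a single character, yet NFC cannot collapse it to a
# single code point, so counting code points overcounts.
stacked = "q" + "\u0301" * 3
stacked_nfc = unicodedata.normalize("NFC", stacked)
print("stacked:", len(stacked_nfc), "code points for one displayed character")
```

Counting user-perceived characters properly needs grapheme cluster segmentation, which `unicodedata` does not provide, so the sketch only shows that code point counts and byte counts disagree with what a reader sees.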