1. 3

  2. 2

    Great article. A few other things:

    • Most English speakers will say UTF-8 should be the default because it stores almost everything they write in one byte per character. Most other languages that use Latin characters will average a bit over one byte per character (accented characters are typically two bytes, though if the accent is a combining diacritic then you may end up using three). Most emoji are four bytes, though a few older symbols in the Basic Multilingual Plane are three. If you have customers using CJK languages, they will average around three bytes per character with UTF-8 but only two with UTF-16 (where most emoji are also four bytes). UTF-8 is not a space optimisation for everyone, and it’s slightly more complex to process because you have four length cases per character instead of two.
    • The article talks about code points versus bytes, but code points are not glyphs. Unicode has very complex rules for finding grapheme clusters, and those are typically what you care about when doing anything human-facing.
    • Unicode collation is terrifyingly complex. Even the differences between French and German (which use almost identical character sets) are significant, and there are even some between different countries that nominally use the same language. This is one of my biggest objections to shoving emoji into Unicode. How should different countries’ flags be sorted, or a happy face relative to poop? The question doesn’t make sense, because they are pictures, not characters.
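
    To make the first two points concrete, here is a rough Python sketch. The `encoded_sizes` helper and the sample strings are my own illustrative choices, not anything from the article. It counts code points, UTF-8 bytes, and UTF-16 bytes for a few short strings, including one where the accent is a separate combining code point.

```python
import unicodedata


def encoded_sizes(text: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    return (
        len(text),                      # Python's len() counts code points
        len(text.encode("utf-8")),
        len(text.encode("utf-16-le")),  # "-le" avoids counting a 2-byte BOM
    )


# Hypothetical sample strings, chosen only to illustrate each case.
samples = {
    "English": "hello",
    "French, precomposed accent": "caf\u00e9",
    "French, combining accent": unicodedata.normalize("NFD", "caf\u00e9"),
    "Japanese": "\u65e5\u672c\u8a9e",
    "Emoji": "\U0001F600",
}

for label, text in samples.items():
    code_points, utf8_bytes, utf16_bytes = encoded_sizes(text)
    print(
        f"{label}: {code_points} code points, "
        f"{utf8_bytes} UTF-8 bytes, {utf16_bytes} UTF-16 bytes"
    )
```

    The two French strings render identically but differ in both code point count and byte count, which is why neither number is a safe stand-in for what a reader sees as one character. The standard library has no grapheme cluster segmentation, so counting those would need a third-party library.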