All else aside, I greatly appreciate reading (English) musings on Unicode by people who know enough non-Western languages to have some experience with its rough edges. I have an intellectual understanding that all these issues exist, but English is almost slavishly served by most modern software, Unicode support or no, and my college Spanish is hardly a significant deviation in this respect.
I too really enjoyed this. I have a fairly muddled understanding of the issues here, but am delighted to learn more, and even more delighted that non-Latin scripts are finally getting first-class treatment.
Great read! EGCs sound like a very useful concept, never heard of them before. Impressive to see that one language’s stdlib gets the abstraction right, but I wonder if the EGC is always unambiguous between languages that share a script.
And the answer seems to be “yeah, no, it’s complicated”.
Yes, by default it’s unambiguous; the default EGC rules are defined independently of any particular locale, although the standard does allow locale-specific tailoring on top of them.
I wonder if another good reason for Rust’s strings not having the algorithms for working with EGCs built in is that the function for turning a list of codepoints into a list of EGCs changes every time a new Unicode standard comes out and adds to the list of which codepoints count as combiners? Ideally you don’t want to couple “get support for latest Unicode” with “update the language”, so it’s much easier to get it from a library/crate that can easily be upgraded.
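To make the "depends on the Unicode version" point concrete, here is a rough Python sketch. It is not the real EGC algorithm (UAX #29 also covers Hangul syllables, emoji ZWJ sequences, regional indicators and more); it only glues combining marks onto the preceding code point. The helper name `naive_clusters` is made up for illustration. What matters is that the answer comes from the Unicode data tables bundled with the runtime, which `unicodedata.unidata_version` reports.

```python
import unicodedata


def naive_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified illustration only, NOT full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns the canonical combining class;
        # a nonzero value means the code point attaches to the previous one.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Which code points count as combining depends on these bundled tables.
print("Unicode data version:", unicodedata.unidata_version)

# "e" + COMBINING ACUTE ACCENT + "a": three code points, two clusters.
print(naive_clusters("e\u0301a"))
```

A code point assigned in a newer Unicode version than the bundled tables would be treated as unassigned here, which is the kind of coupling between language release and Unicode release that a separately versioned library avoids.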
I know some people look down on languages, like Rust, that don’t have a large standard library, but issues like this are a great reason to keep some things out of a stdlib. Rust has taken it to quite the extreme, but I still like the choice. Security is another consideration. I believe it was last year that the Python core devs (or maybe it was the PSF? Not sure…) voted to keep requests out of the standard library, and one of the primary reasons was so that it could be more agile in responding to security vulnerabilities.
Good point. That would be a very good reason.
Strings are one of those datatypes that developers use so often that they think they understand them… but they don’t. Strings are complicated.
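A small Python illustration of why, as a sketch rather than anything definitive: the same visible "é" can be stored as one code point or two, and its "length" changes depending on whether you count code points or UTF-8 bytes. The variable names are just for this example.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Code point counts differ even though both render as one character.
print(len(composed), len(decomposed))

# UTF-8 byte counts differ again.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Naive equality fails; comparing after NFC normalization succeeds.
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```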
If you want to read more about how Unicode works, I broke it down in detail in a blog post here: https://www.bignerdranch.com/blog/unicode-and-utf-8-explained/
It’s got some Elixir-specific points, but it’s mostly about Unicode and UTF-8.
Selection and backspacing on Unicode text are two gotchas the OP describes that I neglected to mention.
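For the backspacing gotcha, a minimal Python sketch of what goes wrong when "delete the last character" is implemented as "drop the last code point". Whether stripping only the accent is actually the wrong behaviour depends on the editor and the script, so treat this as an illustration of the mismatch and not as a rule.

```python
text = "cafe\u0301"   # "café" with the final é stored as e + combining accent

# Dropping the last code point removes only the accent, not the whole "é".
after_backspace = text[:-1]

print(len(text))         # number of code points in the original
print(after_backspace)   # the accent is gone but the "e" remains
```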