1. 62

The original title cannot be submitted as-is because lobste.rs disallows emojis in titles

  1.  

  2. 20

    lobsters doesn’t allow emojis in titles because it can’t figure out how long they are…

    1. 7

      Interestingly, the length of the string <emoji> is still 7 :-)
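
      A minimal sketch of where a 7 can come from, assuming the emoji in question is the article's facepalm sequence (U+1F926 U+1F3FC U+200D U+2642 U+FE0F) and that "length" means UTF-16 code units. The helper names are made up for illustration; only the standard library is used.

      ```rust
      // The facepalm emoji, spelled out as explicit scalar values (an assumption
      // about which emoji the comment refers to).
      fn facepalm() -> &'static str {
          "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}"
      }

      // Length in UTF-8 code units (bytes); this is what `str::len` reports.
      fn utf8_bytes(s: &str) -> usize {
          s.len()
      }

      // Length in UTF-16 code units, the unit a UTF-16 string type counts in.
      fn utf16_units(s: &str) -> usize {
          s.encode_utf16().count()
      }

      // Number of Unicode scalar values.
      fn scalar_values(s: &str) -> usize {
          s.chars().count()
      }

      fn main() {
          let s = facepalm();
          println!("UTF-8 bytes:   {}", utf8_bytes(s)); // 17
          println!("UTF-16 units:  {}", utf16_units(s)); // 7
          println!("scalar values: {}", scalar_values(s)); // 5
      }
      ```

      Counting extended grapheme clusters (which would give 1) is not in Rust's standard library, so it is left out here.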

      1. 8

        Which makes the title very confusing :)

      2. 1

        Should’ve used Rust I guess

      3. 13

        This was an unexpectedly entertaining read all the way through. In-depth discussion of the various trade-offs around different Unicode encodings, by someone who seems to know what they’re talking about.

        The table at the end comparing information density by language for various encodings gives some indication of the scope of the Unicode project (and it’s fun to play with). Cyrillic scripts seem to come off rather poorly there.

        (Also, this isn’t Python-specific; I suggest removing the tag.)

        1. 5

          I added python because there has been some controversy around its unicode type in py3, and this is the clearest argument I’ve seen why it’s a bad idea.

          I didn’t add the tag because it’s specific to Python, but because I believe it’s especially interesting for Python devs.

          1. 3

            Man you freaked me out. I was thinking “I don’t remember posting that!”. (See usernames)

          2. 7

            Just tested this in Julia, and it counts extended grapheme clusters by default just like Swift. Personally, I prefer this behavior since it’s a lot more consistent, predictable, and approachable. If I ask for character count, I want character count. However, if I want byte count, I will ask for byte count instead.

            1. 6

              🤦🏼‍♂️

              1. 2

                I’m sad that JS got this wrong (and yes it did get this wrong) because it was a recent decision AND it is a permanent one. It isn’t just length but string iterability itself that we’ve forever tied to the code point count.

                1. 1

                  wow, it’s almost as if encoding images in a standard that was intended for writing systems causes a lot of problems and cannot be implemented in a consistently sane way; who would have guessed

                  1. 13

                    No.

                    Almost every time people say “emoji make unicode more complex” they’re wrong. Emoji simply force programmers to deal with complexities they were able to ignore in the past, because those complexities affected languages those programmers were less exposed to. Most of the complexity in handling emoji in unicode is existing complexity that affects other scripts.

                    This is not to say that emoji haven’t brought in their own complexity (the way both country and non-country regional flags are handled is a novel system, for example). But 99% of the time emoji handling is problematic in your code – congratulations, your code probably breaks on other languages as well. This is true for basically everything in the above article.

                    1. 5

                      > Almost every time people say “emoji make unicode more complex” they’re wrong. Emoji simply force programmers to deal with complexities they were able to ignore in the past

                      The problems encountered adding emoji support to Emacs have been almost all centered around moving from the completely sensible “text should be monochrome” assumption to “fonts can do colors now for some reason”, and it’s been a huge headache.

                      It’s frustrating to see all this effort spent that could have been used to solve real problems.

                      1. 3

                        Yeah, colors are another new thing emoji have brought in. Font code is complicated a ton by them.

                        But in the context of this post, nothing is emoji-specific here.

                    2. 9

                      It’s worth pointing out that these problems would exist even if emoji did not exist. “Extended grapheme clusters” exist to deal with the complexity of human orthography and are not solely an emoji thing.

                      I don’t really see a single issue that this article explores whose root cause is “emoji in Unicode”, to be honest.

                    3. 1

                      This was extremely educational. I do however take issue with this:

                      > but random access by scalar value is in the YAGNI department

                      Whenever I slice a string, I’m making use of scalar values. It’s really not that unusual to have a fixed width of characters at the start of a string.

                      1. 10

                        Slicing doesn’t require random access. It can be done by saving an iteration point.

                        Rust strings have random access only by byte offsets, but at the same time only allow UTF-8-correct slicing. You iterate over valid codepoint offsets given in bytes, and then use these offsets for correct sub-slicing.
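
                        A rough sketch of that pattern, using only the standard library. `prefix_chars` is a hypothetical helper name, not something from the article: it walks `char_indices` to find the byte offset of the n-th scalar value, then slices at that offset.

                        ```rust
                        /// Returns the prefix of `s` holding at most `n` Unicode scalar values.
                        /// (Hypothetical helper, for illustration only.)
                        fn prefix_chars(s: &str, n: usize) -> &str {
                            // `char_indices` yields (byte_offset, char) pairs; the offset of the
                            // n-th char is exactly where the prefix of n chars ends.
                            match s.char_indices().nth(n) {
                                Some((byte_offset, _)) => &s[..byte_offset],
                                None => s, // fewer than n + 1 chars: the whole string is the prefix
                            }
                        }

                        fn main() {
                            let s = "héllo wörld";
                            println!("{}", prefix_chars(s, 5)); // héllo

                            // Byte offset 2 falls inside the two-byte 'é', so a checked slice
                            // there is refused rather than producing invalid UTF-8.
                            println!("{}", s.is_char_boundary(2)); // false
                            println!("{:?}", s.get(..2)); // None
                        }
                        ```

                        Finding the offset is a linear scan, but once the offset is saved, slicing at it is constant time.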

                      2. 0

                        It is wrong, at least in a high-level language. The identity of a character is not how it is encoded. A string should be a sequence of characters. When one wants to deal with its encoding, one should be dealing with a sequence of bytes instead. Common Lisp gets it right. Most other high-level languages follow the C way, where a string is a sequence of short ints, and have to choose between speed and correctness for things like charAt.

                        Informative post nonetheless

                        1. -2

                          Perl 6 gets it right, fwiw:

                          > perl6 -e 'say "🤦".chars' 
                          1
                          
                          1. 3

                            technically the wrong emoji :) curious what it says if you copy and paste the one from the article.

                            1. 4

                              Must have been a copy paste error of some sort – an easy one to make since I don’t actually have proper emojis in my terminal (it looks like this)

                              Perhaps this is correct-er?

                              > "🤦🏼‍♂️".chars
                              1
                              
                              1. 2

                                Thank you for taking the time to do that :).

                            2. 3

                              Except, of course, that the whole point of the article is that treating this behavior as “right” is an oversimplification. It is unstable, it requires tons of extra machinery, and it’s not helpful in most use cases.

                              1. 1

                                I mean if chars means symbols I guess it’s pretty accurate.

                              2. 2

                                So a character is an extended grapheme cluster?