1. 50
    1. 15

      This question is canonically and comprehensively answered by UTS#51:

      https://www.unicode.org/reports/tr51/

      (That said, having read and implemented many Unicode-related standards, I’m fully sympathetic to an approximation of the correct answer that saves an order of magnitude in complexity…)

    2. 12

      This article touches on the right solution but never quite gets there: grapheme clusters. If you want to know that a string is exactly one emoji, the first thing you need to know is that it is exactly one grapheme. Once you’ve used your text library to get the graphemes and there is only one, checking that it contains something to make it an emoji is much simpler. \p{Emoji_Presentation} covers a lot of it, and checking for the emoji selector codepoint covers most of what’s left. You don’t have to worry about “what if parts of it are emoji and other parts aren’t” because you already know it is one grapheme, so if anything in there makes it an emoji then the whole thing must be a single emoji.
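
      The first step described above, counting graphemes, can be sketched in JavaScript roughly as follows. This assumes a runtime that ships `Intl.Segmenter` (recent Node.js and browsers); the helper name `countGraphemes` is made up for illustration.

      ```javascript
      // Count user-perceived characters (extended grapheme clusters) in a string.
      // Assumption: the runtime provides Intl.Segmenter.
      function countGraphemes(text) {
        const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
        // segment() returns an iterable with one entry per grapheme cluster.
        return [...segmenter.segment(text)].length;
      }

      // A flag is two regional indicator code points but a single grapheme.
      console.log(countGraphemes("\u{1F1F9}\u{1F1FC}"));
      ```

      A string passes the "exactly one grapheme" test when this returns 1, however many code points it contains.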

      1. 8

        I got halfway down the article before deciding that this was the correct solution with one minor tweak: why do you need the emoji check at all? Why not allow anything that is a single grapheme? There are quite a few glyphs that might be useful as category identifiers that are not emoji.

        Even restricting it to emoji, the system is going to be painful for users. It will suffer from the worst kind of ontology drift, because people will use subtly different glyphs, and it will be very hard to search unless you allow searching by the name of the combined character (in which case, you might consider using words: there’s a reason hieroglyphs died out).

      2. 4

        Good call! I’ll update the article.

        I just tried that out on the emoji data set I’m using. Every glyph was correctly identified as 1 grapheme, but 229 out of 3,664 failed the regex check /\p{Emoji_Presentation}/u. Emoji like ☺️, ☹️, ☠️, 👁️‍🗨️.

        When changed to /\p{Emoji}/u, every emoji passes!

        Thanks!

        1. 3

          \p{Emoji} is too broad, as it will also match, for example, the grapheme 1.
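
          A quick way to see the difference, assuming a JavaScript engine with Unicode property escapes (the variable names are only illustrative):

          ```javascript
          // \p{Emoji} is true for any code point that can take part in an emoji,
          // which includes the ASCII digits because they appear in keycap sequences.
          const digitIsEmoji = /\p{Emoji}/u.test("1");

          // \p{Emoji_Presentation} is true only for code points that are drawn
          // as emoji by default.
          const digitIsEmojiPresentation = /\p{Emoji_Presentation}/u.test("1");
          const faceIsEmojiPresentation = /\p{Emoji_Presentation}/u.test("\u{1F600}");

          console.log(digitIsEmoji, digitIsEmojiPresentation, faceIsEmojiPresentation);
          // true false true
          ```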

          1. 3

            Ah! Good to know.

            1. 4

              \p{Emoji} basically means “could be part of an emoji”, whereas \p{Emoji_Presentation} means “presents as an emoji”. But as you found, some emoji contain zero codepoints that present as an emoji by themselves, so they need to be detected in other ways, such as checking for the emoji switch codepoint.

              1. 8

                Checking for either the VS16 variation selector or the Emoji_Presentation character class matches the whole test set. Huzzah!
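
                Putting the thread's pieces together, the check might look something like the sketch below. It assumes `Intl.Segmenter` is available; `isSingleEmoji` and `VS16` are illustrative names rather than anything from the article, and the function is only as strict as the two tests named above.

                ```javascript
                const VS16 = "\uFE0F"; // VARIATION SELECTOR-16, requests emoji presentation

                // True when the input is exactly one grapheme AND it either contains
                // VS16 or contains a code point with the Emoji_Presentation property.
                function isSingleEmoji(text) {
                  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
                  const graphemes = [...segmenter.segment(text)];
                  if (graphemes.length !== 1) return false;
                  return text.includes(VS16) || /\p{Emoji_Presentation}/u.test(text);
                }

                console.log(isSingleEmoji("\u263A\uFE0F")); // smiling face followed by VS16
                console.log(isSingleEmoji("1"));            // plain digit
                ```

                Doing the grapheme count first is what lets the second step be this loose: once the input is known to be one grapheme, any emoji marker inside it applies to the whole thing.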

    3. 9

      “Indian Devanagari language”

      I would call Devanagari a writing system rather than a language.

      “if you just so happen to be in a country that recognizes Taiwan’s sovereignty”

      The US does not recognize Taiwan’s sovereignty, but we do show their flag. It’s about as complicated as emoji themselves. :-)

      1. 3

        Plenty of non-sovereign territories have emoji flags, such as the BIOT and the Canary Islands.

    4. 5

      Love this topic!

      To pile on with another unfair nitpick:

      Your emoji picker includes country flags, but the Unicode Consortium doesn’t want to take sides on whether Taiwan is a real country.

      The valid emoji flag sequences are in fact enumerated in the CLDR, but “depictions of images for flags may be subject to constraints by the administration of that region.”

      1. 2

        Thanks! I’ve updated the post accordingly.

        I’m having trouble telling whether the valid country codes were always specified by Unicode, or whether the situation changed at some point.

        1. 2

          Aren’t they based on the 2-character ISO code?

          I.e. the decision of what is a country or a region is fobbed off to ISO, and whether it has a flag or not is up to the character set creator.

          1. 2

            Can’t pull a link because I’m on the subway, but the standard makes explicit that they’re not based on ISO 3166, but should be thought of as a 26x26 array that happens to be letter-indexed :shrug:

            1. 1

              I mean, it depends on what you mean by “based on”. More people know about ISO 3166, because it’s the basis for top-level country domains, than about the Common Locale Data Repository. If, say, Scotland gains independence and gets its own 2-letter code, the CLDR will (probably) follow the one assigned by ISO 3166, because otherwise how would the Saltire be represented?
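
              Whatever the code is “based on”, the letter-indexed encoding itself is mechanical: each letter A to Z maps to one of the 26 regional indicator symbols starting at U+1F1E6. A small sketch (the function name is hypothetical, and it does not check whether the pair is a flag sequence that CLDR actually lists):

              ```javascript
              const REGIONAL_INDICATOR_A = 0x1F1E6; // REGIONAL INDICATOR SYMBOL LETTER A

              // Map a two-letter uppercase code such as "TW" to its pair of
              // regional indicator symbols. No validity check against CLDR is done.
              function regionCodeToFlag(code) {
                if (!/^[A-Z]{2}$/.test(code)) {
                  throw new RangeError("expected two uppercase ASCII letters");
                }
                return [...code]
                  .map((letter) =>
                    String.fromCodePoint(REGIONAL_INDICATOR_A + letter.charCodeAt(0) - 65))
                  .join("");
              }

              console.log(regionCodeToFlag("TW"));
              ```

              All 26x26 combinations are well-formed pairs; whether a given pair renders as a flag depends on the font and platform.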

        2. 1

          And the real issue with flag codes is the parity check: in a long uninterrupted string of flag characters, you need to count all the way back to the start of the run to figure out if a given character is the first or second of its pair.
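
          To make the parity problem concrete, here is a rough sketch that works on an array of code points. The names `isRegionalIndicator` and `flagPairPosition` are invented for this example.

          ```javascript
          // Regional indicator symbols occupy U+1F1E6 through U+1F1FF.
          function isRegionalIndicator(codePoint) {
            return codePoint >= 0x1F1E6 && codePoint <= 0x1F1FF;
          }

          // Returns "first" or "second" for a regional indicator at `index`,
          // or null if that code point is not a regional indicator.
          function flagPairPosition(codePoints, index) {
            if (!isRegionalIndicator(codePoints[index])) return null;
            // Walk backwards to count how many regional indicators precede this one
            // without interruption; the cost grows with the length of the run.
            let preceding = 0;
            for (let i = index - 1; i >= 0 && isRegionalIndicator(codePoints[i]); i--) {
              preceding++;
            }
            return preceding % 2 === 0 ? "first" : "second";
          }

          // "x" followed by the TW and US flags: five code points in total.
          const sample = Array.from("x\u{1F1F9}\u{1F1FC}\u{1F1FA}\u{1F1F8}", (ch) => ch.codePointAt(0));
          console.log(flagPairPosition(sample, 3));
          ```

          The backward scan is the point: nothing about a regional indicator by itself says which half of a pair it is.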

    5. 3

      “without encouraging users to spiral into reinvent-Dewey-Decimal territory”

      For those who do want to spiral into “reinvent-Dewey-Decimal territory”, check out the “Melvil Decimal System” from LibraryThing[^mds] and, specifically on topic for this article, Dewmoji[^dewmoji].

      Dewmoji assigns one emoji per Dewey Decimal class, so:

      “Dewey: 15” (Philosophy and Psychology -> Psychology) becomes 💭😌 .

      “Dewey: 94” (History and Geography -> Europe) becomes 🗺️🏰 .

      “Dewey: 005” (Information -> Computing and Information -> Computer programming, programs, data, security) becomes ℹ️💻💿 .

      [^mds] https://www.librarything.com/mds#

      [^dewmoji] https://blog.librarything.com/2016/07/introducing-dewmojis/

    6. 2

      Why even limit what can be used as a tag? What if I want to tag with 2 emojis? What if I want to tag with a sequence of non-emoji characters? What if I want to embed non-printable characters because my network monitor will turn on a disco ball if it sees the perfect sequence of bytes? Why is this being validated at all? What if I have a special font with my own custom emojis?

      The obsession with validation is a plague. User input should be taken as opaque blobs. My favorite is url/email validation. “a@b” is a perfectly valid email on my LAN. “http://b” is also a valid URL that will resolve on my LAN, but most validators would fail because there’s no TLD.

      While the problem is no doubt interesting, why is your software even bothering to do it? I must have missed that point.

    7. 1

      I have been developing a Unicode library (libgrapheme) for a few years now which, in particular, supports grapheme segmentation of UTF-8 strings. It’s a freestanding (i.e. no stdlib) library, so you can compile it into wasm and use it from there without having to rely on browser functions.