1. 35
  1.  

  2. 4

    Thanks so much for this tool, I love having a command line utility to query the unicode database!

    1. 3

      This seems actually useful; it could replace a good part of my web searches that inevitably end with me searching for the copy-and-pasteable symbol within a fileformat.info result.

      The searching seems to need some tweaks, though. E.g. looking for a regular smiley, none of “smile”, “smiley”, “happy” give the wanted result, while “face” lists too many. It turns out the right search word is “smiling”, but maybe there should be some form of aliases?
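      A minimal sketch of why “smile” misses while “smiling” hits, assuming the search is a plain case-insensitive substring match on the official character names (an assumption for illustration, not a statement about how uni implements it). The helper name search_names is hypothetical:

```python
import unicodedata


def search_names(term, codepoints):
    """Return (char, name) pairs whose Unicode name contains term, ignoring case."""
    term = term.upper()
    hits = []
    for cp in codepoints:
        # Unnamed code points get an empty name and never match a non-empty term.
        name = unicodedata.name(chr(cp), "")
        if term in name:
            hits.append((chr(cp), name))
    return hits


# U+263A is officially named WHITE SMILING FACE, so "smile" is not a substring of it.
candidates = [0x263A]
for term in ("smile", "smiley", "happy", "smiling"):
    print(term, search_names(term, candidates))
```

      An alias table would have to sit on top of a match like this, since the names themselves never contain the everyday words.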

      I also had trouble with the regular red heart, but that may be of a different kind?

      $ echo "❤️ " | ~/go/bin/uni identify
           cpoint  dec    utf-8       html       name
      '❤'  U+2764  10084  e2 9d a4    ❤   HEAVY BLACK HEART (Other_Symbol)
      '◌️'  U+FE0F  65039  ef b8 8f    ️   VARIATION SELECTOR-16 (Nonspacing_Mark)
      

      How would I find this using search?

      Regarding search, some more ideas:

      • looking up by emoticon, e.g., uni identify "(:" or uni identify "<3"
      • looking up by short code, e.g. uni identify :heart: (are these standardized?)

      And a bit of a bug regarding that other stdin UX thread:

      $ echo "" | ~/go/bin/uni identify
      $ i: reading from stdin...
      
      1. 2

        The searching seems to need some tweaks, though. E.g. looking for a regular smiley, none of “smile”, “smiley”, “happy” give the wanted result, while “face” lists too many. It turns out the right search word is “smiling”, but maybe there should be some form of aliases?

        Yeah, adding more search terms is marked as “TODO” in the code. It’s a bit tricky, as it’s very easy to get way too many matches and/or pollute the output with a lot of keywords, which isn’t useful either. This is one reason I worked on a GUI emoji picker based on this code last week, but I had a lot of problems getting GTK to show ZWJ sequences well, so I kind of gave up on that for now. Basically, I’m running into the limitations of dmenu’s plain text filtering.

        I rarely use uni e <search> by the way, but instead use the “emoji-common” groups from dmenu-uni which reduces the number of emojis to a more manageable number (from about 1600 to 200).

        I also had trouble with the regular red heart, but that may be of a different kind? [..] How would I find this using search?

        Just in case this wasn’t clear (and the documentation should probably make this a bit clearer): the print, search, and identify commands work only on codepoints. They have no concept of multiple codepoints combining to form a single character (or “grapheme”, if you wish). I basically use identify mostly as a “Unicode-aware hexdump -C”.
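        To illustrate the codepoint-level view, here is a rough sketch in Python (not uni’s own code; the function name identify is only borrowed for readability) that walks a string one codepoint at a time, the way the identify output earlier in the thread does for the red heart:

```python
import unicodedata


def identify(text):
    """List (code point, UTF-8 bytes, name) for every code point in text."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            ch.encode("utf-8").hex(" "),  # bytes.hex(sep) needs Python 3.8+
            unicodedata.name(ch, "<unnamed>"),
        ))
    return rows


# The red heart emoji is two code points: the heart plus a variation selector.
red_heart = "\u2764\ufe0f"
for row in identify(red_heart):
    print(*row, sep="  ")
```

        What looks like one character on screen comes out as two rows, which is why a per-codepoint search cannot find it under a single name.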

        At any rate, it shows up with e.g. uni emoji heart, or uni emoji 'red heart' for an exact match. It's a bit hidden in there, because apparently we need hearts in 20 shapes and colours 🤷‍♂️ You have the same when you type :heart in e.g. WhatsApp, but because the emojis are shown in colour and quite large it’s reasonably obvious. This is again kind of running into the limits of what you can do with this kind of plain text search.

        1. 2

          HEAVY BLACK HEART is the name of the red heart, as it was named as such before emoji gained color. For older Unicode characters (before color), “white” means outlined and “black” means filled in.
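          A quick way to see the white/black naming convention is to print the official names of a few heart characters from Python’s standard unicodedata module (the choice of these three code points is just for illustration):

```python
import unicodedata

# U+2661 is the outlined heart, U+2665 the filled one, U+2764 the "red heart" base.
hearts = "\u2661\u2665\u2764"
names = {f"U+{ord(ch):04X}": unicodedata.name(ch) for ch in hearts}

for cpoint, name in names.items():
    print(cpoint, name)
```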

          1. 1

            The search problem is pretty tough to solve, as some of the Unicode descriptions use a particular English dialect, for instance:

            $ uni s poop
            no matches
            

            damn British! :)

            One possible solution would be to augment the descriptions with information from another free source, like Wikipedia.

            1. 3

              That’s actually specified in the Unicode CLDR (“Common Locale Data Repository”):

              $ grep poop en.xml
              <annotation cp="💩">dung | face | monster | pile of poo | poo | poop</annotation>
              

              It contains many useful aliases, for example for the pirate flag:

              <annotation cp="🏴‍☠️">Jolly Roger | pirate | pirate flag | plunder | treasure</annotation>
              

              I just haven’t added support for that.
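              As a sketch of what that support might look like (an illustration only, not code from uni), the annotation lines can be turned into a keyword index with Python’s standard XML parser. The inline SAMPLE stands in for en.xml and wraps the two lines above in assumed ldml/annotations elements; build_alias_index is a hypothetical name:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

POO = "\U0001F4A9"                           # pile of poo
PIRATE_FLAG = "\U0001F3F4\u200d\u2620\ufe0f"  # black flag + ZWJ + skull and crossbones + VS16

# Stand-in for the CLDR file; only the two annotations quoted above.
SAMPLE = f"""<ldml><annotations>
<annotation cp="{POO}">dung | face | monster | pile of poo | poo | poop</annotation>
<annotation cp="{PIRATE_FLAG}">Jolly Roger | pirate | pirate flag | plunder | treasure</annotation>
</annotations></ldml>"""


def build_alias_index(xml_text):
    """Map each lower-cased keyword to the list of cp values annotated with it."""
    index = defaultdict(list)
    for node in ET.fromstring(xml_text).iter("annotation"):
        for keyword in (node.text or "").split("|"):
            index[keyword.strip().lower()].append(node.get("cp"))
    return index


index = build_alias_index(SAMPLE)
print(index["poop"], index["jolly roger"])
```

              Note that the cp attribute can hold a whole sequence (the pirate flag is four codepoints), so an index like this works on sequences rather than on single codepoints.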

              1. 1

                oh very cool

          2. 3

            FWIW, the built-in “describe-char” function in Emacs is quicker to use, and will bring up a buffer with all of this information and more. I suppose it depends on the use case which is more convenient.

            Here’s an example where the font I use in Emacs doesn’t support the glyph:

                         position: 199390 of 199390 (100%), column: 7
                        character: 🌍 (displayed as 🌍) (codepoint 127757, #o371415, #x1f30d)
                          charset: unicode (Unicode (ISO10646))
            code point in charset: 0x1F30D
                           script: symbol
                           syntax: w 	which means: word
                         category: .:Base
                         to input: type "C-x 8 RET 1f30d" or "C-x 8 RET EARTH GLOBE EUROPE-AFRICA"
                      buffer code: #xF0 #x9F #x8C #x8D
                        file code: #xF0 #x9F #x8C #x8D (encoded by coding system utf-8-unix)
                          display: no font available
            
            Character code properties: customize what to show
              name: EARTH GLOBE EUROPE-AFRICA
              general-category: So (Symbol, Other)
              canonical-combining-class: 0 (Spacing, split, enclosing, reordrant, and Tibetan subjoined)
              bidi-class: ON (Other Neutrals)
              decomposition: (127757) ('🌍')
              mirrored: N
            

            And here’s an example where it does:

                         position: 4 of 4 (75%), column: 4
                        character: 😈 (displayed as 😈) (codepoint 128520, #o373010, #x1f608)
                          charset: unicode (Unicode (ISO10646))
            code point in charset: 0x1F608
                           script: symbol
                           syntax: w 	which means: word
                         category: .:Base
                         to input: type "C-x 8 RET 1f608" or "C-x 8 RET SMILING FACE WITH HORNS"
                      buffer code: #xF0 #x9F #x98 #x88
                        file code: #xF0 #x9F #x98 #x88 (encoded by coding system utf-8-unix)
                          display: by this font (glyph code)
                xfthb:-VL  -VL Gothic-normal-normal-normal-*-14-*-*-*-*-0-iso10646-1 (#x3EB0)
            
            Character code properties: customize what to show
              name: SMILING FACE WITH HORNS
              general-category: So (Symbol, Other)
              canonical-combining-class: 0 (Spacing, split, enclosing, reordrant, and Tibetan subjoined)
              bidi-class: ON (Other Neutrals)
              decomposition: (128520) ('😈')
              mirrored: N
            
            1. 1

              Wow, this is exactly what I needed. This will replace my queries to emojipedia etc. Thanks for sharing and creating this tool @arp242!

              1. 1

                Wonderful. Feature requests if they don’t already exist:

                1. 1

                  Have you considered a machine-readable output format? According to the examples, the output is a table/relation, so CSV is probably the simplest usable format. Or you could generate XML or Recfile; they are also simple to generate. XML can be read from almost any language/platform, and Recfiles can be processed in Recutils or in Relational pipes (which can also read CSV and XML).
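                  For illustration, a CSV variant of the identify table could look roughly like the sketch below. It is written in Python with the standard csv module, and the column names simply mirror the headers shown earlier in the thread; none of this is an existing uni option:

```python
import csv
import io
import unicodedata


def to_csv(text):
    """Render one CSV row per code point, with a header row first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["char", "cpoint", "dec", "utf-8", "name"])
    for ch in text:
        writer.writerow([
            ch,
            f"U+{ord(ch):04X}",
            ord(ch),
            ch.encode("utf-8").hex(" "),  # bytes.hex(sep) needs Python 3.8+
            unicodedata.name(ch, ""),
        ])
    return buf.getvalue()


print(to_csv("\u2764"), end="")
```

                  The csv module quotes any field that contains a comma or a quote, so names and characters pass through without hand-written escaping.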