This seems actually useful; it could replace a good part of my web searches, which inevitably end with me searching for the copy-and-pastable symbol within a fileformat.info result.
The searching seems to need some tweaks, though. E.g. looking for a regular smiley, none of “smile”, “smiley”, “happy” give the wanted result, while “face” lists too many. It turns out the right search word is “smiling”, but maybe there should be some form of aliases?
I also had trouble with the regular red heart, but that may be of a different kind?
$ echo "❤️ " | ~/go/bin/uni identify
     cpoint  dec    utf-8     html      name
'❤'  U+2764  10084  e2 9d a4  &#x2764;  HEAVY BLACK HEART (Other_Symbol)
'◌️'  U+FE0F  65039  ef b8 8f  &#xfe0f;  VARIATION SELECTOR-16 (Nonspacing_Mark)
How would I find this using search?
Regarding search, some more ideas:
looking up by emoticon, e.g., uni identify "(:" or uni identify "<3"
looking up by short code, e.g. uni identify :heart: (are these standardized?)
And a bit of a bug regarding that other stdin UX thread:
The searching seems to need some tweaks, though. E.g. looking for a regular smiley, none of “smile”, “smiley”, “happy” give the wanted result, while “face” lists too many. It turns out the right search word is “smiling”, but maybe there should be some form of aliases?
Yeah, adding more search terms is marked as “TODO” in the code. It’s a bit tricky as it’s very easy to get way too many matches and/or pollute the output with a lot of keywords, which isn’t useful either. This is one reason I worked on a GUI emoji picker based on this code last week: I’m basically running into the limitations of dmenu’s plain text filtering. I had a lot of problems getting GTK to show ZWJ sequences well, though, so I kind of gave up on that for now.
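To make the alias idea concrete, here is a minimal Python sketch (not how uni does it). It assumes a small hand-written alias table, ALIASES, that maps a user's search word onto the word that actually appears in the Unicode names; the table and the search function are hypothetical, and the default range is limited to the Emoticons block to keep the match count down.

```python
import unicodedata

# Hypothetical alias table: maps a user's search word to the word that
# actually appears in Unicode character names. Not part of uni.
ALIASES = {
    "smile": "smiling",
    "smiley": "smiling",
    "happy": "smiling",
}


def search(term, start=0x1F600, end=0x1F64F):
    """Return (char, name) pairs in [start, end] whose name contains term.

    The default range is the Emoticons block, to keep the output short.
    """
    needle = ALIASES.get(term.lower(), term).upper()
    results = []
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), "")  # "" for unassigned codepoints
        if needle in name:
            results.append((chr(cp), name))
    return results


if __name__ == "__main__":
    for char, name in search("smiley"):
        print(char, name)
```

Even with this tiny table the "too many matches" problem shows up quickly, since every alias widens what a single word can hit.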
I rarely use uni e <search> by the way, but instead use the “emoji-common” groups from dmenu-uni which reduces the number of emojis to a more manageable number (from about 1600 to 200).
I also had trouble with the regular red heart, but that may be of a different kind? [..] How would I find this using search?
Just in case this wasn’t clear (and the documentation should probably make this a bit clearer): the print, search, and identify commands work only on codepoints. They have no concept of multiple codepoints combining to form a single character (or “grapheme”, if you wish). I basically use identify mostly as a “Unicode-aware hexdump -C”.
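The codepoint-only behaviour is easy to reproduce with a rough Python sketch. This is only an approximation of what identify prints, built on the standard unicodedata module; the identify function below is a stand-in, not uni's code.

```python
import unicodedata


def identify(text):
    """Return one row per codepoint: (cpoint, dec, utf-8 bytes, name, category)."""
    rows = []
    for ch in text:  # iterating a str yields codepoints, not graphemes
        rows.append((
            f"U+{ord(ch):04X}",
            ord(ch),
            " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
        ))
    return rows


if __name__ == "__main__":
    # The red heart emoji is two codepoints: U+2764 followed by U+FE0F.
    for row in identify("\u2764\ufe0f"):
        print(*row, sep="\t")
```

The red heart comes back as two rows, which is why a search that works per codepoint never sees "red heart" as one thing.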
At any rate, it shows up with e.g. uni emoji heart, or uni emoji 'red heart' for an exact match. It’s a bit hidden in there, because apparently we need hearts in 20 shapes and colours 🤷‍♂️ You have the same when you type :heart: in e.g. WhatsApp, but because the emojis are shown in colour and quite large it’s reasonably obvious. This is again kind of running into the limits of what you can do with this kind of plain text search.
HEAVY BLACK HEART is the name of the red heart, since it was named before emoji gained color. For older Unicode characters (before color), “white” means outlined and “black” means filled in.
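The white/black convention can be checked against the character names themselves. A small Python sketch using the standard unicodedata module; the two pairs are just examples picked for illustration.

```python
import unicodedata

# Pairs of older symbols where the name encodes fill, not color.
for outlined, filled in [("\u2661", "\u2665"), ("\u2606", "\u2605")]:
    print(outlined, unicodedata.name(outlined))
    print(filled, unicodedata.name(filled))

# Looking up the red heart by its formal name yields U+2764.
heart = unicodedata.lookup("HEAVY BLACK HEART")
print(heart, f"U+{ord(heart):04X}")
```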
FWIW, the built-in “describe-char” function in Emacs is quicker to use, and will bring up a buffer with all of this information and more. I suppose it depends on the use case which is more convenient.
Here’s an example where the font I use in Emacs doesn’t support the glyph:
position: 199390 of 199390 (100%), column: 7
character: 🌍 (displayed as 🌍) (codepoint 127757, #o371415, #x1f30d)
charset: unicode (Unicode (ISO10646))
code point in charset: 0x1F30D
script: symbol
syntax: w which means: word
category: .:Base
to input: type "C-x 8 RET 1f30d" or "C-x 8 RET EARTH GLOBE EUROPE-AFRICA"
buffer code: #xF0 #x9F #x8C #x8D
file code: #xF0 #x9F #x8C #x8D (encoded by coding system utf-8-unix)
display: no font available
Character code properties: customize what to show
name: EARTH GLOBE EUROPE-AFRICA
general-category: So (Symbol, Other)
canonical-combining-class: 0 (Spacing, split, enclosing, reordrant, and Tibetan subjoined)
bidi-class: ON (Other Neutrals)
decomposition: (127757) ('🌍')
mirrored: N
And here’s an example where it does:
position: 4 of 4 (75%), column: 4
character: 😈 (displayed as 😈) (codepoint 128520, #o373010, #x1f608)
charset: unicode (Unicode (ISO10646))
code point in charset: 0x1F608
script: symbol
syntax: w which means: word
category: .:Base
to input: type "C-x 8 RET 1f608" or "C-x 8 RET SMILING FACE WITH HORNS"
buffer code: #xF0 #x9F #x98 #x88
file code: #xF0 #x9F #x98 #x88 (encoded by coding system utf-8-unix)
display: by this font (glyph code)
xfthb:-VL -VL Gothic-normal-normal-normal-*-14-*-*-*-*-0-iso10646-1 (#x3EB0)
Character code properties: customize what to show
name: SMILING FACE WITH HORNS
general-category: So (Symbol, Other)
canonical-combining-class: 0 (Spacing, split, enclosing, reordrant, and Tibetan subjoined)
bidi-class: ON (Other Neutrals)
decomposition: (128520) ('😈')
mirrored: N
Have you considered a machine-readable output format? According to the examples, the output is a table/relation, so CSV is probably the simplest usable format. You could also generate XML or Recfile output; both are simple to generate. XML can be read from almost any language/platform, and Recfiles can be processed with Recutils or Relational pipes (which can also read CSV and XML).
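As a rough illustration of what CSV output could look like, here is a Python sketch. The column names are borrowed from the identify output earlier in the thread; the to_csv function is hypothetical and says nothing about how uni would actually implement it.

```python
import csv
import io
import unicodedata


def to_csv(text):
    """Render one CSV row per codepoint in text, with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["cpoint", "dec", "utf-8", "name"])
    for ch in text:
        writer.writerow([
            f"U+{ord(ch):04X}",
            ord(ch),
            " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
            unicodedata.name(ch, ""),
        ])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv("\U0001F30D\U0001F608"), end="")
```

Using the csv module rather than joining on commas by hand means any field that ever contains a comma or a quote gets escaped properly.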
Thanks so much for this tool, I love having a command line utility to query the unicode database!
The search problem is pretty tough to solve, as some of the Unicode descriptions use a particular English dialect, for instance:
damn British! :)
One possible solution would be to augment the descriptions with information from another free source, like Wikipedia.
That’s actually specified in the Unicode CLDR (“Common Locale Data Repository”):
It contains many useful aliases, for example for the pirate flag:
I just haven’t added support for that.
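For anyone curious what using those annotations might involve, here is a hedged Python sketch. It assumes the CLDR annotation files consist of annotation elements with a cp attribute and "|"-separated keywords, plus a separate type="tts" entry for the spoken name. The SAMPLE data is hand-written for illustration and is not real CLDR content, and load_keywords and search are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only, written by hand in the assumed shape of a CLDR
# annotations file; the keywords here are made up, not real CLDR data.
SAMPLE = """<ldml>
  <annotations>
    <annotation cp="\U0001F608">devil | horns | smile</annotation>
    <annotation cp="\U0001F608" type="tts">smiling face with horns</annotation>
  </annotations>
</ldml>"""


def load_keywords(xml_text):
    """Map each character (or sequence) to its set of search keywords."""
    keywords = {}
    for node in ET.fromstring(xml_text).iter("annotation"):
        if node.get("type") == "tts":
            continue  # the tts entry is the spoken name, not the keyword list
        words = {w.strip() for w in (node.text or "").split("|") if w.strip()}
        keywords.setdefault(node.get("cp"), set()).update(words)
    return keywords


def search(keywords, term):
    """Return the characters whose keyword set contains term."""
    return sorted(cp for cp, words in keywords.items() if term.lower() in words)


if __name__ == "__main__":
    print(search(load_keywords(SAMPLE), "smile"))
```

Matching on whole keywords rather than substrings is one way to keep the number of hits manageable, at the cost of needing the exact word.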
oh very cool
Wonderful. Feature requests if they don’t already exist:
Wow, this is exactly what I needed. This will replace my queries to emojipedia etc. Thanks for sharing and creating this tool @arp242!