1. 16

A neat discovery: on Debian and Red Hat, Perl is present by default and provides Name.pl, a table of Unicode character codes and descriptions. Now you can quickly search for Unicode characters by code or description, simply by grepping that file; this lets you use regexes, gives a compact overview, and is faster than going to a Unicode information website. (But I’m glad those websites are there when my terminal font doesn’t display the character, for example.)

The grepping is just a one-liner: grep -i 'some character' "$(locate Name.pl | head -1)". The link goes to a script in my dotfiles that passes the result through a Python script that parses the code and adds the character itself to the line.

$ unicode apl
⌶ 02336	APL FUNCTIONAL SYMBOL I-BEAM
⌷ 02337	APL FUNCTIONAL SYMBOL SQUISH QUAD
⌸ 02338	APL FUNCTIONAL SYMBOL QUAD EQUAL
⌹ 02339	APL FUNCTIONAL SYMBOL QUAD DIVIDE
[...]
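
For anyone who doesn’t want to follow the link, that post-processing step could look roughly like the Python sketch below. It is not the dotfiles script itself: it assumes every matching line starts with a single hexadecimal code point followed by whitespace and the name (as in the output above), and the function name `annotate` is invented here for illustration.

```python
def annotate(line):
    """Prefix a 'CODE<TAB>NAME' line with the character it describes.

    Assumption: the first whitespace-separated field is one hexadecimal
    code point. Lines that don't fit that shape are returned unchanged.
    """
    line = line.rstrip("\n")
    fields = line.split(None, 1)
    if not fields:
        return line
    try:
        char = chr(int(fields[0], 16))
    except (ValueError, OverflowError):
        # Not hex, or outside the Unicode range: leave the line alone.
        return line
    return f"{char} {line}"


# Two sample lines in the same shape as the grep output above.
sample = [
    "02336\tAPL FUNCTIONAL SYMBOL I-BEAM\n",
    "02337\tAPL FUNCTIONAL SYMBOL SQUISH QUAD\n",
]
for raw in sample:
    print(annotate(raw))
```

To use it as a filter, loop over `sys.stdin` instead of `sample` and pipe the grep output into it.
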
  1.  

  2. 6

    Nice! While we’re sharing…

    Unicode has an exceptionally well-hidden but quite useful tool for exploring sets of Unicode codepoints. For example, to see all codepoints in the Greek script: https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BGreek%7D&g=&i=

    Or see all codepoints in the Greek script that aren’t in the Letter general category (well, technically, not in the Lu | Ll | Lt | Lm | Lo general categories): https://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cp%7BGreek%7D%26%5CP%7BLetter%7D&g=&i=

    You can also just grab the Unicode table directly from unicode.org and grep that:

    $ curl https://unicode.org/Public/10.0.0/ucd/UnicodeData.txt > /tmp/unicodetable
    $ grep -i apl /tmp/unicodetable
    

    Of course, the output isn’t as nice as @Sietsebb’s, since UnicodeData.txt contains lots of other junk. :-)
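
    Trimming that junk is mostly a matter of splitting on semicolons: in UnicodeData.txt the first field of each line is the code point in hex and the second is the character name. A rough Python sketch, where the function name `search_ucd` is made up for illustration and the two sample lines stand in for the downloaded file:

    ```python
    import re


    def search_ucd(lines, pattern):
        """Yield (char, code, name) for UnicodeData.txt lines whose name matches."""
        rx = re.compile(pattern, re.IGNORECASE)
        for line in lines:
            fields = line.rstrip("\n").split(";")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            code, name = fields[0], fields[1]
            if rx.search(name):
                yield chr(int(code, 16)), code, name


    # Sample lines in the UnicodeData.txt layout; real use would pass
    # open("/tmp/unicodetable", encoding="utf-8") instead.
    sample = [
        "2336;APL FUNCTIONAL SYMBOL I-BEAM;So;0;L;;;;;N;;;;;\n",
        "03BB;GREEK SMALL LETTER LAMDA;Ll;0;L;;;;;N;GREEK SMALL LETTER LAMBDA;;039B;;039B\n",
    ]
    for char, code, name in search_ucd(sample, "apl"):
        print(f"{char} {code}\t{name}")
    ```

    This only searches the name field, so a pattern that happens to match a category or decomposition column won’t produce false hits the way a plain grep can.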

    1. 2

      Ooh, that is indeed nice. I had a look at the query docs, and the query language reminds me of Mercurial’s revsets or filesets – property-based querying is a delightful thing. (That was also one of the things I liked most in Gmail, back when I started: that I could type from:myfriend has:attachment, instead of having to click through dropdowns and ‘add a line’ buttons in a query dialog box.)

      “You can also just grab the Unicode table directly from unicode.org and grep that […] the output isn’t as nice”

      Actually, I might end up needing to use that in the future, seeing how isthatforreal pointed out in another reply that Name.pl is on course to disappear without warning. Thanks for pointing it out!

      1. 3

        Ah, here are the docs for UnicodeData.txt: https://www.unicode.org/reports/tr44/tr44-20.html#UnicodeData.txt

        Browsing unicode.org is bewildering at first, and often the hardest part is figuring out how to know what you’re looking for. Once you find the thing you’re looking for though, the docs are generally great!

        1. 4

          Your remark reminded me of Cory Foy’s tweet:

          Keep thinking about how valuable it would be for development organizations to have a librarian. There’s often so much institutional knowledge in wikis, comments, shared drives, but so little thought to organization and discoverability.

    2. 5

      Bonus points for using Python to get fields out of a Perl file :)

      1. 3

        btw, the perl equivalent of the python (kinda):

        grep -i apl /usr/share/perl/5.26.2/unicore/Name.pl | \
           perl -C -pe'print chr(hex((split " ", $_, 2)[0])), " "'
        

        (since you obviously have perl installed already)

      2. 4

        An easier way to do this, for Emacs users, is to run “M-x grep” (or my preference “M-x rgrep”) and when it prompts for the query string use “C-x 8 Enter” to bring up the insert-char character prompt, where you can type (and tab complete) the Unicode name or hex code. Hit enter when you’re done and it will insert the character into the grep query string and you can continue typing the rest of the query string.

        EDIT: I misunderstood, I thought the goal was to do something like “grep λ myfiles”.

        One way to use Emacs to achieve a similar thing to what OP is actually doing is to use “M-x insert-char” (or “C-x 8 Enter”) and then use tab completion to show a list of matching Unicode names and symbols. It supports wildcard matching using *, so you can do something like “*Lambda” - Tab to bring up a list of character names with “lambda” in them. Then, when the character is inserted, use M-x describe-char to view a lot of information about it (character class, category, bidi, code point, its code in the current encoding, etc.)

        1. 1

          Or M-x counsel-unicode-char for some nice completion if you use https://github.com/abo-abo/swiper , looks like https://lists.gnu.org/archive/html/emacs-devel/2016-01/pngmRVpb5gbKs.png

        2. 3

          If your OS is the one that has unicode(1),

          ; awk '/apl/{l=$0;c="unicode "$1;c|getline;close(c);print $0,l}' /lib/unicode
          ⌶ 2336  apl functional symbol i-beam
          ⌷ 2337  apl functional symbol squish quad
          ⌸ 2338  apl functional symbol quad equal
          ⌹ 2339  apl functional symbol quad divide
          ⌺ 233a  apl functional symbol quad diamond
          ⌻ 233b  apl functional symbol quad jot
          ⌼ 233c  apl functional symbol quad circle
          [...]
          
          1. 4

            That is indeed nice! As is the unicode program by Radovan Garabík that you mention, which I didn’t know about. I’ve just installed it. It gives way more info than grepping Name.pl, but the info is also very spread out, and by default it only displays the first 10 matches.

            $ /usr/bin/unicode delta
            U+018D LATIN SMALL LETTER TURNED DELTA
            UTF-8: c6 8d UTF-16BE: 018d Decimal: &#397; Octal: \0615
            ƍ
            Category: Ll (Letter, Lowercase)
            Bidi: L (Left-to-Right)
            
            [...]
            
            U+10384 UGARITIC LETTER DELTA
            UTF-8: f0 90 8e 84 UTF-16BE: d800df84 Decimal: &#66436; Octal: \0201604
            𐎄
            Category: Lo (Letter, Other)
            Bidi: L (Left-to-Right)
            
            
            Too many characters to display, more than 10, use --max option to change it
            

            Hmm, I wonder if there exists anything that takes records, and prints them to a terminal (but not to a pipe, obv.) in flexbox-like cards or tiles? More reader-friendly than table rows, more compact than printing only one record per line.

          2. 2

            Take a look at Unicode::UCD for an example of how Perl’s “Making Easy Things Easy and Hard Things Possible” is becoming a thing of the past.

            1. 2

              That’ll teach me to rely on implementation details :-D :-/. I can understand the desire to use, without further modification, the datafiles supplied by Unicode.org; and probably the new system makes harder things possible; but that flat file was really nice to find, and had a low usage barrier that I really appreciated. Ah well, we’ll get by.

              1. 2

                perl6 lowers the barrier a bit.

                1. 2

                  How’s the performance? Perl5 can also do the search in a one-liner, but it’s very slow:

                  grep { charnames::viacode($_) =~ /$search/i } 0..0x10FFFF;
                  

                  Here’s a full script with the imports and formatted printing in case anyone wants to test:

                  use open qw/:std :utf8/;
                  use charnames ':full';
                  my $search = shift;
                  my @found = grep { charnames::viacode($_) =~ /$search/i } 0..0x10FFFF;
                  print "$found[0] - " . charnames::viacode($found[0]) . "\n";
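
                  For comparison, the same brute-force scan can be sketched in Python with the standard unicodedata module. This is an illustration, not a benchmark: the function name `find_chars` is made up, and the names come from whatever Unicode version the local Python ships, which may differ from Perl’s.

                  ```python
                  import re
                  import unicodedata


                  def find_chars(pattern):
                      """Return (code point, name) pairs whose character name matches pattern."""
                      rx = re.compile(pattern, re.IGNORECASE)
                      found = []
                      for cp in range(0x110000):
                          # unicodedata.name() raises ValueError for unnamed code
                          # points unless a default is supplied, so pass "".
                          name = unicodedata.name(chr(cp), "")
                          if name and rx.search(name):
                              found.append((cp, name))
                      return found


                  for cp, name in find_chars(r"functional symbol quad equal"):
                      print(f"{chr(cp)} U+{cp:04X} {name}")
                  ```

                  Unlike the Perl script above, this prints every match rather than only the first one.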
                  
                  1. 2

                    Slow as well.