1. 91
  1. 23

    Considering the Suckless project’s usual stance on anything invented after 1977, this looks, surprisingly enough, like something that doesn’t suck much at all. 👌✅

    1. 23

      It is my opinion that even though Unicode has many quirks and they should not have touched the whole emoji thing, it is here to stay, and processing it properly becomes more and more important as more people from non-Western countries join the internet and use software.

      Maybe this simple library motivates more people to do it, possibly even in embedded development. I totally understand why many choose not to include a massive dependency like ICU or libicustring in their software, let alone in the aforementioned embedded development, where that would be outright impossible.

      1. 12

        they should not have touched the whole emoji thing

        I strongly disagree. Because emoji are standardized in Unicode, they are automatically accessible to blind users, with no extra effort on the part of the writer, unlike, say, random images. Perhaps emoji are frivolous. But people like to use them, and I’m glad they’re accessible.

        1. 9

          One more advantage of emoji is that the implementors’ drive to support them also led to better support for more recent Unicode algorithms, which benefits general language processing. So even though I’m not a fan, I don’t see it as black and white.

          1. 6

            So even though I’m not a fan, I don’t see it as black and white.

            Well, provided your font rendering library supports colour emoji.

            1. 1

              Haha good one! :)

    2. 8
      1. 13

        I know the Julia developers always do a fine job. The problem with utf8proc is that its API is built around wrong assumptions in some places (due to the fact that the Unicode standard was still evolving when utf8proc was developed). Examples are the case-mapping functions (case-mapping is not 1-to-1 anymore) and stateless break detection, but this is not a big deal and can be amended. It also depends on Ruby and Perl (?) at compile time, which could be difficult to provide in some contexts, and there is no separation into compilation units, which makes static linking very inefficient.

        One fundamental difference is the approach to break detection: they determine the break class of each codepoint right away, which makes a scan over all property tables necessary, while libgrapheme determines classes on an as-needed basis, reflecting the fact that the breaking algorithm was designed around the most common cases. This probably gives better cache locality, but it would need to be investigated further.
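        To sketch the table-driven part, a break-property lookup is essentially a binary search over sorted codepoint ranges. Everything below is illustrative: the table contents are made up and the function name is hypothetical, not libgrapheme’s generated code.

        ```c
        #include <stddef.h>
        #include <stdint.h>

        /* hypothetical property table: sorted, non-overlapping codepoint
         * ranges (real tables are generated from files such as
         * GraphemeBreakProperty.txt and emoji-data.txt) */
        struct range { uint_least32_t lower, upper; };

        static const struct range joiner_prop[] = {
                { 0x200D,  0x200D  }, /* ZERO WIDTH JOINER */
                { 0x1F3FB, 0x1F3FF }, /* emoji modifiers (skin tones) */
        };

        /* binary search: is cp contained in one of the ranges? O(log n) */
        static int
        has_property(uint_least32_t cp, const struct range *t, size_t len)
        {
                size_t lo = 0, hi = len;

                while (lo < hi) {
                        size_t mid = lo + (hi - lo) / 2;

                        if (cp < t[mid].lower)
                                hi = mid;
                        else if (cp > t[mid].upper)
                                lo = mid + 1;
                        else
                                return 1;
                }
                return 0;
        }
        ```

        The as-needed approach described above means such a lookup only runs when the algorithm actually requires that class, instead of classifying every codepoint up front.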

      2. 10


        Requirements: a C99 compiler and POSIX make.

        $ wc -l gen/* data/* src/*
            93 gen/character-prop.c
            19 gen/character-test.c
           437 gen/util.c
            41 gen/util.h
          1459 data/GraphemeBreakProperty.txt
           630 data/GraphemeBreakTest.txt
          1297 data/emoji-data.txt
           219 src/character.c
           208 src/utf8.c
            76 src/util.c
            28 src/util.h
          4507 total
        $ wc -l test/*
           64 test/character-performance.c
           45 test/character.c
          317 test/utf8-decode.c
           93 test/utf8-encode.c
           11 test/util.c
           11 test/util.h
          541 total

        I am so all over this for my project.

        1. 8

          1459 data/GraphemeBreakProperty.txt

          630 data/GraphemeBreakTest.txt

          1297 data/emoji-data.txt

          Keep in mind that these are not part of the codebase and are optional, given they’re simply the Unicode data tables. If you want to reduce it even further, call “make clean-data”, which will remove the data and download it again on demand. I kept it in the tarball to make it self-contained. At around 60K in total, the tarball is very small.

          You have a very interesting project going there. Lua is a good choice for that.

          1. 10

            Yeah, I did understand that. Sorry, I didn’t mean to make your project seem less parsimonious than it is. The way I see it, compressing all the complexity of human grapheme clusters down to 4.5kLoC is a fantastic achievement. One of the few times lately where I feel I’m building on the shoulders of giants (Unicode and you).

            1. 12

              Thank you for this very kind compliment!

              The natural next step is to add support for word, sentence and line breaks, given most of the processing infrastructure is in place; thanks to the separate compilation units, this won’t affect the binary size of applications only checking for character breaks, for example (one downside of utf8proc).

              This is planned, but only after the search algorithm has been optimized in regard to grapheme clusters and some automatic benchmarks on good data are in place to provide a baseline. libgrapheme currently uses binary search, utf8proc has jump tables, but I think the theoretical optimum would be a form of interpolation search, which can give up to a logarithmic speedup over binary search (O(log(log(n))) instead of O(log(n)), though it depends on the data and can be O(n) in the worst case). The goal is to become the fastest library for this purpose, and I think it can be done elegantly.
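              To illustrate the idea (a generic sketch, not libgrapheme code): instead of probing the middle element, interpolation search estimates where the key *should* sit, assuming roughly uniform key distribution.

              ```c
              #include <stddef.h>
              #include <stdint.h>

              /* interpolation search over a sorted array: expected
               * O(log log n) probes on uniform data, O(n) worst case;
               * returns the index of key or (size_t)-1 if absent */
              static size_t
              interp_search(const uint_least32_t *a, size_t n,
                            uint_least32_t key)
              {
                      size_t lo, hi;

                      if (n == 0)
                              return (size_t)-1;
                      lo = 0;
                      hi = n - 1;

                      while (lo <= hi && key >= a[lo] && key <= a[hi]) {
                              uint_least32_t span = a[hi] - a[lo];
                              /* probe position proportional to where the
                               * key lies between a[lo] and a[hi] */
                              size_t pos = lo + (size_t)((double)(key - a[lo]) *
                                           (hi - lo) / (span ? span : 1));

                              if (a[pos] == key)
                                      return pos;
                              if (a[pos] < key) {
                                      lo = pos + 1;
                              } else {
                                      if (pos == 0)
                                              break;
                                      hi = pos - 1;
                              }
                      }
                      return (size_t)-1;
              }
              ```

              Whether this beats binary search on the actual Unicode property tables depends on how uniformly the codepoint ranges are distributed, which is exactly what the planned benchmarks would have to show.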

        2. 10

          I used to be an advocate for Suckless, but after years of using their software I learned the hard way, when I tried to build upon dmenu(1) and surf(1), that it just plain sucked, no “less” about it. The bugs that I had ignored and occasionally reported were just symptoms of a much, much deeper disorder. The code quality was extremely poor, the functions were interrelated in ways that would make a genealogist scream and cower in terror, and the entire thing was not written with “separation of concerns” in mind; it gave me such a bad code headache that I had to lie down. I drifted away and never looked back, preferring plan9port when it was appropriate.

          This, however, seems to be constructed extremely well, and seems worthy enough for inclusion in my own projects. Huzzah! Maybe it shows an improvement in the design tastes of the suckless project as a whole? Only time will tell.

          1. 21

            Suckless is not a homogeneous group, and the coding quality varies by project. I’m happy to hear you like this project, and I hope it will be useful to you. Don’t hesitate to send me an e-mail for further discussion, if you like, also regarding the problems you observed.

            1. 2

              Curious if you looked at the plan9port code at all, since they also use global variables?

              1. 2

                The problem is not with global variables, but… I mean, it’s hard to describe. Literally everything about the dmenu code was utterly horrifying to deal with, to the point that I wrote my own dmenu-like tool.

                1. 1

                  Didn’t mean it as a gotcha, just curious if you would’ve had the same reaction to the plan9port code.

                  The amount of code in dmenu vs. alternatives is categorically not horrifying to deal with, so it has that going for it. Such bad code, and in such small portions!

            2. 5

              Do I understand the purpose of the library correctly?

              A library like https://github.com/antirez/sds gives you dynamic memory management similar to C++ strings, but does not provide you with the ability to discern Unicode user-perceived characters. In contrast, libgrapheme assumes you already handle the memory management part, like allocating strings, and provides functions to discern and work on user-perceived characters as defined by Unicode. These two libraries (sds + libgrapheme) have different purposes and can be used together for a more complete string-processing package.

              1. 6

                Your observation is correct. You can pass both length- and NUL-delimited strings to libgrapheme functions. Given SDS strings are NUL-terminated (as far as I understand, they store metadata before the string itself), you can simply pass an sds directly to any libgrapheme function. The detection of user-perceived characters happens two layers above the basic byte level that sds provides: first you have to decode UTF-8 (or any other encoding) to obtain codepoints, which in turn need to be analyzed to detect the spots where there is a character break (i.e. a grapheme cluster break). For ease of use, libgrapheme gives you the byte offset of such breaks directly, so you never have to work with codepoints and grapheme clusters as new data structures.
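                To illustrate the first layer (bytes → codepoints), here is a simplified UTF-8 decoding step. It is only a sketch: unlike a real decoder such as grapheme_decode_utf8(), it does not reject overlong encodings or surrogates, and the function name is made up.

                ```c
                #include <stddef.h>
                #include <stdint.h>

                /* decode one codepoint from buf into *cp and return the
                 * number of bytes consumed; malformed input yields the
                 * replacement character U+FFFD */
                static size_t
                decode_utf8(const unsigned char *buf, size_t len,
                            uint_least32_t *cp)
                {
                        size_t i, n;

                        if (len == 0)
                                return 0;

                        if (buf[0] < 0x80) {                  /* 0xxxxxxx */
                                *cp = buf[0];
                                return 1;
                        } else if ((buf[0] & 0xE0) == 0xC0) { /* 110xxxxx */
                                *cp = buf[0] & 0x1F; n = 2;
                        } else if ((buf[0] & 0xF0) == 0xE0) { /* 1110xxxx */
                                *cp = buf[0] & 0x0F; n = 3;
                        } else if ((buf[0] & 0xF8) == 0xF0) { /* 11110xxx */
                                *cp = buf[0] & 0x07; n = 4;
                        } else {
                                *cp = 0xFFFD; /* invalid lead byte */
                                return 1;
                        }

                        if (n > len) { /* truncated sequence */
                                *cp = 0xFFFD;
                                return len;
                        }
                        for (i = 1; i < n; i++) {
                                if ((buf[i] & 0xC0) != 0x80) {
                                        *cp = 0xFFFD; /* not a continuation */
                                        return i;
                                }
                                *cp = (*cp << 6) | (buf[i] & 0x3F);
                        }
                        return n;
                }
                ```

                The second layer then runs the segmentation rules over the resulting codepoint stream to find the grapheme cluster breaks; libgrapheme hides both layers behind byte-offset-returning functions.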

              2. 8

                Let’s go! Finally not rust!

                1. 16

                  Yeah, I don’t have all day to wait for my code to compile.

                  1. 4

                    I wanna start naming my projects after glass or plastic to reflect that they are not written in rust.

                    1. 3

                      I thought about naming my projects after metals in the platinum group (Ruthenium, Rhodium, Palladium, Osmium, Iridium and Platinum) because they don’t rust. I’m glad to hear the counter-culture is growing.

                    2. 1

                      So now developers build identity around not using Rust? Ok 🙄

                    3. 3

                      This looks very cool, thank you very much for publishing a lightweight alternative to ICU! I was thinking of a use case for this library in constrained environments, and I was wondering if it supports streaming? Also, could we see UTF-8 validation performance benchmarks? :)

                      1. 6

                        I assume by streaming you mean reading from a file and getting the next break as a file offset. Yes, you can do that by rolling your own thing with grapheme_decode_utf8() and grapheme_is_character_break() (also look at the source code of grapheme_next_character_break(), which you can trivially modify to read in more bytes when the length returned by grapheme_decode_utf8() is larger than the buffer size). Another idea (which is also more performant) is to read chunks of a fixed size (like 4096 bytes) and operate on them. If at some point the offset returned by grapheme_next_character_break() points at the end of the chunk, simply shift the buffer so your previous starting position is at the beginning, fill it up again and rerun the function. Keep in mind, though, that you need an upper bound for grapheme cluster length.

                        I thought about adding a streaming function, but there are already multiple ways to do I/O, let alone in C++. The aforementioned chunked approach is also much faster, and I don’t want to encourage badly performing idioms. You might say that this is what buffered I/O is for, but my experiments showed massive performance losses anyway. Even with text input you mostly have ASCII and generally 1-byte characters; multi-byte grapheme clusters are the exception. Another question is: if you read from a file, should the library function reset the file offset?
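                        The chunked, shift-and-refill pattern described above might be sketched like this. Everything here besides the pattern itself is made up: the stand-in break function ends a “unit” just past a space (instead of doing real grapheme segmentation), both function names are hypothetical, and the upper-bound handling is glossed over.

                        ```c
                        #include <stddef.h>
                        #include <string.h>

                        /* stand-in for grapheme_next_character_break():
                         * returns the offset just past the unit, or len if
                         * the unit may continue beyond the window */
                        static size_t
                        next_break_stub(const char *s, size_t len)
                        {
                                size_t i;

                                for (i = 0; i < len; i++)
                                        if (s[i] == ' ')
                                                return i + 1;
                                return len;
                        }

                        /* slide a fixed-size window over the input: when the
                         * break offset hits the window end and input remains,
                         * refill before deciding; otherwise consume the unit
                         * by shifting the tail to the front */
                        static size_t
                        count_units(const char *in, size_t inlen,
                                    char *win, size_t winsize)
                        {
                                size_t fill = 0, consumed = 0, units = 0;

                                for (;;) {
                                        size_t want = winsize - fill;
                                        size_t left = inlen - consumed;
                                        size_t take = want < left ? want : left;
                                        size_t off;

                                        memcpy(win + fill, in + consumed, take);
                                        fill += take;
                                        consumed += take;

                                        if (fill == 0)
                                                break; /* input exhausted */

                                        off = next_break_stub(win, fill);
                                        if (off == fill && consumed < inlen &&
                                            fill < winsize)
                                                continue; /* refill first */

                                        /* complete unit (or forced at the
                                         * window/input end) */
                                        units++;
                                        memmove(win, win + off, fill - off);
                                        fill -= off;
                                }
                                return units;
                        }
                        ```

                        The “forced” case when the window is full is where the upper bound on grapheme cluster length matters: the window must be at least that large, or a pathological cluster gets split.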

                        I hope this helps.

                        1. 1

                          Thank you very much for the answer. :)

                          1. 2

                            You are very welcome! Feel free to send me an e-mail as a follow-up. I’m always open to such suggestions and to seeing how such software is used in practice.

                      2. 3

                        If I want to skim the docs to see what functionality this includes, I first have to download, build and install the library? Really?

                        1. 8

                          You can also browse the manual pages in the repository browser.

                          I’ll keep your point in mind, though, and will add examples directly to the website when I get around to it.

                          1. 4

                            You can also use man -l after downloading.