1. 61

  2. 34

    Our sbase project, implementing the POSIX coreutils, makes cut -c behave as expected because it handles code points instead of bytes (and offers the flag -b for explicit byte-ranges).
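
    A minimal sketch of the byte-versus-code-point distinction, in Rust. The helpers `cut_chars` and `cut_bytes` are hypothetical names for illustration, not sbase's actual code, and they only model a single 1-based inclusive range on UTF-8 input:

    ```rust
    /// Select code points `first..=last` (1-based, inclusive), roughly what a
    /// code-point-aware `cut -c first-last` does on one line.
    fn cut_chars(line: &str, first: usize, last: usize) -> String {
        let skip = first.saturating_sub(1);
        line.chars()
            .skip(skip)
            .take(last.saturating_sub(skip))
            .collect()
    }

    /// Select the same range counted in bytes, like `cut -b first-last`.
    /// The result may split a multibyte sequence, so it is returned as raw bytes.
    fn cut_bytes(line: &str, first: usize, last: usize) -> Vec<u8> {
        let skip = first.saturating_sub(1);
        line.bytes()
            .skip(skip)
            .take(last.saturating_sub(skip))
            .collect()
    }
    ```

    With the line "héllo" and the range 1-2, `cut_chars` yields "hé", while `cut_bytes` yields `h` followed by only the first byte of the two-byte "é", which is no longer valid UTF-8.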

    Still left to do is support for grapheme clusters. For that purpose I developed a super-easy-to-use and simple C99-library called libgrapheme (see grapheme.h and the manuals) that automatically parses the Unicode standard files to offer extended grapheme break point detection with all bells and whistles (full test-coverage, automatic test generation from Unicode, emoji support, etc.).

    Statically linked, it only adds around 25K to a binary. Maybe it is useful to somebody, though I haven’t gotten around to writing a README or making a release yet. Let me know if you use it, then I’ll tag a version 1 and set up a README and website sooner than planned! :)

    1. 1

      Wow, when I read “simple” I thought “impossible”, but when you said it parses the Unicode standard files, I was impressed. This is an amazing achievement (if it works, which from a quick glance at the code it seems like it should).

      I think I would like to use it in a couple of projects (but please first focus on making sure the API will be stable, don’t rush to a 1.0 release just because some people want to use it).

    2. 14

      It sounds like the correct title is “GNU cut considered harmful”, since the author points out that other implementations do the right thing.

      That said, this is quite tricky with UNIX in general. The locale that a program should use defines the character set that it should use for input and output. This is propagated by an environment variable and on *NIX systems (unlike Windows, for example) the environment of a program cannot be modified after it starts. This means that you can start your music player in a BIG5 locale, then run a command that monitors its output in a UTF-8 locale and there’s no standard mechanism for telling the music player that it should switch its output to UTF-8.

      On macOS, Apple largely ignores the UNIX locale for precisely this reason (well, technically, NeXT did 30 years ago and Apple simply inherited that choice): The current locale is part of user defaults, which has a mechanism to notify applications that a default has changed. AppKit hooks a handler for this notification that checks if the changed default is the locale and, if so, switches the locale of the application. Most command-line tools on Darwin don’t use this mechanism, unfortunately, so you can easily end up with a mismatched locale between command-line and GUI tools. This is also why Apple added the _l-suffixed variants of all of the locale-aware C standard library functions (most of which were standardised in POSIX.1-2008): they allow you to use C libraries with an explicit locale_t that is picked from the locale in user defaults, rather than whatever happens to be in the environment.

      1. 7

        Unix and C are defined that way, but I’d argue the environment variable is an incoherent design. It may have worked in the 80’s, but it no longer works in a networked world.

        The locale is metadata, and metadata describes data. It generally should be shipped along with it, on the file system, or like HTTP and other protocols do. It doesn’t belong in an environment var that the program reads!

        You can have files on your disk with different encodings! You get them from the network and store them on disk. The same instance of a program can read two different files in different encodings. grep accepts multiple files :)

        Oil is UTF-8 only and I believe OpenBSD’s shell made the same choice. Good to know OS X has leaned in that direction too. If you need to process data in another encoding you can convert it with iconv or something first.

        Although I think we need to add options for LANG=C, so it’s basically C or UTF-8.

      2. 12

        This isn’t present in the manpage (at least, not on my install), but it is in the info page. And since I never check the info page (does anyone?), I had no clue.

        For GNU tools, the info pages always have the complete documentation. I prefer to look that up online instead of using the info command, though, since I haven’t troubled myself to learn the tricks of info navigation and online pages are visually easier to digest.

        Regarding multibyte characters, I think very few tools in coreutils support them. I only know of the wc -m option (and this will not treat grapheme clusters as a single character). Some more tools that’ll trip you up if you expect multibyte processing: tr, head and tail (edit: just remembered sort -k<f>.<c> as well)
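
        To illustrate why a character count is not a grapheme-cluster count, here is a small Rust sketch. `count_bytes` and `count_code_points` are hypothetical helpers that only approximate what `wc -c` and `wc -m` report for UTF-8 input:

        ```rust
        /// Number of bytes in the UTF-8 encoding, comparable to `wc -c`.
        fn count_bytes(s: &str) -> usize {
            s.len()
        }

        /// Number of Unicode code points, comparable to `wc -m` in a UTF-8 locale.
        fn count_code_points(s: &str) -> usize {
            s.chars().count()
        }
        ```

        For the string "e" followed by U+0301 (combining acute accent), `count_bytes` gives 3 and `count_code_points` gives 2, even though the pair renders as one user-perceived character. Counting that as one would need Unicode segmentation data, which Rust's standard library does not provide.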

        1. 21

          Yeah, I’d straight up forgot that info pages exist. It feels very “GNU-y” that they decided to keep the actual complete documentation in something that only they use.

          1. 7

            Man pages are Unix, and GNU’s Not Unix.

            I have my issues with the current state of the FSF, and I can intellectually grasp that GNU code can be really gnarly and probably doesn’t need to be now that all the world’s a Linux, but complaining that GNU code isn’t POSIX-compliant or that it doesn’t adhere to the “Unix ethos” is missing the point. GNU’s goal isn’t to recreate Unix but to make a new thing that is better.

            That said, not handling Unicode correctly in 2021 feels a bit off.

            1. 16

              I have a longer blogpost that I mean to write about how deeply frustrated I am with emacs core development (and I say this as someone who’s used it as an editor for years and desperately wants it to succeed), and how I think it stems from GNU/the FSF not being willing to let go of the past and the fact that they’ve failed in one of their fundamental goals. This is probably the most emblematic example of that that comes to mind.

              Also yeah. This would’ve been okay in 2000, or maybe 2010, but… come on.

            2. 3

              Info manuals can be and are exported to HTML, which everyone can access through a web browser. That doesn’t seem like ‘something that only they use’.

              1. 6

                I mean that nobody else writes info manuals.

          2. 9

            I thought “considered harmful” means more than “there is a bug”. Does it crash the system or the disk? Does it enable remote code execution? The article mentions nothing but an annoyance. I suppose if you trim off half of a 4-byte Unicode grapheme and display it, the resulting 2-byte blob might, I dunno, mung your display? It still doesn’t cross the threshold of “considered harmful” to me.

            1. 7

              There’s a similar sharp edge in Rust in that it enforces UTF-8 clean strings, but cheerfully lets you create slices like &mystring[0..64] while actually using byte indices. It works perfectly until one day the entire application crashes because someone had a multibyte character wrapped around index 64. It’s possible to code it correctly of course, but geez it can be sneaky sometimes.
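
              One way to code it correctly is to back up to the nearest character boundary before slicing. A minimal sketch, where `truncate_at_boundary` is a hypothetical helper name and the only library call relied on is `str::is_char_boundary`:

              ```rust
              /// Truncate `s` to at most `max_bytes` bytes without splitting a code point.
              fn truncate_at_boundary(s: &str, max_bytes: usize) -> &str {
                  if max_bytes >= s.len() {
                      return s;
                  }
                  // Index 0 is always a boundary, so this loop terminates.
                  let mut end = max_bytes;
                  while !s.is_char_boundary(end) {
                      end -= 1;
                  }
                  &s[..end]
              }
              ```

              With "héllo", a limit of 2 lands in the middle of the two-byte "é", so the helper returns "h" where `&s[..2]` would panic.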

              1. 2

                Indexing through [...] panics when out of bounds (this is true for slices past their length, and for your example). .get(...), on the other hand, will return None for those out-of-valid-bounds cases. I presume it’s a matter of internalizing this behaviour of the square brackets (or, depending on the use-case, explicitly using .as_bytes()).
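
                A short sketch of the non-panicking form. `checked_prefix` is a hypothetical wrapper for illustration; the behaviour comes from `str::get`:

                ```rust
                /// Return the first `n` bytes of `s` as a string slice, or `None` when
                /// `n` is past the end or falls inside a multibyte code point.
                fn checked_prefix(s: &str, n: usize) -> Option<&str> {
                    s.get(..n)
                }
                ```

                For "héllo", `checked_prefix(s, 2)` is `None` because byte 2 is inside "é", `checked_prefix(s, 3)` is `Some("hé")`, and `checked_prefix(s, 64)` is `None` because it is past the end.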

                1. 3

                  Ah, yes, just like how everybody using C++ vectors indexes with ‘.at()’ instead of ‘[]’…

                  1. 3

                    There’s an easily enabled clippy lint that warns about uses of [].

                    1. 3

                      Not really comparable footguns in terms of potential costs. A guaranteed crash (with a sane error message) is a lot better than UB.

                2. 6

                  GNU tool bugginess aside, a blog named “Gay Robot Noises” that correctly punctuates the band name Godspeed You! Black Emperor is very much My Thing.

                  1. 4

                    FWIW it seems if your LC_CTYPE is set correctly OpenBSD’s cut works as expected. Tested it out myself.