1. 79

  2. 46

    I was wondering whether there was a significant performance improvement of the table lookup glibc does over the naive implementation musl has.

    The answer is a pretty big yes. Glibc’s isalnum is about 6-7 times faster than musl’s.

    Benchmark code: https://paste.sr.ht/~minus/18b44cfe58789bc1fb69494130e859a1189d1772

    1. 28

      Yep, glibc didn’t add all of that weirdness for no reason…people tend to forget that.

      1. 1

        Wouldn’t a big boy switch() statement be faster then ? Something like

        switch(c) {
        case 'a':
        case 'b':
        case 'c':
        case 'Z':
            return 1;
        return 0;

        Is it as fast as a lookup table ?

        1. 5

          It might be, if the compiler was smart enough to compile it into a lookup table.

          Think about how you would generate code for a switch: the naive solution is to make a bunch of conditional branches. If the values are contiguous (say, the 26 English letters plus case variants), you output some early-rejection code to handle values outside that range and then do conditional branches again or use a LUT.

          Maybe the cost of memory access for non-cached LUTs is high enough that a bunch of conditional branches is faster, maybe not. But, relying on the compiler switch codegen is iffy.

          1. 3

            Generating code for switches is a well understood area and the code generation can be pretty good.

            Here’s what an is_digit equivalent compiles into:

                add     eax, -48  
                xor     edx, edx  
                cmp     eax, 10  

            This is a standard trick for optimizing range checks.

                // The usual range check with two compares.  
                low_bound <= value <= high_bound  
                // An equivalent range check with a subtract and an unsigned compare.  
                value - low_bound u<= high_bound - low_bound  

            If you like this sort of bit-twiddling then I would recommend the book Hacker’s Delight by Henry Warren.

            Here’s a partial list of some other techniques that are used.

            • Lookup table
            • Profile driven layout of cases, e.g. testing hot case(s) first
            • Binary search
            • Perfect hashing
            • Tries
            • Combining multiple techniques, e.g. range checks for contiguous tests and binary search for spare tests.
          2. 1

            Wouldn’t a big boy switch() statement be faster then ?

            I wouldn’t assume either option is faster without profiling it first - a switch case in my response to the child post was optimized into 3 instructions.

            The benchmark in the grand parent’s post is pretty much the best possible case for a lookup table. Programs that call ctype functions rarely instead of often may get different results; there could be cache misses.

            This isn’t an issue with inline code such as @zx2c4’s ctype.h functions.

            The answer, as with everything in performance, is to profile your use case and come to a conclusion.

      2. 41

        TLDR of the rest of this post - glibc has to support locales for backward compatibility, musl doesn’t so they can use a simpler approach.

        This rant follows the author’s usual pattern.
        Someone else made different trade offs or decisions than he would have.
        He assumes negative intent and rants that the other opinion is stupid, evil, or both of these.
        Many of his rants show that he doesn’t understand the other perspectives on the problem; that’s what’s happening in this case.

        Locales are a bad idea for modern applications but glibc is stuck with supporting them for backward compatibility. They were a reasonable solution at the time and it’s not surprising that there are some issues 30 years later.

        The glibc implementation is how these functions are normally implemented; you can find a similar implementation in chapter 2 of P.J. Plauger’s The Standard C Library.
        The lookup table uses bitmasks for the different predicates, e.g. isalnum combines the masks for isdigit and isalpha.
        Locales are supported by reading from a different part of the table.
        musl doesn’t support multiple locales so it can use a simpler approach.

        1. 4

          100% agree. With Drew is always like, what is the trade off?

        2. 35

          Here are functions that do the same as the musl versions in this blog post, but are always constant time: https://git.zx2c4.com/wireguard-tools/tree/src/ctype.h This avoids hitting those large glibc lookup tables (cache timing) as well as musl’s branches. It also happens to be faster than both in some brief benchmarking.

          1. 7

            always constant time

            Musl and glibc’s versions are both constant time. (Technically, since || is short-circuiting, musl’s version won’t always take the same amount of time, but it’s still O(1).)

            Do you mean branchless?

            1. 29

              zx2c4 means “constant time” in the side-channel avoidance sense, not the complexity theoretic sense. They’re not the same definition.

          2. 30

            Funnily enough, I pointed out exactly this bug two years ago: https://lobste.rs/s/q5awqv/introducing_scdoc_man_page_generator#c_r5dfyt

            1. 15

              I think the correct function to use here is actually iswalnum (the wide-character version of isalnum). Since wide characters are 32-bit on modern POSIX systems, that should cover all UCS-4 code points. (I don’t know what the correct answer is on Windows, since wide characters are 16-bit there.)

              I think the complexity of the glibc implementation is defensible. Before UTF-8 became ubiquitous, a properly internationalized libc had to support 8-bit character sets like ISO 8859-1. As I recall, Fedora didn’t make UTF-8 the default until 2003 or 2004, and of course, Fedora was bleeding-edge compared to most distros. Maybe some users still have their locale set to a non-UTF-8 locale. Backward compatibility is ugly, but people complain if we break their tools, and well they should.

              Still, the glibc implementation could have been robust to malformed input, such that it wouldn’t segfault.

              1. 12

                When I was in college, I asked a prof how some standard C function was typically implemented. His answer: “I don’t know, but you could probably find out by looking at glibc source”. I laughed. If I were a professor, and one of my students asked me that question today, I’d point them at musl. Free software is great. What’s even better is free software with readable source.

                I say that with all due respect to GNU and the FSF. Your focus will shape your technical decisions, and GNU has focused heavily on portability. I’m sure that quite a bit of what I consider unnecessary obfuscation was done for reasons of portability.

                1. 4

                  His answer: “I don’t know, but you could probably find out by looking at glibc source”. I laughed.

                  Because he suggested to read the source or because he suggested glibc?

                  1. 10

                    I would have laughed at the specific combination of both. Reading glibc’s source is really really really difficult.

                    1. 4

                      Yes, for the suggestion of glibc.

                      For comparison, here’s an isalnum from OpenBSD. It uses a lookup table, is quite readable, and should never segfault.

                      isalnum(int c)
                      	return (c == EOF ? 0 : ((_ctype_ + 1)[(unsigned char)c] & (_U|_L|_N)));
                    2. 2

                      If they’re just after a naive implementatipn, you can point them towards K&R. Most of that book teaches C through teaching you how to build the standard library.

                    3. 9

                      I think this section from the GNU Coding Standards explains why the guts of so much GNU software is so damn weird:

                      If you have a vague recollection of the internals of a Unix program, this does not absolutely mean you can’t write an imitation of it, but do try to organize the imitation internally along different lines, because this is likely to make the details of the Unix version irrelevant and dissimilar to your results.

                      For example, Unix utilities were generally optimized to minimize memory use; if you go for speed instead, your program will be very different. You could keep the entire input file in memory and scan it there instead of using stdio. Use a smarter algorithm discovered more recently than the Unix program. Eliminate use of temporary files. Do it in one pass instead of two (we did this in the assembler).

                      Or, on the contrary, emphasize simplicity instead of speed. For some applications, the speed of today’s computers makes simpler algorithms adequate.

                      Or go for generality. For example, Unix programs often have static tables or fixed-size strings, which make for arbitrary limits; use dynamic allocation instead. Make sure your program handles NULs and other funny characters in the input files. Add a programming language for extensibility and write part of the program in that language.

                      1. 3

                        None of this sounds particularly weird, especially:

                        Make sure your program handles NULs and other funny characters in the input files.

                        1. 3

                          The point is that the programs were intentionally somewhat contorted to avoid being suspected of copyright infringement.

                          Remember, back in the late 80s~early 90s when GNU was getting its start, the only reason anybody used linux over BSD is that the latter were getting (baselessly) sued by AT&T.

                          1. 3

                            The GNU tools and utilities predates the Linux kernel by a significant margin.

                            It’s possible that the legal situation kept BSD back. It’s also possible that it simply wasn’t as fun and rewarding to contribute to BSD as it was to the Linux kernel.

                          2. 1

                            I think the bit about going all-out for speed is why you get these weird lookup tables and a grep that tries to munch multiple bytes at a time.

                        2. 7

                          My understanding is that one of the reasons glibc is so weird is because in the 90’s it had to be portable among many unix’s with each their own libc and their own compiler with their own quirks and bugs and weird assumptions. This resulted in a design and a system of humans behind that design that requires incredibly deep levels of configuration and indirection to make everything !!FAST!! and !!PORTABLE!!.

                          1. 4

                            Isn’t the complicated code in GNU libc related to locales, see: https://github.com/gliderlabs/docker-alpine/issues/144, https://www.openwall.com/lists/musl/2014/08/01/1.

                            Maybe someone with more knowledge can comment.

                            1. 11

                              The funny thing is that isalnum cannot take Unicode codepoints, according to its specification. What’s the point of using locales for unsigned char? 8-bit encodings? But I’d be surprised if it handled cyrillic in e.g. CP1251.

                              Update: Ok, I am surprised. After generating a uk_UA.cp1251 locale, isalnum actually handles cyrillic:

                              $ cat > isalnum.c
                              #include <ctype.h>
                              #include <stdio.h>
                              #include <locale.h>
                              int main() {
                                  setlocale(LC_ALL, "uk_UA.cp1251");
                                  printf("isalnum = %s\n", isalnum('\xE0') ? "true" : "false");
                              $ sudo localedef -f CP1251 -i uk_UA uk_UA.cp1251
                              $ gcc isalnum.c && ./a.out
                              isalnum = true

                              On the practical side, I have never used a non-utf8 locale on Linux.

                              1. 17

                                I guess you haven’t been around in the 90ies then, which incidentally also is likely when the locale related complexity was introduced in glibc.

                                Correctness often requires complexity.

                                When you are privileged to only needing to deal with english input to software running on servers inter US, then, yes, all the complexity of glibc might feel over the top and unnecessary. But keep in mind others might not share that privilege and while glibc provides a complicated implementation, at least it provides a working implementation that fixes your problem.

                                musl does not.

                                1. 10

                                  Correctness often requires complexity.

                                  In this case, it doesn’t – it just requires a single:

                                  _ctype_table = loaded_locale.ctypes;

                                  when loading up the locale, so that a lookup like this would work:

                                  #define	isalpha(c)\
                                          (_ctype_table[(unsigned char)(c)]&(_ISupper|_ISlower))

                                  Often, correctness is used as an excuse for not understanding the problem, and mashing keys until things work.

                                  Note, this one-line version is almost the same as the GNU implementation, once you unwrap the macros and remove the complexity. The complexity isn’t buying performance, clarity, or functionality.

                                  1. 7

                                    Interestingly, this is exactly how OpenBSD’s libc does it, which I looked up after reading this article out of curiosity.

                                  2. 4

                                    Locale support in standard C stuff is always surprising to me. I just never expect anything like that at this level. My brain always assumes that for unicode anything you need a library like ICU, or a higher level language.

                                    1. 4

                                      Glibc predates ICU (of a Sun ancestry) by 12 years.

                                      1. 2

                                        It probably does. A background project I’ve been working on is a Pascal 1973 Compiler (because the 1973 version is simple) and I played around with using the wide character functions in C99. When I use the wide character version (under Linux) I can see it loads /usr/lib/gconv/gconv-modules.cache and /usr/lib/locale/locale-archive. The problem I see with using this for a compiler though (which I need to write about) is ensuring a UTF-8 locale.

                                      2. 2

                                        you haven’t been around in the 90ies then

                                        – true

                                        only needing to deal with english input to software running on servers inter US

                                        – I’ve been using and enjoying uk_UA.utf8 on my personal machines since 2010, that is my point

                                        The rest of your assumptions does not apply to me in the slightest (“only English input”, “servers in the US”). I agree that I just missed the time when this functionality made sense.

                                        Still, I think that the standard library of C feels like a wrong place to put internationalization into. It’s definitely a weird GNUism to try to write as much end-user software in C as possible (but again, it was not an unreasonable choice at the time).

                                        1. 3

                                          Locale support is part of the POSIX standard and the glibc support for locales was all there was back in the 90ies; ICU didn’t exist yet.

                                          You can’t fault glibc for wanting to be conformant to POSIX, especially when there was no other implementation at the time

                                          1. 4

                                            Locale support is part of the POSIX standard

                                            Locales are actually part of C89, though the standard says any locales other than "C" are implementation-defined. The C89 locale APIs are incredibly problematic because they are unaware of threads and a call to setlocale in one thread will alter the output of printf in another. POSIX2008 adopted Apple’s xlocale extension (and C++11 defined a very thin wrapper around POSIX2008 locales, though the libc++ implementation uses a locale_t per facet, which is definitely not the best implementation). These define two things, a per-thread locale and _l-suffixed variants of most standard-library functions that take an explicit locale (e.g. printf_l) so that you can track the locale along with some other data structure (e.g. a document) and not the thread.

                                            1. 2

                                              Oh. Very interesting. Thank you. I know about the very problematic per-process setlocale, but I thought this mess was part of posix, not C itself.

                                              Still. This of course doesn’t change my argument that we shouldn’t be complaining about glibc being standards compliant, at least not when we’re comparing the complexities of two implementations when one is standard compliant and one isn’t.

                                              Not doing work is always simpler and considerable speedier (though in this case it turns out that the simpler implementation is also much slower), but if not doing work also means skipping standards compliance, then doing work shouldn’t be counted as being a bad thing.

                                              1. 3

                                                I agree. Much as I dislike glibc for various reasons (the __block fiasco, the memcpy debacle, and the refusal to support static linking, for example), it tries incredibly hard to be standards compliant and to maintain backwards ABI compatibility. A lot of other projects could learn from it. I’ve come across a number of cases in musl where it half implements things (for example, __cxa_finalize is a no-op, so if you dlclose a library with musl then C++ destructors aren’t run, nor are any __attribute__((destructor)) C functions, so your program can end up in an interesting undefined state), yet the developers are very critical of other approaches.

                                      3. 5

                                        What’s the point of using locales for unsigned char? 8-bit encodings?

                                        Yes, locales were intended for use with extended forms of ASCII.

                                        They were a reasonable solution at the time and it’s not surprising that there are some issues 30 years later. The committee added locales so that the ANSI C standard could be adopted without any changes as an ISO standard. This cost them another year of work but eliminated many of the objections to reusing the ANSI standard.

                                        If you’re curious about this topic then I would recommend P.J. Plauger’s The Standard C Library.
                                        He discusses locales and the rest of a 1988 library implementation in detail.

                                        1. 2

                                          Yes, locales were intended for use with extended forms of ASCII.

                                          Locales don’t just include character sets, they include things like sorting. French and German, for example, sort characters with accents differently. They also include number separators (e.g. ',' for thousands, '.' for decimals in English, ' ' for thousands, ',' for the decimal in French). Even if everyone is using UTF-8, 90% of the locale code is still necessary (though I believe that putting it in the core C/C++ standard libraries was a mistake).

                                          1. 2

                                            Date formatting is also different:

                                            $ LC_ALL=en_US.UTF-8 date
                                            Tue Sep 29 12:37:13 CEST 2020
                                            $ LC_ALL=de_DE.UTF-8 date
                                            Di 29. Sep 12:37:15 CEST 2020
                                      4. 1

                                        My post elsewhere explains a bit about how the glibc implementation works.

                                      5. 3

                                        I’m definitely surprised that isalnum would segfault. What is the reason for the spec not allowing for it to take values outside of EOF and unsigned chars? I especially like how proud they are that they support signed chars also, but segfault on other things still…

                                        1. 4

                                          What is the reason for the spec not allowing for it to take values outside of EOF and unsigned chars?

                                          C was first standardized in 1988 but was already widely used on a wide variety of machines, e.g. ones-complement, sign-magnitude, non power-of-2 word size, word-addressed machines, etc.
                                          The committee emphasized standardizing existing practice and low implementation costs.
                                          Function names, for example, were 6 characters because of limitations in MVS.
                                          It’s unlikely that the committee would have added bounds checking to ctype.h; the runtime costs would have been prohibitive on small machines.
                                          The specification allows implementor to add range checks or define a larger range of values, though.

                                          1. 2

                                            Becuase that’s outside the range of char types?

                                            It seems that GNU libc actually supports characters from -128 to 255 though, becuase a bare char is either a signed or unsigned (compiler’s choice) and they want to ensure both are handled correctly.

                                            1. 3

                                              The isalnum routine accepts an int though, not a char. But the docs say anything that isn’t an unaigned char or EOF is UB. You couldn’t design a better footgun, but I don’t know the history that led to such a thing.

                                              1. 2

                                                getchar() returns either an unsigned character (0..255) or EOF (represented as -1). This value is in the range -1..255, and is represented as an int, which was the only type that could represent this range when getchar() was first implemented. The istype functions are designed to take as input the same values that getchar() returns, so their arguments have type int instead of type unsigned char. I presume that the intent of designing the istype functions to return false on EOF was to remove a footgun. This all happened in the 1970’s, and character sets containing more than 256 characters AFAIK didn’t exist yet.

                                                1. 2

                                                  Maybe, but then they went and made it UB for giving an int outside the range. I guess you could argue it was for performance reasons, but the end result is still a terribly subtle footgun.

                                                2. 2

                                                  Presumably the lack of sum types built into the language led to this.

                                                  1. 2

                                                    union is a poor man’s sum type :)

                                            2. 2

                                              As the article mentions, the standard requires the input value be representable in an unsigned char, or EOF.

                                              Interestingly this has turned into a C++ footgun I’ve managed to fire in the past – using auto with a lambda in an STL algorithm working on a string can result in UB:

                                              count = std::count_if(s.begin(), s.end(), [](auto c) { return std::isalnum(c); });

                                              The problem is also mentioned on the cppreference page for isalnum.