1. 17
  1. 2

    Going to shamelessly plug this write-up (and prior discussion of it on this site) drawn from things I’ve learned over the years for more context on the complexity of case in a world with more than ASCII. The Turkish dotted/dotless “i” is a useful example, but just handling that won’t get you all the way, or anywhere close to all the way, to proper case handling.

    And there’s also the fact — which I didn’t cover in that write-up — that which of the multiple available options you want is often sensitive to what you’re trying to accomplish. For example, comparing strings that are intended for use as security-sensitive identifiers is a different problem, with different constraints and recommended approaches, than title-casing an article’s headline.
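
To make the dotted/dotless "i" point concrete, here is a small Python sketch (Python is used only for illustration). It relies on the fact that Python's built-in `str.lower()` and `str.upper()` apply Unicode's default, locale-independent mappings, so the Turkish pairs do not round-trip. The helper `turkish_lower` is hypothetical and handles only the two i-related tailorings, which is exactly the "won't get you all the way" case: it is not a full Turkish casing implementation.

```python
DOTLESS_I = "\u0131"         # ı LATIN SMALL LETTER DOTLESS I
CAPITAL_DOTTED_I = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with only the two Turkish i tailorings."""
    # Map the Turkish pairs explicitly, then use the default mapping for the rest.
    return text.replace(CAPITAL_DOTTED_I, "i").replace("I", DOTLESS_I).lower()


# The default mapping turns "I" into a dotted "i", which is wrong for Turkish text.
default_result = "ISPARTA".lower()          # "isparta"
turkish_result = turkish_lower("ISPARTA")   # "\u0131sparta" (starts with dotless ı)

# Upper-then-lower does not round-trip for dotless ı under the default mapping.
round_trip = DOTLESS_I.upper().lower()      # "i", not "\u0131"

# Lowercasing İ by default yields two code points: "i" plus U+0307 COMBINING DOT ABOVE.
dotted_lower = CAPITAL_DOTTED_I.lower()     # "i\u0307"
```

Which of `default_result` and `turkish_result` is "right" depends on the language of the text, which the string itself does not carry.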

    1. 2

      Having worked a lot on this some time ago (I implemented a consumer replicated filesystem that supported a case-insensitive mode across different OSs), this write-up is great.

      My personal (shorter) take on this:

      • “case-insensitive” comparison is not well-defined. The “case-insensitive” part depends on the context and must be explicitly specified.

      • If you need to compare strings “case-insensitively” and deal with Unicode subtleties, you need to know what case folding and Unicode normalization are among other things (the write-up in the parent post is a great start).

      • Every single implementer of this made mistakes at first. Don’t implement it yourself; use a library like utf8proc, or at least read its source first.

      • If you do it for UX reasons, you can use system-provided functions, but if you do it to adhere to specs / match another system’s behavior they will almost always do the wrong thing.

      • Proper Unicode case-insensitive comparison is hard, and proper case-insensitive indexing is even harder (because compare_insensitive(a, b) is not something like compare(lowercase(a), lowercase(b))).

      • Proper Unicode case-insensitive comparison is super compute-expensive, so avoid it whenever you know you can. This is what cURL does (because they know they can afford to ignore non-ASCII cases). And if you cannot avoid it, consider clever caching. Yes, it is that slow.
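
The bullets above on case folding, normalization, and the ASCII-only shortcut can be sketched in Python as follows. This is an illustration under stated assumptions, not how utf8proc or cURL implement it: the names `caseless_equal` and `ascii_caseless_equal` are made up for this example. It uses only `str.casefold()` and `unicodedata.normalize()` from the standard library and follows the shape of Unicode's canonical caseless match (normalize, fold, normalize again).

```python
import unicodedata


def _nfd(text: str) -> str:
    """Canonical decomposition, so composed and decomposed forms compare equal."""
    return unicodedata.normalize("NFD", text)


def caseless_equal(a: str, b: str) -> bool:
    """Canonical caseless match: normalize, case-fold, then normalize again."""
    return _nfd(_nfd(a).casefold()) == _nfd(_nfd(b).casefold())


def ascii_caseless_equal(a: bytes, b: bytes) -> bool:
    """ASCII-only shortcut: bytes.lower() folds A-Z and leaves other bytes alone."""
    return a.lower() == b.lower()


# Lowercasing both sides is not the same as caseless comparison:
# "ß" stays "ß" under lower(), but folds to "ss" under casefold().
lower_matches = "STRASSE".lower() == "straße".lower()    # False
fold_matches = caseless_equal("STRASSE", "straße")       # True

# Folding alone does not normalize: "é" as one code point vs "e" + combining acute.
composed, decomposed = "\u00e9", "e\u0301"
fold_only_matches = composed.casefold() == decomposed.casefold()     # False
normalized_matches = caseless_equal(composed.upper(), decomposed)    # True

# The shortcut is fine for tokens known to be ASCII, such as protocol header names.
header_matches = ascii_caseless_equal(b"Content-Type", b"content-type")  # True
```

The ASCII-only version is cheap because it touches each byte once and consults no tables, but it treats "STRASSE" and "straße" as different, so it only fits inputs whose non-ASCII content can safely be compared byte for byte.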

    2. 1

      I’ve tried hard to stay away from any of the locale or non-ASCII APIs in C/C++ — they seem to be a mess and get in the way more than they help. Making the venerable low-level C functions like strcasecmp and isalpha change their behavior on ASCII characters due to some global state was a terrible idea.

      To really deal with I18N you need purpose-built APIs like ICU, or the various text APIs on Apple platforms, and I’m sure there are others. Core C/C++ was built long ago around ASCII and should stick with it.

      1. 3

        > To really deal with I18N you need purpose-built APIs like ICU, or the various text APIs on Apple platforms, and I’m sure there are others. Core C/C++ was built long ago around ASCII and should stick with it.

        The _l-suffixed versions of the APIs in POSIX 2008 originate with Apple. They appear to be designed explicitly to be able to support NSLocale. The C++11 locale APIs were then designed as a thin wrapper around POSIX 2008 locales (annoyingly, the way that facets were implemented in libc++ doesn’t take advantage of this very well and ends up with redundant copies of some very large libc structures). So, if you like OpenStep’s locale handling, you’ll probably like the current C/C++ APIs.