Going to shamelessly plug this write-up (and the prior discussion of it on this site), drawn from things I’ve learned over the years, for more context on the complexity of case in a world with more than ASCII. The Turkish dotted/dotless “i” is a useful example, but handling just that won’t get you all the way, or anywhere close to all the way, to proper case handling.
And there’s also the fact — which I didn’t cover in that write-up — that which of the multiple available options you want is often sensitive to what you’re trying to accomplish. For example, comparing strings that are intended for use as security-sensitive identifiers is a different problem, with different constraints and recommended approaches, than title-casing an article’s headline.
Having worked a lot on this some time ago (I implemented a consumer replicated filesystem that supported a case-insensitive mode across different OSs), I can say this write-up is great.
My personal (shorter) take on this:
“case-insensitive” comparison is not well-defined. The “case-insensitive” part depends on the context and must be explicitly specified.
If you need to compare strings “case-insensitively” and deal with Unicode subtleties, you need to know what case folding and Unicode normalization are among other things (the write-up in the parent post is a great start).
Every single implementer of this made mistakes at first. Don’t implement it yourself; use a library like utf8proc, or at least read its source first.
If you do it for UX reasons, you can use system-provided functions, but if you do it to adhere to specs / match another system’s behavior they will almost always do the wrong thing.
Proper Unicode case-insensitive comparison is hard, and proper case-insensitive indexing is even harder (because compare_insensitive(a, b) is not something like compare(lowercase(a), lowercase(b))).
Proper Unicode case-insensitive comparison is super compute-expensive, so if you know you can avoid it somehow, do so. This is what cURL does here (because they know they can afford to ignore non-ASCII cases). And if you cannot avoid it, consider clever caching. Yes, it is that slow.
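To make a few of the points above concrete, here is a minimal Python sketch. It is only an illustration, not code from the write-up, utf8proc, or cURL. It assumes Python's standard `unicodedata` module and `str.casefold()`, and the function names (`caseless_key`, `equal_caseless`, `equal_ascii_caseless`) are made up for this example. The first helper follows the "canonical caseless matching" recipe (normalize, case fold, normalize again); the last one is the cheap ASCII-only shortcut you can take when you know non-ASCII input does not matter.

```python
import unicodedata


def caseless_key(s: str) -> str:
    """Build a comparison key: NFD-normalize, case fold, NFD-normalize again.

    This mirrors Unicode's canonical caseless matching. It uses the default
    (locale-independent) case folding, so Turkic dotted/dotless "i" rules
    are NOT applied here.
    """
    decomposed = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", decomposed.casefold())


def equal_caseless(a: str, b: str) -> bool:
    """Compare two strings using the key above."""
    return caseless_key(a) == caseless_key(b)


def equal_ascii_caseless(a: bytes, b: bytes) -> bool:
    """ASCII-only comparison: bytes.lower() folds only A-Z.

    Every non-ASCII byte has to match exactly, which is cheap but only
    appropriate when the inputs are known to be ASCII-ish (for example
    protocol keywords).
    """
    return a.lower() == b.lower()


if __name__ == "__main__":
    # Lowercasing is not case folding: "ß" lowercases to itself
    # but folds to "ss".
    print("Straße".lower() == "STRASSE".lower())   # False
    print(equal_caseless("Straße", "STRASSE"))     # True

    # Normalization matters too: precomposed "é" vs. "E" + combining acute.
    print(equal_caseless("\u00e9", "E\u0301"))     # True

    # The ASCII shortcut deliberately ignores non-ASCII case pairs.
    print(equal_ascii_caseless(b"Content-Type", b"content-type"))          # True
    print(equal_ascii_caseless("É".encode("utf-8"), "é".encode("utf-8")))  # False
```

If the key is expensive to compute for your workload, storing `caseless_key(s)` next to the original string is one simple form of the caching mentioned above. Which folding and normalization a given spec or filesystem actually requires still has to be checked case by case.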
I’ve tried hard to stay away from any of the locale or non-ASCII APIs in C/C++ — they seem to be a mess and get in the way more than they help. Making the venerable low-level C functions like strcasecmp and isalpha change their behavior on ASCII characters due to some global state was a terrible idea.
To really deal with I18N you need purpose-built APIs like ICU, or the various text APIs on Apple platforms, and I’m sure there are others. Core C/C++ was built long ago around ASCII and should stick with it.
The _l-suffixed versions of the APIs in POSIX 2008 originate with Apple. They appear to be designed explicitly to be able to support NSLocale. The C++11 locale APIs were then designed as a thin wrapper around POSIX 2008 locales (annoyingly, the way that facets were implemented in libc++ doesn’t take advantage of this very well and ends up with redundant copies of some very large libc structures). So, if you like OpenStep’s locale handling, you’ll probably like the current C/C++ APIs.