One wrinkle that you may not realize from reading this article is that different human languages have different sort orders for the same characters* — for example, IIRC the character “å” has a different position in the alphabet in Swedish and Norwegian — so proper string sorting requires knowing what language the strings are in.
…which is not necessarily the same as the system’s current locale. If I’m bilingual I may have my OS configured for English but still work with a lot of German text.
* Oh, and this doesn’t just happen for those weird foreign characters, it can apply to ASCII too. Spanish has special sort rules for “ch” and “ll” — basically they get treated as though they were a single letter that comes after “c” and “l” respectively.
(Again IIRC. I worry I’m getting these examples wrong from memory. The ICU documentation, which is where I learned this, has the straight dope.)
This doesn’t detract from the main point of your post.
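To make the point concrete, here is a minimal sketch of sorting with an explicitly chosen language instead of whatever the system locale happens to be. Everything in it is an assumption for illustration: `ALPHABETS`, `EQUIVALENTS`, `collation_key` and `sort_words` are hypothetical names, and the hand-written alphabets are a toy stand-in for real collation data such as ICU's. It only handles lowercase letters and ignores accents, case tie-breaking and everything else a real collator does.

```python
# Toy collation tables (hand-written assumptions, not real ICU/CLDR data).
ALPHABETS = {
    "sv": "abcdefghijklmnopqrstuvwxyzåäö",  # Swedish: å ä ö at the end
    "nb": "abcdefghijklmnopqrstuvwxyzæøå",  # Norwegian: æ ø å at the end
}

# Letters borrowed from the neighbouring language are folded onto the
# closest native letter (a simplification of what real collators do).
EQUIVALENTS = {
    "sv": {"æ": "ä", "ø": "ö"},
    "nb": {"ä": "æ", "ö": "ø"},
}


def collation_key(word, language):
    """Return a list of alphabet positions for `word` in `language`."""
    alphabet = ALPHABETS[language]
    folded = word.lower().translate(str.maketrans(EQUIVALENTS[language]))
    return [
        # Characters outside the alphabet sort after every known letter.
        alphabet.index(ch) if ch in alphabet else len(alphabet) + ord(ch)
        for ch in folded
    ]


def sort_words(words, language):
    """Sort using the language of the text, passed in by the caller."""
    return sorted(words, key=lambda w: collation_key(w, language))


words = ["öl", "åker", "zebra", "ärlig"]
swedish_order = sort_words(words, "sv")    # å sorts before ä and ö
norwegian_order = sort_words(words, "nb")  # å sorts after æ/ä and ø/ö
```

The design point is that `language` is an argument supplied by whoever knows what the text is, not something read from the OS settings. In real code you would hand that language tag to a proper collation library instead of maintaining tables like these.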
Spanish has special sort rules for “ch” and “ll” — basically they get treated as though they were a single letter that comes after “c” and “l” respectively.
“ch” and “ll” have not been considered their own letters in Spanish for around 25 years. I was in elementary school when they stopped being their own letters (I want to say ’94). Also (this I’m less sure of, as I was just starting to use dictionaries back then) they didn’t affect the sorting. In the dictionary, a word with ‘ce’ in the middle would be found before a word with ‘ch’ in the middle (e.g. ‘hace’ comes before ‘hacha’). The only difference is that words starting with ‘ch’ would be in a separate section of the dictionary.
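For anyone who wants to see what the two conventions would do to a word list, here is a rough sketch. It assumes two hand-written alphabets, one where “ch” and “ll” count as single letters after “c” and “l”, and one where they are ordinary letter pairs. The names `make_key`, `TRADITIONAL` and `MODERN` are made up for this example, accented vowels are not handled, and none of this is taken from a real collation library. Note that ‘hace’ and ‘hacha’ come out in the same order under both schemes, so the sketch uses ‘hacia’ and ‘cuna’ to show where the two actually diverge.

```python
def make_key(alphabet):
    """Build a sort key from an ordered list of letters (digraphs allowed)."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def key(word):
        word = word.lower()
        ranks = []
        i = 0
        while i < len(word):
            # Try the longest candidate first, so "ch" wins over "c".
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Unknown characters sort after every known letter.
                ranks.append(len(alphabet) + ord(word[i]))
                i += 1
        return ranks

    return key


# Hypothetical alphabets for illustration only.
MODERN = list("abcdefghijklmnñopqrstuvwxyz")
TRADITIONAL = (
    list("abc") + ["ch"] + list("defghijkl") + ["ll"] + list("mnñopqrstuvwxyz")
)

words = ["hacia", "hacha", "hace", "cuna", "chico"]
modern_order = sorted(words, key=make_key(MODERN))
traditional_order = sorted(words, key=make_key(TRADITIONAL))
```

With the digraph alphabet, ‘chico’ moves after ‘cuna’ and ‘hacha’ moves after ‘hacia’. With the plain alphabet, both pairs sort letter by letter.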
Is there a similar sorting issue for the Dutch “ij”, which is, I think, the ASCII rendering of ÿ?
I found a big list of language collation rules, and it says “ij” isn’t treated specially any more … except in phone books.