1. 14
  1.  

  2. 7

    One wrinkle that you may not realize from reading this article is that different human languages have different sort orders for the same characters* — for example, IIRC the character “å” has a different position in the alphabet in Swedish and Norwegian — so proper string sorting requires knowing what language the strings are in.

    …which is not necessarily the same as the system’s current locale. If I’m bilingual I may have my OS configured for English but still work with a lot of German text.

    * Oh, and this doesn’t just happen for those weird foreign characters, it can apply to ASCII too. Spanish has special sort rules for “ch” and “ll” — basically they get treated as though they were a single letter that comes after “c” and “l” respectively.

    (Again IIRC. I worry I’m getting these examples wrong from memory. The ICU documentation, which is where I learned this, has the straight dope.)

    1. 2

      Is there a similar sorting issue for the Dutch “ij”, which I think is the ASCII rendering of ÿ?

      1. 2

        I found a big list of language collation rules, and it says “ij” isn’t treated specially any more … except in phone books.

      2. 2

        This doesn’t detract from the main point of your post.

        > Spanish has special sort rules for “ch” and “ll” — basically they get treated as though they were a single letter that comes after “c” and “l” respectively.

        ch and ll have not been considered their own letters in Spanish for about 25 years. I was in elementary school when they stopped being their own letters (I want to say ’94). Also (this I’m less sure of, as I was just starting to use dictionaries back then) they didn’t affect the sorting. In the dictionary, a word with ‘ce’ in the middle would be found before a word with ‘ch’ in the middle (e.g. ‘hace’ comes before ‘hacha’). The only difference is that words starting with ch were in a separate section of the dictionary.
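
To make the two Spanish orderings discussed above concrete, here is a minimal sketch in Python. It is a toy, not how ICU or any real collation library works: `make_sort_key`, `MODERN`, `TRADITIONAL`, and `WORDS` are hypothetical names made up for illustration, the alphabets are hand-written assumptions, and accents and case are handled only crudely. Real code should hand this job to a collation library and tell it which language the text is in.

```python
def make_sort_key(alphabet):
    """Build a sort-key function from an ordered list of collation elements.

    Elements may be longer than one character (e.g. "ch"), which is how a
    digraph can be ranked as if it were a single letter.
    """
    rank = {element: i for i, element in enumerate(alphabet)}
    longest = max(len(element) for element in alphabet)

    def sort_key(word):
        word = word.lower()
        key = []
        i = 0
        while i < len(word):
            # Greedily try the longest possible element first, so "ch"
            # wins over "c" when the alphabet contains both.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in rank:
                    key.append(rank[chunk])
                    i += len(chunk)
                    break
            else:
                # Unknown characters (accents, punctuation) sort after the
                # whole alphabet, by code point. This is a simplification.
                key.append(len(alphabet) + ord(word[i]))
                i += 1
        return key

    return sort_key


# Assumption: the modern rule treats "ch" and "ll" as ordinary letter pairs.
MODERN = list("abcdefghijklmnñopqrstuvwxyz")

# Assumption: the traditional rule ranks "ch" right after "c" and "ll"
# right after "l", as described in the quoted comment.
TRADITIONAL = []
for letter in MODERN:
    TRADITIONAL.append(letter)
    if letter == "c":
        TRADITIONAL.append("ch")
    elif letter == "l":
        TRADITIONAL.append("ll")

WORDS = ["hacia", "hacha", "cuna", "chico", "luz", "llave"]

print(sorted(WORDS, key=make_sort_key(MODERN)))
# ['chico', 'cuna', 'hacha', 'hacia', 'llave', 'luz']
print(sorted(WORDS, key=make_sort_key(TRADITIONAL)))
# ['cuna', 'chico', 'hacia', 'hacha', 'luz', 'llave']
```

Note that ‘hace’ sorts before ‘hacha’ under both alphabets, so that pair cannot tell the two rules apart. A pair like ‘hacia’ and ‘hacha’ can: it flips depending on whether “ch” counts as one element or two.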