1. 8

  2. 10

    Crude code point sorting is a sensible default if you expect consistent results, and in many cases I believe that’s exactly what programmers expect. Imagine if your binary search didn’t work on my dataset because we used different locale settings. You can opt in if you want it, but the problems introduced by imposing locale behavior on programs not expecting it are much worse than aesthetic.

    You’ll occasionally come across the accidentally locale aware code generator. Fun times when it prints floating point numbers as 3,14.

    1. 2

      It’s dependent on what you’re going to use the sorted result for. A binary search doesn’t care what you’re actually sorting on, as long as it’s a total order and it’s fast. User display needs to be locale aware and consistent with user expectations. Implementing a spec (bencode, for example) that demands sorting requires you to implement it exactly as they specify.

      So, it’s all about context. As you probably expected.

    2. 3

      Unicode normalization solves this problem. I guess a simple algo would transform the set into [[original, normalized], ...] sort it based on the second element, then just ‘unwrap it’ to [original, ...]. It’s a shame that not many new languages do this, but then again, not many new languages know how to deal with unicode all that well imo.

      1. 4

        There’s a counterexample right in the introduction of https://unicode.org/reports/tr10/. Blindly applying “unicode” sort only shuffles things around but still isn’t right.

        Languages vary regarding which types of comparisons to use (and in which order they are to be applied), and in what constitutes a fundamental element for sorting. For example, Swedish treats ä as an individual letter, sorting it after z in the alphabet; German, however, sorts it either like ae or like other accented forms of a, thus following a. In Slovak, the digraph ch sorts as if it were a separate letter after h.

        1. 1

          Is normalization the same as collation in this case?

        2. 2

          Not really impressed with this post, to be honest…

          Returning characters by code point sorting is eminently reasonable, and then you hand the actual display over to a collation library. Demanding that é sorts immediately after e just hardcodes a collation assumption into the language!

          However, I am not sure what the equivalent is in Python and Swift. It does not jump at me in the documentation of the respective standard librairies. I did not even look at it for other popular programming languages like Go, C++, and so forth.

          This statement, while honest, smacks of laziness.

          Instead of complaining about how JS sorts numbers (alphabetically by default) I’d be more interested if the author had actually given an example on how to use collation code in JS or any other language.