1. 17
  1.  

  2. 4
    1. 5

      If as a website user I’m looking at a drop-down list of Polish cities, I want to see Łódź somewhere between Lublin and Malbork, not somewhere after Zamość.

      1. 2

        Is the database the right place to solve that? What if a German user asks for a list of German cities? Do we create a separate table for every country?

        1. 2

          In Postgres, you can use collation-aware sorting at the query level (like this: SELECT * FROM regexp_split_to_table('abcćdeęfgoöä', '') s (l) ORDER BY s.l COLLATE "de_DE.utf8";), at the column level (if, for example, you have table “products” and columns “german_name”, “russian_name” and so on) and at the database level. The server will use the most specific one. Indexes can be sorted according to a collation, too, so you can have multiple indexes on the same column but with different collations.

          1. 1

            If you’re generating a per-country city list, sorting that list (what max ~ 200 entries?) by locale specific string sort order should be trivial. Unless you’re in a world where switching locale is extremely heavyweight?

            1. 1

              Yes, in many cases it’s the right place. I believe one of the most common applications will be pagination + ordering. If you can’t do ORDER BY ... LIMIT ... at the database level and expect meaningful results then the alternative is fetching all rows, ordering them in the app, and slicing the resulting array to get the results you want. For larger data sets this isn’t feasible.

              The problem of handling multiple locales in a single app is definitely interesting and I don’t have an easy answer. It’s easy to enforce some order on the rows but it doesn’t mean it’ll make sense to the user.

        2. 4

          Also, note that this problem can bite even if you don’t use streaming replication. I’ve had the opportunity to clean up after an upgrade from ICU-enabled Postgres to a clean one. So of course, collation-sensitive existing btree indexes became immediately corrupted from the point of view of the new binary. The user-visible effect was that some unlucky people weren’t able to log in (the login action was based on a query like SELECT ... FROM users WHERE login = 'pstef' which was unable to find the tuple in the index) and registered again using the same login (because their logins apparently weren’t taken - again, the query didn’t find it in the index). So then, after the root cause has been found and fixed, the duplicates had to be dealt with. Fun times.

          1. 3

            I’ll quote what Thomas Munro (a PG and FreeBSD hacker) said in the HN thread about this story. It’s one of the topics he’s invested in.

            PostgreSQL has the beginnings of a solution to this problem: version tracking for collations. I think it should be extended to track versions in a more fine grained way so that it can detect and reject this scenario (and related scenarios), and I think that libc implementations should provide a way to expose the version. I have proposed this for FreeBSD libc, and I hope someone proposes something similar for glibc (maybe me eventually). I think PostgreSQL should continue to support both ICU and libc collations (I don’t think it’s reasonable for every piece of software to use its own collation system, or for everyone to switch to ICU, I think libc should do a slightly better job).

            More recent discussion of various possible approaches on PostgreSQL -hackers list: https://www.postgresql.org/message-id/flat/CAEepm%3D0uEQCpfq

            Proposal to add versions to libc collations in FreeBSD: https://reviews.freebsd.org/D17166

            1. 2

              This mailing this thread is from August 2014, and it’s been almost 4 and half years since then. Is this still relevant?

              1. 1

                They are still making new Unicode standards, so at least as far as strcoll, that can change at any time.

              2. 1

                Wow. This is outrageously dangerous. I wonder what the hell glibc is doing to be so wildly confused, even in different builds of the same version.

                1. 3

                  Unicode collations aren’t guaranteed to have a stable sort across releases. Presumably glibc packaged a newer unicode table?

                  The real issue is that postgres assumes that any given collation name (eg en-US) is the same collation across machines. That’s vanishingly unlikely to be true unless they were built from the exact same packages.

                  1. 3

                    That doesn’t explain why changing the linkage of the exact same version of the library changed the order.

                    -1 -> 2.17-93ubuntu4
                    15 -> 2.17-93ubuntu4 statically linked