1. 12

  2. 18

    In short, if you’re building a web application and you’re accepting input from users, you should always normalize it to a canonical form in Unicode.

    Strong disagree, you should use string comparisons that understand Unicode. You can’t achieve “intuitive matching” via normalization without discarding information from the user’s input.

    For example, the ffi ligature “ﬃ” (U+FB03):

    > x = 'ﬃ'
    > x === 'ffi'
    false
    > ['NFC', 'NFD', 'NFKC', 'NFKD'].map(nf => x.normalize(nf) === 'ffi')
    [false, false, true, true]

    Only the K normalization forms will make this comparison equal. However:

    > ['NFC', 'NFD', 'NFKC', 'NFKD'].map(nf => x.normalize(nf))
    ["ﬃ", "ﬃ", "ffi", "ffi"]

    … you have lost information from the user’s input: “ﬃ” has been decomposed to “ffi”. If you store the user’s data normalized, then you’ve thrown things away.

    Instead, use localeCompare (or Intl.Collator) to properly do language-sensitive string comparisons:

    > 'ﬃ'.localeCompare('ffi', 'en', { sensitivity: 'case' })

    As always, remember that equivalence is language-sensitive:

    > 'i'.localeCompare('I', 'en', { sensitivity: 'base' })
    > 'i'.localeCompare('I', 'tr', { sensitivity: 'base' })

    There are no simple one-size-fits-all solutions when dealing with text. You must consider your scenario and what you are trying to achieve. A rule like “always normalize everything” can cause trouble.
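
    To make that concrete: an Intl.Collator instance makes the comparison reusable across many strings. A minimal sketch (the compared strings are mine, and the Turkish result assumes a runtime with full locale data, e.g. a full-ICU Node build):

    ```javascript
    // Reusable language-sensitive comparison via Intl.Collator.
    // sensitivity: 'base' compares at the primary collation level, so
    // case, accents, and compatibility ligatures are ignored, while the
    // stored strings themselves stay untouched.
    const en = new Intl.Collator('en', { sensitivity: 'base' });
    const tr = new Intl.Collator('tr', { sensitivity: 'base' });

    // The ffi ligature (U+FB03) matches its spelled-out form:
    console.log(en.compare('\uFB03', 'ffi') === 0);

    // But equivalence is locale-dependent: dotted vs. dotless I in Turkish.
    console.log(en.compare('i', 'I') === 0);
    console.log(tr.compare('i', 'I') === 0);
    ```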

    1. 6

      One way to interpret the fine article is to normalize not when storing the data, but rather when performing string comparisons. I personally always store data exactly as the user entered it: SQL injection attempts, XSRF attempts, and all. I then normalize the data at the point of use, i.e. when comparing passwords or outputting HTML. The only place where I would suggest storing normalized text is in a separate full-text search engine, such as ElasticSearch.
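
      That store-raw / compare-normalized split might look like this in JavaScript (a sketch; the helper name and the choice of NFC are illustrative, not from the comment):

      ```javascript
      // Store exactly what the user typed; normalize only at the
      // point of comparison, never in the stored value.
      function sameText(a, b) {
        return a.normalize('NFC') === b.normalize('NFC');
      }

      const stored = 'Zoe\u0308'; // "Zoë" exactly as entered (e + combining diaeresis)

      console.log(sameText(stored, 'Zo\u00EB')); // matches the precomposed spelling
      console.log(stored === 'Zo\u00EB');        // raw comparison still sees the difference
      ```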

      1. 1

        (Yes, that’s better, although I’m also fairly sure that NFKC/NFKD fail for comparisons in other ways, but I didn’t have time to come up with any.)

      2. 1

        When creating a file on macOS with HFS+, the filesystem driver always normalizes the filename and stores it in NFD form. So it decomposes the characters even if the user passes a composed form when creating the file. The user may then access the file with either the composed or the decomposed filename.

        (Haven’t checked if that’s the case with APFS)

        1. 2

          APFS: store as is, compare under normalization
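
          Both policies can be modeled in a few lines (a sketch, with a Map standing in for the directory; the filenames are illustrative):

          ```javascript
          const composed = 'caf\u00E9';    // "café", precomposed é
          const decomposed = 'cafe\u0301'; // "café", e + combining acute

          // HFS+-style: force the stored name to NFD, losing the original spelling.
          const hfs = new Map([[composed.normalize('NFD'), 'contents']]);
          console.log(hfs.has(decomposed.normalize('NFD'))); // lookup works either way

          // APFS-style: store the name as given, normalize only when comparing.
          const apfs = new Map([[composed, 'contents']]);
          const lookup = name =>
            [...apfs.keys()].find(k => k.normalize('NFD') === name.normalize('NFD'));
          console.log(lookup(decomposed) === composed); // found, and the stored key is unchanged
          ```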

      3. 3

        the length examples are highly misleading. first they give this:

        > 'e\u0308'.length

        which yields “2” (that’s “ë” written as e + U+0308 COMBINING DIAERESIS). Then supposedly you can “fix” this by normalizing:

        > 'e\u0308'.normalize('NFC').length

        however that only works in the case of combining characters. With normal SMP characters it still chokes:

        > '😃'.length

        which is still 2, because .length counts UTF-16 code units and an SMP character is a surrogate pair.

        1. 2

          There seems to be an implied premise that only one person can be named Zoë? What if there’s another Zoë the next town over? If you are using unicode strings as unique identifiers (not personally recommended), you have many more issues to worry about than just normalization.

          1. 4

            Obviously Zoë is the derivative of Zoė and the second derivative of Zoe. By induction, there can be only one.

            1. 5

              You are assuming Zoe is everywhere both smooth and continuous. Can you guarantee this?

              1. 5

                I have not seen Zoe since high school, so things may have changed.

          2. 1

            If only encoding standards had been developed in German or Chinese, or basically any language more complex than English, then maybe we would have had sane encodings from day one, rather than the assumption that every text is written in English, with no accents, etc.

            1. 1

              I love that this just works in perl 6. Very modern, very nice.