1. 5
  1.  

  2. 1

    PS: A message to Unicode library authors in all languages...

    Getting invalid bytes muddled into a stream of valid Unicode is not an exceptional event.

    Line noise, malice, clueless users, and bad code are an every-blooming-second fact of life. A fact you are probably not even allowed to correct, or capable of correcting.

    Expect it, quash it to an invalid character, don’t throw an exception.

    1. 4

      Disagree. Invalid bytes in a Unicode stream are like syntax errors in a JSON or XML file, and should be handled the same way. Errors should not pass silently unless explicitly silenced.

      1. 1

        If you can “Halt The Line”, and go back and make it right… Yip. Fine. It’s an exception.

        If it is a user’s file, one of ten thousand files, or something coming in over the ‘net… sorry, you are just going to have to live with it and make do with the rest of the code points as best you can.

        Yip, it’s a parse error, so squash it to an invalid character and return a count of how many invalids you saw.

        Give up and refuse to process it? Do you still want a job tomorrow?

        Throwing your hands up in horror, wailing, and curling up in a soggy heap has never, in my world, been an option.
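
        The "squash it to invalid and return a count" idea might look roughly like this in Python. This is only a sketch: `decode_lossy` and the handler name `count_and_replace` are made up for illustration, and the sketch assumes the input is meant to be UTF-8. It relies on the standard `codecs.register_error` hook, and re-registering the handler on every call is not thread-safe.

        ```python
        import codecs


        def decode_lossy(data: bytes) -> tuple[str, int]:
            """Decode UTF-8, replacing each invalid sequence with U+FFFD.

            Returns the decoded text and the number of invalid sequences seen.
            """
            invalid = 0

            def count_and_replace(exc):
                nonlocal invalid
                if not isinstance(exc, UnicodeDecodeError):
                    raise exc  # this handler only knows about decode errors
                invalid += 1
                # Substitute U+FFFD and resume decoding after the bad bytes.
                return ("\ufffd", exc.end)

            # Hypothetical handler name; registering again simply overwrites it.
            codecs.register_error("count_and_replace", count_and_replace)
            return data.decode("utf-8", errors="count_and_replace"), invalid


        text, bad = decode_lossy(b"ab\xffcd")
        # text is "ab\ufffdcd" and bad is 1
        ```

        The caller still gets every valid code point, plus a number it can log or compare against a threshold before deciding the file is too damaged to trust.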

        1. 1

          This is the sort of thing where a Common Lisp / Dylan style “condition” system rather than exceptions is really nice. You can readily supply different handlers and let the caller decide what to do.
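
          Python has no condition system, but its codec error handlers give a rough flavour of "let the caller decide". A minimal sketch, where `decode_with_policy` is a hypothetical wrapper and the policy names are the ones built into Python's codec error registry:

          ```python
          def decode_with_policy(data: bytes, policy: str = "strict") -> str:
              """Decode UTF-8; the caller picks the error policy by name.

              `policy` is any name known to Python's codec error registry, such as
              "strict", "replace", "ignore", or "backslashreplace".
              """
              return data.decode("utf-8", errors=policy)


          raw = b"ok \xff end"  # 0xFF can never appear in valid UTF-8

          lenient = decode_with_policy(raw, "replace")           # bad byte becomes U+FFFD
          dropped = decode_with_policy(raw, "ignore")            # bad byte is discarded
          escaped = decode_with_policy(raw, "backslashreplace")  # bad byte kept as \xff text

          try:
              decode_with_policy(raw)  # default "strict": this caller wants to halt
              halted = False
          except UnicodeDecodeError:
              halted = True
          ```

          Unlike a real condition system there are no restarts to choose between, but the library code stays the same while each call site picks the behaviour it needs.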

      2. 1

        Most people have realised you need to test with non-ASCII characters, but don’t forget to test with astral characters (code points above U+FFFF, outside the Basic Multilingual Plane).
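
        As a rough illustration of why astral characters deserve their own test case, here is a small Python sketch. The helper `utf16_units` and the sample string are assumptions chosen for the example; the point is that one astral code point takes two UTF-16 code units and four UTF-8 bytes, which is where length and indexing bugs tend to hide.

        ```python
        def utf16_units(s: str) -> int:
            """Number of UTF-16 code units needed to encode s."""
            return len(s.encode("utf-16-le")) // 2


        # One ASCII character, one non-ASCII BMP character, one astral character.
        sample = "a\u00e9\U0001F600"

        code_points = len(sample)              # 3 code points
        units = utf16_units(sample)            # 4: U+1F600 needs a surrogate pair
        utf8_len = len(sample.encode("utf-8")) # 7: 1 + 2 + 4 bytes
        round_trip = sample.encode("utf-8").decode("utf-8") == sample
        ```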