  2. 2

    Still makes me sad that even in UTF-8 there are invalid code points, i.e., you have to double-inspect every damn byte if you’re doing data mining.

    Typically in data mining you are presented with source material. It’s not your material, it’s whatever is given to you.

    If somebody has screwed up the Unicode encoding, you can’t fix it. You have to work with whatever hits the fan, and everything else in your ecosystem is going to barf if you throw an invalid code point at it, even if it was just going to ignore it anyway.

    So you first have to inspect every byte to see whether it belongs to a valid code point and then, on the fly, squash the bad ones to the special invalid thingy (the replacement character, U+FFFD). That means double work for each byte, and you can’t just mmap the file.
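
    A minimal sketch in Python of that squashing step (the file name is made up for illustration): decoding with errors="replace" inspects every byte and turns each invalid sequence into U+FFFD instead of blowing up.

    ```python
    # Hypothetical input file; in practice it's whatever was mined.
    with open("mined_corpus.bin", "rb") as f:
        raw = f.read()

    # Every byte gets inspected during decoding; invalid UTF-8
    # sequences are squashed to U+FFFD (the replacement character)
    # rather than raising, so downstream tools that barf on bad
    # input never see it.
    text = raw.decode("utf-8", errors="replace")

    # Rough damage report. Caveat: this also counts any U+FFFD
    # that was legitimately present in the source.
    print(f"{text.count(chr(0xFFFD))} replacement(s) made")
    ```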

    Ah for The Good Old Bad Old Days of 8-bit ASCII.

    1. 6

      > Still makes me sad that even in UTF-8 there are invalid code points, i.e., you have to double-inspect every damn byte if you’re doing data mining.

      I disagree. It’s an amazing feature of UTF-8 because it lets me conclusively exclude UTF-8 from the list of possible encodings a body of text might have. No other 8-bit encoding has that property. A blob of bytes that happens to be text encoded in ISO-8859-1 looks exactly the same as a blob of bytes encoded in ISO-8859-3, but it can’t be UTF-8 (at least when it uses anything outside the ASCII range).
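
      A sketch of that conclusive exclusion, assuming Python’s strict UTF-8 decoder (the function name is mine):

      ```python
      def could_be_utf8(blob: bytes) -> bool:
          # UTF-8 is self-validating: a strict decode either succeeds
          # or conclusively rules UTF-8 out as the encoding.
          try:
              blob.decode("utf-8")
              return True
          except UnicodeDecodeError:
              return False

      # "café" in ISO-8859-1 is b'caf\xe9'; a lone 0xE9 is invalid UTF-8.
      print(could_be_utf8("café".encode("latin-1")))  # False
      print(could_be_utf8("café".encode("utf-8")))    # True
      ```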

      > Ah for The Good Old Bad Old Days of 8-bit ASCII.

      If you need to make sense of the data you have mined, the Old Days were as bad as the new days are: you’re still stuck guessing the encoding by interpreting the blob of bytes as each candidate encoding in turn and then checking whether the text makes sense in any of the languages that could plausibly have been used with that encoding.

      This is incredibly hard and error-prone.
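
      A hedged sketch of that guess-and-check loop (the candidate list is arbitrary; real detectors such as the chardet library layer per-language byte-frequency models on top, and are still only probabilistic):

      ```python
      CANDIDATES = ["utf-8", "cp1252", "iso-8859-1", "iso-8859-3"]

      def plausible_encodings(blob: bytes) -> list[str]:
          plausible = []
          for enc in CANDIDATES:
              try:
                  blob.decode(enc)
              except UnicodeDecodeError:
                  continue  # a failed decode rules this encoding out
              # A clean decode proves little: ISO-8859-1, for one,
              # assigns every byte value and thus never fails. From
              # here you're down to "does the text make sense?".
              plausible.append(enc)
          return plausible
      ```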

      1. 1

        I guess I’d like a Shiny New Future where nobody has to guess the encoding, because standards bodies and software manufacturers insist on making it explicit, and all software by default splats bad code points to the invalid marker instead of doing something really stupid like throwing an exception…

        Sigh.

        I guess for decades to come I’ll still fondly remember the Good Old Bad Old Days of everything-is-ASCII (and if it wasn’t, we carried on anyway)… I’m not going to hold my breath waiting for a sane future.

      2. 2

        > Ah for The Good Old Bad Old Days of 8-bit ASCII.

        It wasn’t ASCII, and that’s the point: There was no way to verify what encoding you had, even if you knew the file was uncorrupted and you had a substantial corpus. You could, at best, guess at it, but since there was no way to disprove any of your guesses conclusively, that wasn’t hugely helpful.

        I remember playing with the encoding functionality in web browsers to try to figure out what a page was written in, operating on the somewhat-optimistic premise that it had a single, consistent text encoding which didn’t change partway through. I didn’t always succeed.

        UTF-8 is great because absolutely nothing else looks like UTF-8. UTF-16 is fairly good because you can usually detect it with high confidence, too, even without a BOM. UCS-4 is good because absolutely nobody uses it to store or ship text across the Internet, as far as I can tell.
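
        The usual BOM-less UTF-16 heuristic rides on a crude but effective observation: ASCII-heavy UTF-16 text has a NUL in every other byte, a pattern essentially no 8-bit encoding produces. A rough sketch (the 30%/5% thresholds are arbitrary):

        ```python
        def sniff_utf16(blob: bytes) -> str | None:
            if len(blob) < 4:
                return None  # too little data to say anything
            # Fraction of NUL bytes at even vs. odd offsets.
            even, odd = blob[0::2], blob[1::2]
            even_nuls = even.count(0) / len(even)
            odd_nuls = odd.count(0) / len(odd)
            if even_nuls > 0.3 and odd_nuls < 0.05:
                return "utf-16-be"  # high (zero) byte first
            if odd_nuls > 0.3 and even_nuls < 0.05:
                return "utf-16-le"
            return None

        print(sniff_utf16("hello".encode("utf-16-le")))  # utf-16-le
        ```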