1. 33

    As a software engineer, Unicode puts a lot of complexity on my table and much of that I really wouldn’t need.

    The “I” in that sentence is significant. Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English (or computer languages/syntaxes based on English.) There are a lot of equivalent complaints about e.g. date/time formats, or all the different schemas for people’s names and addresses. And you can go further, by complaining about having to deal with accessibility.

    (There’s a joke that emoji were invented as a way to make American programmers finally have to give a shit about Unicode.)

    1. 17

      This prompted me to write a post about encoding issues in pre-Unicode Russia.

      1. 3

        Thanks for this. I had no idea KOI8-R was in a different order than the actual alphabet.
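
        The ordering is easy to see with Python's built-in `koi8_r` codec. A minimal sketch (the sample letters are an arbitrary pick, not taken from the article): sorting by encoded byte value gives a different order than the alphabet does.

        ```python
        # Sort a few Cyrillic letters two ways: by Unicode code point (which
        # follows the alphabet for these letters) and by KOI8-R byte value.
        letters = ["а", "б", "в", "ц", "ю"]

        alphabetical = sorted(letters)
        by_koi8_byte = sorted(letters, key=lambda ch: ch.encode("koi8_r"))

        print(alphabetical)
        print(by_koi8_byte)
        ```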

        1. 3

          Great article! I just submitted it.

        2. 13

          The “I” in that sentence is significant. Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English (or computer languages/syntaxes based on English.)

          And not just languages other than English, but also everything that’s not, specifically, of English origin, which is actually a significant hurdle even to native English speakers in English-speaking countries.

          It’s very easy to underestimate how important this is if you’re only writing (most types of) software, especially if you’re only writing programmer-facing software. Many years ago I also thought ASCII was actually very much sufficient for a long time, even though I’m not a native English speaker and my own name is spelled with a stupid character that isn’t in CP437. I was happy to just write it without the diacritics, it was easy to figure out.

          Well… it’s not quite like that everywhere. If you’re doing history or any of its myriad of “auxiliary sciences”, from archaeology to numismatics and from genealogy to historiography, if you’re doing philosophy, linguistics (obviously!), geography, philology of any kind, you’re going to hit Unicode pretty quickly. Even if you’re doing it all in English, for an English-speaking audience, and even if you’re not doing it in any “pure” form – there’s lots of software these days that powers museums (from administration/logistics to exhibitions), there’s GIS software of all kinds, there’s software for textual analysis and Bible study and so on. That gives you a huge amount of software that needs to be Unicode-aware, from word processors and email clients to web browsers and graphics programs.

          I remember pre-Unicode software that handled all this. I never seriously used it, it was kind of before my time, but I could tell it was not just extraordinarily awkward to use, it was extremely brittle. You’d save something on one computer, email it or give the floppy to a colleague, and all the ăs and îs would show up as ⊵s and ☧s.
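
          That failure mode is easy to reproduce. A minimal sketch, assuming the text was saved as ISO-8859-2 and opened by software that assumes CP437 (the pairing of code pages is an illustrative guess, not a claim about any particular program):

          ```python
          original = "ăî"

          # The sender's software writes one byte per character using ISO-8859-2.
          raw = original.encode("iso8859_2")

          # The recipient's software assumes CP437, so the same bytes map to
          # unrelated glyphs. No error is raised; the text is silently wrong.
          garbled = raw.decode("cp437")

          print(repr(original), "->", repr(garbled))

          # The bytes themselves are intact: the right table recovers the text.
          recovered = garbled.encode("cp437").decode("iso8859_2")
          ```

          The bytes never change. Only the table used to interpret them does, which is why nothing warns you.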

          And that’s leaving out “mundane” things like software that handles patient records, which needs to be aware of these things because you never know when it’ll get big. It’ll be really embarrassing when expanding into the Chinese market needs a massive development and testing budget because you built a unicorn on software that only recognizes 30 or so letters. It’s a huge, real-life problem, even for software that’s built specifically for native English speakers who live in English-speaking countries.

          1. 3

            Because my bank uses TIS encoding on their back-end despite ‘supporting’ English, I can’t send support messages with unsupported characters like commas, or dashes, or colons, etc. Using ASCII is definitely the same issue.

            1. 2

              Have you read the addendum at the end of the post?

              Some people read this as argument against Unicode, which it is not. I don’t want to come back to ISO-8859-1, because that sucks. Also, I’m ready to deal with lot of complexity to allow people write their names properly.


              1. 1

                Kudos to the author for clarifying their position better. Thanks for bringing this to our attention.

              2. 1

                With encodings like ISO-8859-1 you can cover most European languages, which includes most (all?) of the Americas and a chunk of Africa and Asia as well; you don’t need to be stuck with ASCII from 1966, and in practice most software hasn’t been since the 90s and the effort to make things “8bit clean”.

                For a surprising number of businesses/software, it turns out you don’t really need Unicode to allow people to enter all the reasonable content they need to enter, and since Unicode is somewhat complex, it turns out a lot of “Unicode clean” solutions aren’t actually clean in a whole list of edge cases.

                1. 8

                  Immigrants from other countries might disagree; if you’re a Vietnamese-American or Turkish-German, 8859-1 may not allow you to spell your own name correctly.

                  It’s also missing a lot of common typographic niceties like typographer’s (curly) quotes and, IIRC, em-dashes. Which is a deal-breaker in some markets, like publishing. It also messes up text that was written in any word processor (or even Apple’s TextEdit) with smart quotes enabled.
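
                  A quick way to check this with Python's built-in `latin-1` codec. The helper name and the sample strings are illustrative picks:

                  ```python
                  def fits_latin1(text: str) -> bool:
                      """Return True if every character in text exists in ISO-8859-1."""
                      try:
                          text.encode("latin-1")
                          return True
                      except UnicodeEncodeError:
                          return False

                  # A Western European name, a Vietnamese name, a Turkish name,
                  # curly quotes, and an em dash (written as an escape).
                  samples = ["José", "Nguyễn", "Doğan", "“quoted”", "a \u2014 b"]
                  for s in samples:
                      print(s, fits_latin1(s))
                  ```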

                  Really, UTF-8 is pretty easy to work with, and modern string libraries handle Unicode well, so I don’t see any compelling reason to stick with an older limited encoding.

                  1. 7

                    When you make the decision to do only ISO-8859-1, you are also – whether you explicitly think about it or not – making the decision to exclude from your user/customer base everyone on earth whose preferred language, or even just name, uses script(s) not covered by it. Even in the US or Europe this will exclude significant numbers of people; for example, in the US there are two languages with more than one million resident speakers which do not use any form of Latin script, and a third which requires diacritics not supported by ISO-8859-1.

                    And depending on your field of business, excluding these people, or making it significantly more difficult for them to work with you, may actually be illegal. There is no excuse, in $CURRENT_YEAR, for not just putting in the work to understand Unicode and handle it properly.

                    1. 6

                      I was a speaker at an O’Reilly conference once, and they had difficulty printing a badge for me, because my last name requires ISO-8859-2.

                    2. 1

                      Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English

                      Don’t say that. If people care as they should, the real question is what «supporting unicode» necessitates for what your particular software does. For a lot of software, it suffices to be encoding agnostic.

                      A good example is filenames on Unix. They are a string of bytes, and it is in fact wrong to assume any encoding for them. As an example, all my files from the pre-unicode era (early 2000s) are of course not valid UTF-8, despite my modern locale being UTF-8.
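
                      A minimal sketch of what encoding agnostic can look like in Python, assuming a POSIX system: passing a `bytes` path to `os.listdir` returns the names as raw `bytes`, so nothing has to be decoded. The helper names are hypothetical.

                      ```python
                      import os

                      def is_valid_utf8(name: bytes) -> bool:
                          """Return True if the raw filename bytes happen to be valid UTF-8."""
                          try:
                              name.decode("utf-8")
                              return True
                          except UnicodeDecodeError:
                              return False

                      def undecodable_names(directory: bytes = b".") -> list[bytes]:
                          """List entries whose names are not valid UTF-8, without decoding them."""
                          return [n for n in os.listdir(directory) if not is_valid_utf8(n)]

                      # The same name as written under a Latin-1 locale and under UTF-8.
                      print(is_valid_utf8(b"caf\xe9.txt"))      # Latin-1 bytes
                      print(is_valid_utf8(b"caf\xc3\xa9.txt"))  # UTF-8 bytes
                      ```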

                      1. 3

                        A filename is not an unencoded blob, even if filesystem implementors might want it to be for simplicity. Filenames are displayed to, and typed in by, users. That requires an encoding. If the filesystem doesn’t specify one, it’s just passing the buck to higher level software, which then has to deal with inconsistencies or undisplayable names. I’m sure this is great “fun” when it goes wrong.

                        For comparison, macOS — which is a Unix, albeit one that actually prioritizes UX — requires filenames to be UTF-8 in a specific canonical form (so that accents and ligatures are encoded in a consistent way.)
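
                        To illustrate why a canonical form matters, a small sketch using Python's `unicodedata`: the same visible name can be two different byte sequences unless one normalization form is enforced. NFC and NFD are used here only as the two standard candidates; this says nothing about which one a given filesystem picks.

                        ```python
                        import unicodedata

                        name = "café"
                        composed = unicodedata.normalize("NFC", name)    # é as one code point
                        decomposed = unicodedata.normalize("NFD", name)  # e + combining acute accent

                        # Identical on screen, different at the byte level.
                        print(composed.encode("utf-8"))
                        print(decomposed.encode("utf-8"))
                        print(composed == decomposed)
                        ```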

                        1. 2

                          In terms of specs and standards, the only thing all Unix filenames are required to have in common is that they are byte sequences. No Unix-wide standard exists which requires that filesystems use a consistent encoding, nor that requires the system to have or be able to tell you about such an encoding, nor even that there be any encoding at all. It’s true that many modern Unixes do in fact have a consistent encoding and ways to discover what that encoding is (or just default to UTF-8 and allow tooling to do the same), but there is, again, absolutely no requirement that they do so.

                          This is why, for example, in Rust there’s a completely separate string-ish type for handling file paths, which drops all guarantees of being valid UTF-8 (as all other Rust strings are) or even of being decodable at all. This is why, when Python 3 went to Unicode as its string type, there was a long and difficult process that finally led to the surrogateescape error handler (which allows Python to represent file paths in its string type while losslessly preserving non-decodable bytes and byte sequences). And so on and so forth.
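
                          A minimal sketch of the Python side, applying the standard `surrogateescape` error handler directly to a byte string (the filename is an illustrative example):

                          ```python
                          raw = b"DJ Ti\xebsto.mp3"  # Latin-1 bytes, not valid UTF-8

                          # Undecodable bytes become lone surrogates (U+DC80..U+DCFF)
                          # instead of raising an error.
                          as_str = raw.decode("utf-8", errors="surrogateescape")

                          # Encoding with the same handler restores the original bytes exactly.
                          round_tripped = as_str.encode("utf-8", errors="surrogateescape")

                          print(ascii(as_str))
                          print(round_tripped == raw)
                          ```

                          On POSIX systems `os.fsdecode` and `os.fsencode` apply the same handler. The resulting string cannot be printed or encoded strictly, which is why the sketch prints it with `ascii()`.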

                          And it’s not really a problem that can be solved by any popular language or tool, because while there are not that many people who actually use non-decodable byte sequences in practice, the people who do tend to be aggressively vocal about “the specs allow me to do it, so you are required to support it forever” and making maintainers’ lives unpleasant until they finally give in.

                          1. 2

                            Reading between the lines a bit, it seems you agree with me that this is a really bad situation. Which contradicts @anordal’s assertion that “it suffices to be encoding agnostic”.

                            1. 1

                              I’m simply making a factual assertion about Unix filenames, which – if you intend to be portable – indeed must be treated as opaque possibly-undecodable bags of bytes, because that’s the only thing you can rely on without running into someone waving an ancient spec and saying “but technically I’m allowed to do this…”

                              1. 1

                                Maybe I should have elaborated with an example, because it depends on the core functionality of the software. For example, if we are implementing the mv command, the core functionality – the definition of “correct” if you will, is that renaming even broken filenames works.

                                The situation is that people have gigabytes upon gigabytes of files named to the tune of “DJ Ti�sto” and other mojibake, and we have to deal with it.

                                Okay, even the mv command might want to print a filename if something is wrong, but doing this 100% correctly is an unimportant niche feature – the exit status is the real API, strictly speaking. As with any feature, we must prioritize this as time and space permit: for human consumption, it suffices to write those bytes to the terminal or do some lossy conversion to make it valid if that keeps it simple. In another case, the concern might even be difficulty and correctness itself (as in the original post). So yes, even in this example, you have a point! Just saying that if it doesn’t count towards the program’s correctness, other tradeoffs may take precedence.
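
                                A sketch of that split in Python, with hypothetical helper names: the rename itself works on raw bytes, and only the error message goes through a lossy conversion.

                                ```python
                                import os
                                import sys

                                def display_name(name: bytes) -> str:
                                    """Lossy conversion for humans: invalid bytes become U+FFFD."""
                                    return name.decode("utf-8", errors="replace")

                                def rename(src: bytes, dst: bytes) -> int:
                                    """Rename using raw bytes; return an exit status like mv would."""
                                    try:
                                        os.rename(src, dst)
                                        return 0
                                    except OSError as exc:
                                        # Only the diagnostic needs text; the operation never did.
                                        print(f"cannot move {display_name(src)!r}: {exc.strerror}",
                                              file=sys.stderr)
                                        return 1
                                ```

                                The correctness of the rename never depends on `display_name`; a garbled diagnostic is the worst case.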