1. 15
  1.  

  2. 33

    As a software engineer, Unicode puts a lot of complexity on my table and much of that I really wouldn’t need.

    The “I” in that sentence is significant. Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English (or computer languages/syntaxes based on English.) There are a lot of equivalent complaints about e.g. date/time formats, or all the different schemas for people’s names and addresses. And you can go further, by complaining about having to deal with accessibility.

    (There’s a joke that emoji were invented as a way to make American programmers finally have to give a shit about Unicode.)

    1. 17

      This prompted me to write a post about encoding issues in pre-Unicode Russia.

      1. 3

        Thanks for this. I had no idea KOI8-R was in a different order than the actual alphabet.

        1. 3

          Great article! I just submitted it.

        2. 13

          The “I” in that sentence is significant. Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English (or computer languages/syntaxes based on English.)

          And not just languages other than English, but also everything that’s not, specifically, of English origin, which is actually a significant hurdle even to native English speakers in English-speaking countries.

          It’s very easy to underestimate how important this is if you’re only writing (most types of) software, especially if you’re only writing programmer-facing software. Many years ago I also thought ASCII was actually very much sufficient for a long time, even though I’m not a native English speaker and my own name is spelled with a stupid character that isn’t in CP437. I was happy to just write it without the diacritics, it was easy to figure out.

          Well… it’s not quite like that everywhere. If you’re doing history or any of its myriad of “auxiliary sciences”, from archaeology to numismatics and from genealogy to historiography, if you’re doing philosophy, linguistics (obviously!), geography, philology of any kind, you’re going to hit Unicode pretty quickly. Even if you’re doing it all in English, for an English-speaking audience, and even if you’re not doing it in any “pure” form – there’s lots of software these days that powers museums (from administration/logistics to exhibitions), there’s GIS software of all kind, there’s software for textual analysis and Bible study and so on. That gives you a huge amount of software that needs to be Unicode-aware, from word processors and email clients to web browsers and graphics programs.

          I remember pre-Unicode software that handled all this. I never seriously used it, it was kind of before my time, but I could tell it was not just extraordinarily awkward to use, it was extremely brittle. You’d save something on one computer, email it or give the floppy to a colleague, and all the ăs and îs would show up as ⊵s and ☧s.

          And that’s leaving out “mundane” things like software that handles patient records and which needs to be aware of these things because you never know when it’ll get big and it’ll be really embarrassing when extending into the Chinese market is going to need a massive development and testing budget because you built a unicorn on software that only recognizes like 30 letters or so. It’s a huge, real-life problem, even for software that’s built specifically for native English speakers who live in English-speaking countries.

          1. 3

            Because my bank uses TIS encoding on their back-end despite ‘supporting’ English, I can’t send support messages with unsupported characters like commas, or dashes, or colons, etc. Restricting yourself to ASCII creates exactly the same kind of issue.

            1. 2

              Have you read the addendum at the end of the post?

              Some people read this as argument against Unicode, which it is not. I don’t want to come back to ISO-8859-1, because that sucks. Also, I’m ready to deal with lot of complexity to allow people write their names properly.

              etc

              1. 1

                Kudos to the author for clarifying their position. Thanks for bringing this to our attention.

              2. 1

                With encodings like ISO-8859-1 you can cover most European languages, which includes most (all?) of the Americas and a chunk of Africa and Asia as well; you don’t need to be stuck with ASCII from 1966, and in practice most software hasn’t been since the 90s and the effort to make things “8-bit clean”.

                For a surprising number of businesses/software, it turns out you don’t really need Unicode to allow people to enter all the reasonable content they need to enter, and since Unicode is somewhat complex, it turns out a lot of “Unicode clean” solutions aren’t actually correct in a whole list of edge cases.

                1. 8

                  Immigrants from other countries might disagree; if you’re a Vietnamese-American or Turkish-German, 8859-1 may not allow you to spell your own name correctly.

                  It’s also missing a lot of common typographic niceties like typographer’s (curly) quotes and, IIRC, em-dashes. Which is a deal-breaker in some markets, like publishing. It also messes up text that was written in any word processor (or even Apple’s TextEdit) with smart quotes enabled.

                  Really, UTF-8 is pretty easy to work with, and modern string libraries handle Unicode well, so I don’t see any compelling reason to stick with an older limited encoding.

                  1. 7

                    When you make the decision to do only ISO-8859-1, you are also – whether you explicitly think about it or not – making the decision to exclude from your user/customer base everyone on earth whose preferred language, or even just name, uses script(s) not covered by it. Even in the US or Europe this will exclude significant numbers of people; for example, in the US there are two languages with more than one million resident speakers which do not use any form of Latin script, and a third which requires diacritics not supported by ISO-8859-1.

                    And depending on your field of business, excluding these people, or making it significantly more difficult for them to work with you, may actually be illegal. There is no excuse, in $CURRENT_YEAR, for not just putting in the work to understand Unicode and handle it properly.

                    1. 6

                      I was a speaker at an O’Reilly conference once, and they had difficulty printing a badge for me, because my last name requires ISO-8859-2.

                    2. 1

                      Seeing Unicode as an unnecessary imposition is basically an admission that you only care about English

                      Don’t say that. If people care, as they should, the real question is what «supporting Unicode» actually requires of your particular software. For a lot of software, it suffices to be encoding agnostic.

                      A good example is filenames on Unix. They are a string of bytes, and it is in fact wrong to assume any encoding for them. As an example, all my files from the pre-unicode era (early 2000s) are of course not valid UTF-8, despite my modern locale being UTF-8.

                      1. 3

                        A filename is not an unencoded blob, even if filesystem implementors might want it to be for simplicity. Filenames are displayed to, and typed in by, users. That requires an encoding. If the filesystem doesn’t specify one, it’s just passing the buck to higher level software, which then has to deal with inconsistencies or undisplayable names. I’m sure this is great “fun” when it goes wrong.

                        For comparison, macOS — which is a Unix, albeit one that actually prioritizes UX — requires filenames to be UTF-8 in a specific canonical form (so that accents and ligatures are encoded in a consistent way.)

                        1. 2

                          In terms of specs and standards, the only thing all Unix filenames are required to have in common is that they are byte sequences. No Unix-wide standard exists which requires that filesystems use a consistent encoding, nor that requires the system to have or be able to tell you about such an encoding, nor even that there be any encoding at all. It’s true that many modern Unixes do in fact have a consistent encoding and ways to discover what that encoding is (or just default to UTF-8 and allow tooling to do the same), but there is, again, absolutely no requirement that they do so.

                          This is why, for example, in Rust there’s a completely separate string-ish type for handling file paths, which drops all guarantees of being valid UTF-8 (as all other Rust strings are) or even of being decodable at all. This is why, when Python 3 went to Unicode as its string type, there was a long and difficult process that finally led to the surrogateescape error handler (which allows Python to represent file paths in its string type while losslessly preserving non-decodable bytes and byte sequences). And so on and so forth.
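
                          To make that concrete, here is a minimal (Unix-only) sketch of what that separate path type buys you, using only the standard library’s Unix extension trait: a byte sequence that isn’t valid UTF-8 is still a perfectly legal file name, it round-trips losslessly, but it can’t become an ordinary string without an explicit, lossy decision.

                          // Minimal sketch (Unix-only): a non-UTF-8 byte is still a legal file name.
                          use std::ffi::OsStr;
                          use std::os::unix::ffi::OsStrExt;
                          use std::path::PathBuf;

                          fn main() {
                              // 0xE9 is 'é' in ISO-8859-1, but on its own it is not valid UTF-8.
                              let raw: &[u8] = b"DJ Ti\xE9sto.mp3";
                              let path = PathBuf::from(OsStr::from_bytes(raw));

                              // The path round-trips losslessly as bytes...
                              assert_eq!(path.as_os_str().as_bytes(), raw);
                              // ...but it is not representable as a &str at all.
                              assert!(path.to_str().is_none());
                              // Any conversion to a displayable string is an explicit, lossy choice.
                              println!("{}", path.to_string_lossy()); // prints "DJ Ti�sto.mp3"
                          }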

                          And it’s not really a problem that can be solved by any popular language or tool, because while there are not that many people who actually use non-decodable byte sequences in practice, the people who do tend to be aggressively vocal about “the specs allow me to do it, so you are required to support it forever” and making maintainers’ lives unpleasant until they finally give in.

                          1. 2

                            Reading between the lines a bit, it seems you agree with me that this is a really bad situation. Which contradicts @anordal’s assertion that “it suffices to be encoding agnostic”.

                            1. 1

                              I’m simply making a factual assertion about Unix filenames, which – if you intend to be portable – indeed must be treated as opaque possibly-undecodable bags of bytes, because that’s the only thing you can rely on without running into someone waving an ancient spec and saying “but technically I’m allowed to do this…”

                              1. 1

                                Maybe I should have elaborated with an example, because it depends on the core functionality of the software. For example, if we are implementing the mv command, the core functionality – the definition of “correct”, if you will – is that renaming even broken filenames works.

                                The situation is that people have gigabytes upon gigabytes of files named to the tune of “DJ Ti�sto” and other mojibake, and we have to deal with it.

                                Okay, even the mv command might want to print a filename if something is wrong, but doing this 100% correctly is an unimportant niche feature – the exit status is the real API, strictly speaking. As with any feature, we must prioritize this as time and space permit: for human consumption, it suffices to write those bytes to the terminal, or to do some lossy conversion to make it valid if that keeps it simple. In another case, the concern might even be difficulty and correctness itself (as in the original post). So yes, even in this example, you have a point! I’m just saying that if it doesn’t count towards the program’s correctness, other tradeoffs may take precedence.
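
                                Something like this rough Rust sketch of an mv-style tool illustrates what I mean: the paths are never decoded at all, so renaming “broken” filenames just works, and only the (optional) error message involves a lossy conversion.

                                // A tiny "mv"-like sketch: paths stay as raw OsStrings and are never decoded.
                                use std::env;
                                use std::fs;
                                use std::process::ExitCode;

                                fn main() -> ExitCode {
                                    // args_os() keeps the raw OsString arguments; args() would panic on
                                    // non-UTF-8 input.
                                    let mut args = env::args_os().skip(1);
                                    let (src, dst) = match (args.next(), args.next()) {
                                        (Some(s), Some(d)) => (s, d),
                                        _ => return ExitCode::from(1),
                                    };
                                    match fs::rename(&src, &dst) {
                                        Ok(()) => ExitCode::SUCCESS,
                                        Err(e) => {
                                            // For error output a lossy conversion of the name is fine;
                                            // the exit status is the real API here.
                                            eprintln!("mv: {}: {}", src.to_string_lossy(), e);
                                            ExitCode::from(1)
                                        }
                                    }
                                }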

                      2. 14

                        This article sounds like it was written in 2002 before there were good String types in every modern programming language. If the fact that characters can take more than one byte is a problem for you, perhaps you should consider a tool that lets you program at a higher level of abstraction.

                        Also unstated: how Unicode helps to improve your software. As in “supports all the world’s languages”. Given the author’s own affiliation with the University of Jyväskylä I think he’d appreciate support for something other than ASCII himself. Related: an oral history of how emoji has driven proper Unicode support beyond Plane 0.

                        1. 2

                          Given the author’s own affiliation with the University of Jyväskylä I think he’d appreciate support for something other than ASCII himself.

                          This is why they’re talking about ISO-8859

                        2. 8

                          Try to talk about Unicode without actually setting an encoding scheme. Try to talk about a “String” type without actually defining what it means.

                          1. 6

                            “Oh, a string is just an array of characters.”

                            “What’s a character?”

                            “Oh, just a letter or number or punctuation mark. You know.”

                            1. 1

                              A string is a sequence of code points. Or a sequence of grapheme clusters.

                              There are popular, widely-used programming languages which adopt one of these as the abstraction exposed to the programmer for the string type. There are languages which don’t, and which insist that a “string” is really just a special case of a byte array. It seems likely you prefer the latter, but that doesn’t cause the former to stop existing or stop working.
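
                              For what it’s worth, here is a minimal Rust sketch of those abstractions side by side; the grapheme view assumes the third-party unicode-segmentation crate is available.

                              // One string, three different "lengths" depending on the abstraction.
                              use unicode_segmentation::UnicodeSegmentation; // third-party crate

                              fn main() {
                                  // "café" written with 'e' + U+0301 (combining acute accent).
                                  let s = "cafe\u{0301}";

                                  assert_eq!(s.len(), 6);                   // bytes (UTF-8)
                                  assert_eq!(s.chars().count(), 5);         // code points
                                  assert_eq!(s.graphemes(true).count(), 4); // grapheme clusters
                              }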

                            2. 5

                              I’m not really sure how the problems faced by this developer are the fault of Unicode in particular.

                              Instead, it feels as if they want to keep coding the way they have (basic protection against SQL injection, having unique usernames as opposed to (probably) numeric IDs, only accommodating users using at most ISO-8859-15[1] as a character set…).

                              So the author latches upon Unicode as a proxy for other complexities that they feel they have to deal with but don’t really want to bother with.

                              [1] the author’s name seems Finnish to me, so I’d assume this is what they’d need in Finland.

                              1. 4

                                I sometimes joke that “Latin unification” would’ve been a good idea in Unicode, in order to prevent the issue where (say) a Cyrillic “a” looks identical to an ASCII “a”, even though they are assigned different Unicode code points because they are in an important sense members of different alphabets. It happens that in the font used on that website on my machine, the two p’s in "tyрeablе" == "typeable" do look different, I assume because I haven’t tried to configure Cyrillic fonts at all on this computer yet (there are a number of non-Latin scripts I can read and care about my computer displaying correctly, but Cyrillic is not one of them).

                                Of course even if Unicode had been designed such that (say) Latin, Greek, and Cyrillic “a” all used the same codepoint, it would be difficult to make it so that every single Unicode character “canonically” looks different. And that would be a fool’s errand anyway, since nothing actually stops an individual font-designer from making the glyphs for (say) k and the Kelvin symbol identical, or making two emoji look exactly the same, or any other graphical decision that could lead to confusion when an end-user is reading text rendered with that font.

                                It’s also worth noting that this article glosses over the distinction between “Unicode” and “UTF-8”. UTF-8 is the most common way to encode sequences of Unicode code points as bytes, but it’s not the only way. If you really want to avoid variable-length encoding issues, you could encode your text in UTF-32, where every code point is a fixed 4 bytes long. Wasteful, perhaps, for texts primarily using ASCII characters, but you could do it if you want.
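
                                A small Rust sketch of the confusables point, using exactly that pair of strings:

                                // The two strings render (near-)identically but are not equal, because
                                // two of the letters are Cyrillic code points rather than Latin ones.
                                fn main() {
                                    let latin = "typeable";
                                    let mixed = "tyрeablе"; // 'р' is U+0440 and 'е' is U+0435 (Cyrillic)

                                    assert_ne!(latin, mixed);

                                    // Printing the differing code points makes the mismatch visible.
                                    for (a, b) in latin.chars().zip(mixed.chars()) {
                                        if a != b {
                                            println!("U+{:04X} vs U+{:04X}", a as u32, b as u32);
                                        }
                                    }
                                    // Prints: U+0070 vs U+0440, then U+0065 vs U+0435
                                }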

                                1. 3

                                  I know you’re joking but in case anyone’s considering this seriously: there are several good reasons why this didn’t happen, including:

                                  • Allowing for correct case conversion. E.g. Basic Latin A’s lowercase is a, but Greek A’s lowercase is α. You need to know “which A” we’re talking about in order to get to its corresponding lowercase.
                                  • Preserving ordering and thus allowing us to easily sort things alphabetically. In Basic Latin, Z is the last letter of the alphabet, but in Greek it is the sixth letter, right after E (no kidding) and right before H (which is nothing like the Latin H; it’s actually a vowel). Both points are illustrated in the short sketch below.

                                  (Edit: I’m not sure why it’s like that, I’m not a native Greek speaker, either, and I barely speak any for that matter)
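
                                  Here is that sketch, a minimal Rust illustration using the standard library’s default (locale-independent) case mapping and plain code-point ordering:

                                  // "Which A" matters: Latin and Greek capital A are distinct code points,
                                  // so they can lowercase differently; and within each alphabet, code-point
                                  // order matches alphabetical order.
                                  fn main() {
                                      assert_eq!("A".to_lowercase(), "a"); // Latin capital A, U+0041
                                      assert_eq!("Α".to_lowercase(), "α"); // Greek capital Alpha, U+0391

                                      // Greek E (Epsilon), Z (Zeta), H (Eta) sort in Greek alphabetical order.
                                      assert!('Ε' < 'Ζ' && 'Ζ' < 'Η'); // U+0395 < U+0396 < U+0397
                                  }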

                                  1. 3

                                    Right, this is a design decision with tradeoffs, just like the design decision to unify Han characters used across different east asian countries was a design decision with tradeoffs. Even with the system as it was decided upon, there are issues with case-conversion across different languages (like how to handle German ß, which as I understand it is used differently depending on whether you’re in Germany or Switzerland, and which has had several orthographic reforms post-dating the earliest versions of Unicode). The rules for alphabetical ordering as well differ from language to language, including between languages using variants of the Latin script, and those rules have been subject to orthographic reform in various countries in recent decades too.

                                    (Edit: I’m not sure why it’s like that, I’m not a native Greek speaker, either, and I barely speak any for that matter)

                                    Just today the latest episode of Word Safari dropped, wherein historical linguists Luke Gorton and Jackson Crawford discuss the history of the western alphabet over the timespan from the Phoenicians to the early Roman empire. As it happens, they answer exactly those questions about why Z is in different positions in the Greek and Latin alphabets, and why the H grapheme is different between Greek and Latin.

                                    1. 1

                                      As it happens, they answer exactly those questions about why Z is in different positions in the Greek and Latin alphabets, and why the H grapheme is different between Greek and Latin.

                                      Oh, that’s so cool! I thought it might have something to do with that but I never really looked into it, one of the cool things about being just a history buff instead of an actual historian is that it’s a lot easier to restrict your study to the things you’re really curious about, and this one just never made it high enough on the list :-D.

                                  2. 2

                                    The Soviet version of the Ada-83 standard actually mandated that in the source code, any two letters that look the same in Latin and Cyrillic must be treated as the same letter. I’m pretty sure it made the life of lexer writers much more interesting, but I’m not sure if it solved any real issues.

                                    In any case, I think preventing identical-looking names is a non-issue. The correct solution to username impersonation is pictogram or color coding of important usernames. The solution to attempts to trick admins into banning the wrong person is to allow admins to ban the post author rather than make them search by username.

                                    The solution to domain and package name squatting is… I’m not sure really, disallowing identical-looking names isn’t enough. Nothing is going to be foolproof, but one option is to search for names with very small edit distance and if there are any, show the user all names and their descriptions/signatures and make them explicitly choose from that list.

                                    $ packagemanager install letf-pad
                                    
                                    There are packages with similar names:
                                    
                                    left-pad | Library for padding text to the right.                        | jrandomhacker@example.net
                                    letf-pad | Totally not a typo-squatting attempt, I assure you. | honestjohn@example.com
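
                                    A rough Rust sketch of that “small edit distance” check (the registry contents and the threshold of 2 are made up for illustration):

                                    // Plain Levenshtein distance, then flag existing names close to the request.
                                    fn levenshtein(a: &str, b: &str) -> usize {
                                        let a: Vec<char> = a.chars().collect();
                                        let b: Vec<char> = b.chars().collect();
                                        let mut prev: Vec<usize> = (0..=b.len()).collect();
                                        let mut curr = vec![0; b.len() + 1];
                                        for (i, &ca) in a.iter().enumerate() {
                                            curr[0] = i + 1;
                                            for (j, &cb) in b.iter().enumerate() {
                                                let cost = if ca == cb { 0 } else { 1 };
                                                curr[j + 1] = (prev[j + 1] + 1) // deletion
                                                    .min(curr[j] + 1)           // insertion
                                                    .min(prev[j] + cost);       // substitution
                                            }
                                            std::mem::swap(&mut prev, &mut curr);
                                        }
                                        prev[b.len()]
                                    }

                                    fn main() {
                                        let registry = ["left-pad", "right-pad", "serde"];
                                        let requested = "letf-pad";

                                        // Anything within edit distance 2 of the requested name gets shown
                                        // to the user before the install proceeds.
                                        let similar: Vec<&str> = registry
                                            .iter()
                                            .copied()
                                            .filter(|name| levenshtein(name, requested) <= 2)
                                            .collect();
                                        println!("similar to {requested}: {similar:?}"); // ["left-pad"]
                                    }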
                                    
                                    1. 1

                                      It’s also important to note that “characters” don’t exist, that code points are not something that should be exposed to most programmers, and that, as a result, UTF-32 is a waste of space with little benefit for any correct algorithm.

                                      1. 1

                                        It’s also important to note that “characters” don’t exist

                                        The Unicode Standard, which frequently uses the term, would be surprised to hear this.

                                        code points are not something that should be exposed to most programmers

                                        The only – to my mind – sensible abstraction for exposing Unicode is as a sequence of some reasonably atomic unit of Unicode. Which leaves the choice of code points or graphemes. There are languages which have chosen code points, and languages which have chosen graphemes, and languages which pretend to have “Unicode” but really have byte arrays with few or no useful semantics.

                                        1. 1

                                          Graphemes are better, for sure, though lots of algorithms (case conversion being the most famous) are better done on the whole text.

                                    2. 4

                                      It’s interesting that the case-related examples use ‘artificial’ codepoints as opposed to ones that occur in natural language. For example, the German ß serves as an example both of how uppercasing can change a string length and how x.toUpper().toLower() == x is not always true. (And this isn’t even getting into how Turkish means that "i".toUpper() is locale-dependent…)

                                      Unicode is complicated because it has to model every human writing system, and human writing systems are complicated. There’s fundamental complexity here.
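
                                      A minimal Rust illustration of the ß point (Rust’s built-in case conversion is the locale-independent Unicode default, so the Turkish dotted/dotless i problem can’t even be expressed with it):

                                      // Uppercasing ß changes the length, and the round trip does not return
                                      // the original string.
                                      fn main() {
                                          let s = "straße";
                                          let upper = s.to_uppercase();

                                          assert_eq!(upper, "STRASSE");                          // ß uppercases to "SS"
                                          assert_ne!(s.chars().count(), upper.chars().count()); // 6 vs 7 code points
                                          assert_ne!(upper.to_lowercase(), s);                   // comes back as "strasse"
                                      }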

                                      1. 3

                                        Another interesting effect of Unicode is that concatenation may produce unexpected results if the strings had fragments of a grapheme cluster on their boundaries – whole new characters would appear where there weren’t any!

                                        For example: suppose we have two strings, the first containing U (U+0055, latin capital letter u) and the second containing ̈ (U+0308, combining diaeresis). Concatenating them does not give a U followed by a free-standing diaeresis, but rather Ü, a combined U and diaeresis:

                                        $ echo $first
                                        U
                                        $ echo $second
                                        
                                        $ chmap -c $first
                                        codepoint  glyph  encoded      case   description
                                           U+0055  U      55           upper  latin capital letter u
                                        $ chmap -c $second
                                        codepoint  glyph  encoded      case   description
                                           U+0308   ̈      CC 88        other  combining diaeresis
                                        $ echo ${first}${second}
                                        Ü
                                        $
                                        

                                        Two neat explanations on Unicode that I enjoyed:


                                        As an aside, why is this story tagged with historical? There’s no historical item/idea/software that’s the focus of this article.

                                        1. 3

                                          How embarrassing for the company hosting this blog.

                                          1. 2

                                            And, unfortunately, no, you cannot just ‘fix your strings’ at every use point. Some string operations are only safe to do once or you lose information or worse. You need to know and track the semantics of Strings to know what steps you need, and what steps you can’t take in the context you are working on.

                                            I see the blog has a number of Haskell posts. Perhaps Haskell has some feature that could help with this “tracking which operations are available on a value” problem?

                                            The post is useful in that it highlights wrong assumptions which programmers should avoid, but I disagree with its conclusion.
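
                                            For what it’s worth, the “track the semantics in the type” idea isn’t Haskell-specific; here is a rough Rust sketch of the same thing (the wrapper type is hypothetical, and it assumes the third-party unicode-normalization crate):

                                            // A hypothetical wrapper type: values of NfcString are known to already
                                            // be NFC-normalized, so normalization happens exactly once, at the boundary.
                                            use unicode_normalization::UnicodeNormalization; // assumed third-party crate

                                            struct NfcString(String);

                                            impl NfcString {
                                                fn new(raw: &str) -> NfcString {
                                                    NfcString(raw.nfc().collect())
                                                }
                                                fn as_str(&self) -> &str {
                                                    &self.0
                                                }
                                            }

                                            fn main() {
                                                // 'e' + U+0301 (combining acute) normalizes to the single code point é.
                                                let name = NfcString::new("Ange\u{0301}");
                                                assert_eq!(name.as_str(), "Ang\u{00e9}");
                                            }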