1. 46
  1. 21

    Old Soviet systems mostly used an encoding named KOI8-R. … That encoding is, to put it politely, mildly insane: it was designed so that stripping the 8th bit from it leaves you with a somewhat readable ASCII transliteration of the Russian alphabet, so Russian letters don’t come in their usual order.

    That is both horrifying, and fiendishly clever. 🤯
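
    The bit-stripping trick is easy to demonstrate with Python's built-in koi8_r codec (a minimal sketch; the case inversion is a relic of the 7-bit KOI-7 shift mechanism):

    ```python
    # KOI8-R was laid out so that clearing the 8th bit of each byte
    # leaves an ASCII transliteration of the Russian text.
    text = "привет"  # "hello"
    koi8 = text.encode("koi8_r")
    stripped = bytes(b & 0x7F for b in koi8)
    print(stripped.decode("ascii"))  # PRIWET
    ```

    Note the case flip: lowercase Cyrillic strips to uppercase Latin, which is exactly the "not in their usual order" weirdness described above.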

    1. 10

      The OG though was KOI-7, a 7-bit encoding which mapped onto ASCII in a similar fashion. Its huge advantage was legibility across mixed Latin/Cyrillic devices on a teletype network, which was a big deal for the Soviet merchant fleet at the time.

      1. 1

        This sounds clever, but all I ever actually saw was an unreadable mess of characters misinterpreted as Latin-1 (really CP-1252).

      2. 12

        I feel like the languages with completely different alphabets had it slightly better than languages with just a few additions. For the nine extra letters, Polish systems historically had a significant number of custom encodings, including apps doing their own thing (from an old converter app http://www.gzegzolka.com/?m=info):

        Mazovia, FIDO Mazovia, CSK, Cyfromat, Dom Handlowy Nauki (DHN), IINTE-ISIS, TAG, Instytut Energii Atomowej w Świerku, Logic, Microvex, Corel Ventura, Elwro Junior CP/J, Amiga PL, TeX PL, Atari Calamus, CorelDRAW! 2.0, ATM, Amiga XPJ

        1. 11

          My favorite quirky encoding was AmigaOS’s “Priest Jan Pikula” character encoding. There was one guy, who happened to be a priest, who was very prolific in writing printer drivers. If you wanted to print something in Polish, you had to use his encoding.

          There was also a unique Polish encoding for TV Teletext.

          1. 3

            I’m Norwegian. We use the latin alphabet, but with three additions: æ/Æ, ø/Ø, and å/Å.

            My name contains an “ø”.

            The number of times I’ve had that one “ø” turn into gibberish is actually staggering. Many systems simply don’t accept text which doesn’t match ^[a-zA-Z0-9 ]*$. Thankfully, as Unicode has become more and more widespread over the decades, the odds of being able to enter my name into computer systems, even American-made ones, have steadily improved.

          2. 8

            Same shit with Greek. Early DOS versions followed a locally popular extension of the ASCII charset (437G, later 737) rather than CP869 or ISO-8859-7. Even worse, Microsoft made Windows-1253, and you could tell you had the wrong codepage by the rectangle that appeared wherever an accented capital Alpha (Ά) was used. Not to mention issues not only in filesystems and web pages but even in DBMSs. Thank God for Unicode.

            1. 1

              Even worse, Microsoft made Windows-1253, and you could tell you had the wrong codepage by the rectangle that appeared wherever an accented capital Alpha (Ά) was used.

              Kind of like with short-lived SSL certificates, I think I prefer it to be immediately obvious that I’m using the wrong codepage rather than have it break in production with untested input. Yes, it makes it annoying to deal with when it’s not done right, but that’s not a bad thing.

            2. 2

              I still remember the iconv commands for transcoding Russian texts by heart. I used to be really good at guessing encoding from garbled output, by sight.
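
              The guessing game is easy to recreate in Python (a sketch; cp1251, cp866, and the Mac Cyrillic codepage were the other usual suspects alongside KOI8-R):

              ```python
              # The same KOI8-R bytes, misread under other common Cyrillic codepages:
              data = "привет".encode("koi8_r")
              for wrong in ("cp1251", "cp866", "mac_cyrillic"):
                  print(wrong, "->", data.decode(wrong))
              # each codepage produces its own characteristic flavor of garbage,
              # which is what made guessing-by-sight possible
              ```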

              There could be a whole other blog post on Russian keyboard layouts alone.

              1. 2

                It gets even more complicated for languages like Ukrainian, whose alphabet is Cyrillic but not exactly Russian (і, ї, ґ, є, and no ы, ъ, э). CP1251 at least was designed from the start to accommodate Ukrainian and Belarusian as well, whereas KOI8-R had to be modified into KOI8-U to support those letters.

                Thus, on top of the mess of Cyrillic encodings, there was also a mess of mostly-working Cyrillic text with slightly wrong letters (you can still see this today with fonts that only cover the Russian subset of Cyrillic).
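
                The KOI8-R/KOI8-U gap is easy to see with Python's codecs (a small sketch; ї is one of the Ukrainian-only letters mentioned above):

                ```python
                # Ukrainian ї has a KOI8-U code point but no KOI8-R one:
                print("ї".encode("koi8_u"))  # b'\xa7'
                try:
                    "ї".encode("koi8_r")
                except UnicodeEncodeError:
                    print("ї is not representable in KOI8-R")
                ```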

                1. 1

                  Mojibake is still with us. I just signed up for Mail Brew, the newsletter service. I follow some Japanese writers on Twitter, and when it tries to summarize the pages they link to, it gets tripped up because the pages are Shift JIS instead of UTF-8 and so it sends me gibberish. Fun times!
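
                  That failure mode is easy to reproduce (a sketch, assuming a page served as Shift JIS but decoded as UTF-8):

                  ```python
                  page = "日本語".encode("shift_jis")  # bytes as a Shift JIS page would send them
                  # a consumer that blindly assumes UTF-8 sees mostly replacement characters:
                  print(page.decode("utf-8", errors="replace"))
                  # strict UTF-8 decoding fails outright:
                  try:
                      page.decode("utf-8")
                  except UnicodeDecodeError:
                      print("not valid UTF-8")
                  ```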

                  1. 1

                    Japanese has its own problems. If I understand correctly, there’s only one set of Han characters in Unicode, even though there are three different writing systems - the Chinese Hanzi, the Japanese Kanji, and the Korean Hanja - which all use different variations of Han characters. That means it’s up to the font to decide whether to draw Han characters as Chinese text, Japanese text, or Korean text. You may think you’re writing a Japanese e-mail, but the recipient might receive an e-mail with Chinese characters instead if their system happens to use a font with Chinese variations of Han characters. You also can’t mix Japanese and Chinese within one text.

                    Given the state of things, I’m not too surprised if Unicode hasn’t taken off as much in countries with Han-based writing systems.

                    1. 3

                      If I understand correctly, there’s only one set of Han characters in Unicode, even though there are three different writing systems

                      This is a very common “smart” take on Unicode and it’s absolutely wrong. We Roman alphabet users don’t have separate 0s for zero with the slash and zero without the slash. We don’t have different 7s for seven with the mid-line and seven without the mid-line. We don’t have different a’s for single storey a and double storey a. Same for g. Etc. Japanese specifically used more-Chinese like fonts before WWII. I’ve read prewar books. They look a hell of a lot like books from contemporary Taiwan except they’re in Japanese. Should we use one set of codepoints for prewar books and a separate set for postwar? It’s ridiculous.

                      Contemporary Japanese and contemporary (simplified Mandarin) Chinese use different fonts. Sometimes the ways you write a specific character are so different that an uneducated person might not realize they are the same. Nevertheless, Unicode has ways to represent most of these variations, and for the other ones you just have to specify the proper font in your style sheet or whatever. Han unification is a good idea, and only hyper-nationalists and people who don’t understand encodings are upset by it.

                      You also can’t mix Japanese and Chinese within one text.

                      This is just wrong. Anyone who is able to actually read both Chinese and Japanese will be able to understand both font conventions. If you can’t understand the other font conventions, you’re not actually literate in the other language. There also isn’t one convention you would want to use everywhere. If a text is 95% one language, you probably want the little bits of the other language to be in the same font. OTOH, if you use 50-50 of the two languages, you probably want to specify different fonts. It’s a design decision that has nothing to do with encoding.

                      The actual problems with encodings in Asia are that there were a number of them in use pre-Unicode and moving to something new is difficult and takes time. Plus there is a lot of nationalist misinformation of the kind that you are repeating. There are also particular problems like that in Shift JIS, they replaced backslash with ¥, which in turn meant that Windows presents paths on Japanese systems as being C:¥Windows¥. To Japanese users, this is what a path “should” look like, so they don’t like that Unicode actually does the right thing and distinguishes between backslash and yen mark! Similarly, old Japanese encodings managed to squeeze a stripped down half-width katakana encoding into 8-bits, so some old systems expect to this day to receive half-width katakana, and switching it to take up a different number of bytes would require a rewrite. Etc.
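
                      The half-width katakana point is visible with Python's codec (a sketch; note that Python's shift_jis codec decodes byte 0x5C as a backslash, so the ¥/backslash confusion is a font/display convention rather than a different byte value):

                      ```python
                      # Half-width katakana occupy single bytes 0xA1-0xDF in Shift JIS:
                      print(b"\xb1\xb2\xb3".decode("shift_jis"))  # ｱｲｳ
                      # full-width katakana take two bytes each:
                      print(len("アイウ".encode("shift_jis")))      # 6
                      # in UTF-8 the half-width forms grow to three bytes apiece,
                      # which is the size change legacy systems choke on:
                      print(len("ｱｲｳ".encode("utf-8")))           # 9
                      ```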

                      1. 1

                        Hmm, do you have personal experience with the cultures you’re talking about? Or, on what authority do you speak about the subject? I don’t read or write any of these writing systems, but it’s clear that some of the characters are very visually distinct. How do you know that all the differences are just on the magnitude of the difference between a 7 with and without a mid-line, and that none of the differences are more substantial?

                        You give the example of a 0 with and without a cross, and yeah, that’s a small difference. But in my language, an O with and without a cross is a huge difference. So even though I can read a text with all the Øs replaced with Os and all the Ås replaced with As, it would be clear that the text isn’t really meant to be in my language. How do I know that Japanese people don’t feel the same way when they’re reading Chinese characters?

                        1. 1

                          Han unification is more akin to merging Ø with Ö. It would still look strange, but wouldn’t lose any information.

                          However, there are usually enough heuristics for your computer to tell what form (font) to use, or it will use your preferred form where they are interchangeable.

                          1. 2

                            If Han unification is akin to merging Ø with Ö, that’s absolutely terrible. They’re not the same character at all. I don’t see an Ö and read it as Ø; I see an Ö and read it as some Swedish character.

                            I need to be able to write texts where I can know that the recipient will see some kind of Ø. If I ever write texts with Swedish words mixed in with Norwegian, I would need to know that the recipient will see Swedish words with Ö. If merging Ø with Ö is remotely close to the severity of Han unification, then my stance on Han unification remains unchanged, despite /u/carlmjohnson’s various appeals to authority.

                            By the way, imagine us having this conversation using a version of Unicode with “Scandinavian unification” where Ø and Ö were in fact merged. We wouldn’t even be able to express these ideas without inventing our own names for the characters and referring to Ø as “O with slash” and Ö as “O with dots”. It would’ve been a mess - and it would’ve been clear to me that my language is a second-class citizen in the world of Unicode.

                          2. 1

                            Hmm, do you have personal experience with the cultures you’re talking about?

                            I have a PhD in Asian Philosophy and I lived in Japan for two years.

                            I don’t read or write any of these writing systems, but it’s clear that some of the characters are very visually distinct.

                            I don’t mean to sound snobby, but you’re illiterate, so what they look like to you isn’t really relevant. The point is that Unicode was made by professionals, so despite whatever rumors about Han Unification being bad float around the programming web, the fact is that it was done well, and that no character that anyone cares about was harmed in the process. There are some borderline cases, and there are tons of obscure Buddhist characters that aren’t in any encodings, Unicode or otherwise, but for the actual characters in Shift JIS or GB 2312 or ISO 2022, the process of unification worked well.

                            Diacritics are a big deal in European languages. If Unicode had merged o and ø, yeah, that would be a problem, even though heavy metal umlaut is just a joke in English. It would also be bad if Unicode had merged 学 and 學. But the Unicode Consortium are professionals and the East Asian countries have a lot of language prestige (BTW, the languages that are actually poorly represented in Unicode are the low status South Asian languages), so they didn’t mess it up.

                            1. 1

                              Maybe you know about these poorly represented cases, like Bengali, and what went down?

                              This link https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name popped up years ago, and I mentioned it to a linguist at a seminar. She seemed like a person who’d have some proper insight, and told me the contact people for Bengali weren’t really cooperative. She also opined it’s a bit unfair to attribute malice to white men like that. I never got proper deets, though.

                              Still that made me think there could actually be a story behind every seemingly 💩 decision.

                              1. 2

                                Found the blog post I read from a Microsoft engineer back in the day that goes into some of the problem with Indian encodings: http://archives.miloush.net/michkap/archive/2007/12/02/6639141.html

                                Another post with an interesting quote:

                                The strong feelings of native language speakers within India versus experts outside of it is an issues I have talked about a great deal in the past and plan to talk about more in the future, and it again underscores the same kind of issue. To which I’ll add that providing only the INSCRIPT keyboards and ISCII code pages as provided by the Government of India appears to show a certain specific degree of disinterest in needs of the many languages in country, if not outright disdain – like someone else’s agenda being pushed, however indirectly.


                                1. 1

                                  Really cool stuff! Definitely good to know if this ever comes up again :)

                                2. 2

                                  Another interesting link: https://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange

                                  The Brahmi-derived writing systems have similar structure. So ISCII encodes letters with the same phonetic value at the same code point, overlaying the various scripts. For example, the ISCII codes 0xB3 0xDB represent [ki]. This will be rendered as കി in Malayalam, कि in Devanagari, as ਕਿ in Gurmukhi, and as கி in Tamil. The writing system can be selected in rich text by markup or in plain text by means of the ATR code described below.

                                  One motivation for the use of a single encoding is the idea that it will allow easy transliteration from one writing system to another. However, there are enough incompatibilities that this is not really a practical idea.

                                  Unicode tends to just piggyback off of existing encodings. So India made up an 8-bit straitjacket that privileged Hindi (Devanagari), and Unicode basically wasn’t in a position to fix that fundamentally bad starting point, so things have remained bad.

                                  1. 1

                                    I don’t know much about the specifics of what went wrong, but from the outside, programming-rumor-monger perspective, it seems like the core problem is that within India, Hindi and English are prestigious, and other languages are less prestigious, and then the Unicode consortium ended up doing a half-assed job because they couldn’t form a proper committee of people who knew what they were talking about. There was a question of whether to represent the characters based on how they looked, sounded, or compounded, and the choice they made sense made to programmers but not to native speakers or IME users. I could be wrong about this though.

                        2. 1

                          Apple hardware was almost non-existent in Russia at the time because of its price

                          Apparently it wasn’t too expensive for designers. Many websites did have a mac encoding variant…