As an Old Fart who’s been fascinated by type and typography since my teens, it’s been amazing to see the progress of computer text. My first computer didn’t even have lower case letters!
Coming from English, you might think ligatures are just fancy fluff. I mean, who really cares if “æ” is written as “ae”? Well, as it turns out, some languages are basically entirely ligatures. For instance “ड्ड بسم” has individual characters of “ड् ड ب س م”.
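A quick way to see this from Python, using the standard unicodedata module (the string is the Devanagari conjunct from the example above): what looks like a single glyph on screen is three separate code points that the shaper must fuse.

```python
import unicodedata

# The conjunct "ड्ड" renders as one fused shape, but the underlying
# string is three code points: DDA + VIRAMA + DDA.
cluster = "\u0921\u094d\u0921"  # ड्ड
for ch in cluster:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

The virama in the middle is what tells the shaping engine to produce the conjunct form rather than two side-by-side letters.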
My favorite story about this (which I may have regaled you with before, being an Old Fart) is about a meeting between Apple and Sun about Java2D, circa 1998. The type expert from Apple is describing all the tables inside TrueType fonts for things like ligatures, and how supporting them is not optional. One of the Sun guys scoffs, “yeah, but how many people really care about fancy typography like ligatures?” The Apple person gives him a Look and says “about two billion people in India, the Middle East and Asia.” [Don’t take the number literally, it’s just my recollection.]
Every time I see beautifully rendered Arabic or Devanagari or Chinese or…. in my browser it still amazes me.
I’m too young to have actually experienced it first hand, but according to the Chinese textbooks and magazines I’ve read, in the 70s a lot of people genuinely thought the Han script wouldn’t survive because it was “incompatible” with computers.
To clarify, Chinese typography is relatively straightforward - Chinese characters are fixed width, and there are no ligatures. I believe Japanese is in a similar position, hiragana and katakana are usually the same width as kanji. There are of course a lot of corner cases and advanced techniques, but putting characters on a grid probably already gets you 80% of the way there.
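The grid intuition is even baked into Unicode: every character carries an East Asian Width property, which terminals use to decide whether it occupies one cell or two. A quick check with Python's standard unicodedata module:

```python
import unicodedata

# 'W' (wide) characters take two terminal cells; 'Na' (narrow) ASCII
# takes one -- which is exactly what makes grid layout workable.
for ch in "中あカA":
    print(ch, unicodedata.east_asian_width(ch))
```

Han characters, hiragana, and katakana all report `W`, while ASCII reports `Na`, so a monospaced grid only needs to distinguish single-cell from double-cell characters.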
The challenging part back then was how to store the font, which consists of at least several thousand everyday characters, each of which needs to be at least 32 x 32 pixels to be legible.
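The arithmetic makes the problem concrete (the 7,000-character figure is my rough assumption for an everyday repertoire, not any particular standard):

```python
# Back-of-the-envelope: a 1-bit bitmap font for everyday Chinese.
chars = 7000                    # assumed everyday character count
w = h = 32                      # minimum legible bitmap size
bytes_per_char = w * h // 8     # 128 bytes at 1 bit per pixel
total = chars * bytes_per_char
print(total, total // 1024)     # total bytes, and in KiB
```

That comes to roughly 875 KiB for one face at one size, on machines whose entire address space was often 64 KiB, which is why early CJK systems needed dedicated font ROM chips.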
To be pedantic, Hangul is ligatures, but Korean has other writing systems too, for example Hanja, which are descended from Chinese characters (hence why the Han character space is referred to as CJK, for Chinese/Japanese/Korean). That gets into a fun gotcha about text rendering that this article didn’t cover: unified Han characters in the CJK space are rendered dependent on the font, which means that:
They will be rendered as their Simplified/Traditional Chinese/Hanja/Kanji representations depending on what font is used to render them, meaning external hinting needs to be used if these languages are mixed in one place
Hanja especially, but also Kanji, are used situationally and a large number are not commonly used, hence these characters may not be present in common Japanese/Korean fonts. In those cases, the cascade rules apply and a Chinese equivalent may be shown, potentially in an entirely different style.
They will be rendered as their Simplified/Traditional/Hanja/Kanji representations
Simplified and traditional characters are usually encoded separately because they are considered to be “different characters”. Only “the same character” with “regional stylistic differences” are unified. There is a lot of nuance, corner cases and mistakes.
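A quick way to see the distinction from Python (the specific characters are my own choice of examples): 国/國, the simplified/traditional pair for “country”, are encoded as separate code points, while 直 is a classic unified character whose appearance differs between Chinese and Japanese fonts even though the stored text is identical.

```python
# Simplified vs traditional pairs get *separate* code points...
print(hex(ord("国")), hex(ord("國")))   # U+56FD vs U+570B

# ...while a character whose regional differences count as "stylistic"
# is *unified* into one code point; 直 (U+76F4) is drawn with a
# different inner structure by Chinese and Japanese fonts.
print(hex(ord("直")))
```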
Still inaccurate though, because “Chinese” is not one regional variant :)
Let’s say that the same character is shared by simplified Chinese, traditional Chinese, Korean Hanja and Japanese Kanji. There’s usually only one stylistic variant in simplified Chinese [1], but up to 3 different stylistic variants in traditional Chinese: mainland China [2], Hong Kong / Macau, and Taiwan.
[1] Because it’s standardized by mainland China’s government, and is followed by Singaporean Chinese and Malaysian Chinese
[2] Traditional Chinese is still used in mainland China, mostly for academic publications that need to cite a lot of old text
I don’t remember hearing about Chinese computers in the old days — China’s tech wasn’t as advanced and there wasn’t as much trade. Old Japanese computers used only Hiragana and/or Katakana, which were fortunately limited enough in number to work in 8-bit character sets. In the mid 80s I used some HP workstations that had Katakana in the upper half of the encoding after ASCII; I remember those characters were useful in games to represent monsters or other squiggly things :)
Computers were pretty late to the party; the problems with Chinese text started with movable type. Until a few hundred years ago it was (apparently, from reading about the history of typography a long time ago) fairly common for neologisms in Chinese to introduce new ideographs. Often these would be tweaks of old ones (or compositions of parts of existing ones) to convey meaning. Movable type meant that you needed to get a new stamp made for every new character. Computers made this even worse because you needed to assign a new code point in your character set and get fonts to add glyphs. This combination effectively froze a previously dynamic character set and meant that now neologisms have to be made of multiple glyphs, allowing the script to combine the disadvantages of an ideographic writing system with the disadvantages of a phonographic one.
This combination effectively froze a previously dynamic character set and meant that now neologisms have to be made of multiple glyphs, allowing the script to combine the disadvantages of an ideographic writing system with the disadvantages of a phonographic one.
Well, this is a bit of an exaggeration.
Modern Chinese words are already predominantly two or more characters, and neologisms are typically formed by combining existing characters rather than inventing new characters. Modern Chinese is full of homophones, and if you invent a new character today and say it aloud it will almost certainly be mistaken for another existing character. (In case it isn’t clear, unlike Japanese, which has a kunyomi system, characters in Chinese have monosyllabic pronunciations as a rule; there are a few exceptions that are not very important.)
Most of the single-character neologisms are academic, where characters are written and read much more frequently than spoken and heard. All the chemical elements have single-character names, and a bunch of chemical and physical terms got their own new characters in the early 20th century. The former still happens today when new elements are named, but the latter process has already slowed down greatly before the advent of computers.
There are some non-academic, single-character neologisms that are actually spoken out. I don’t have time to fully delve into that here, but one interesting thing is that there are a lot of rarely used Chinese characters encoded in Unicode, and reusing them is a very common alternative to inventing brand new characters.
In the future, people will be amazed at how primitive our merely-32-bit Unicode was. By then every character will be a URI, allowing new glyphs and emoji to be created at whim. (This development will be closely followed by the first plain-text malware.)
So hard I decided that designing CPUs was easier than working on text layout.
Sometimes you can just ignore the SVG parts (I believe the Source Code Pro font technically contains some SVG glyphs, but in practice they aren’t actually used by websites), but in general you need to implement SVG support to draw All The Fonts.
This broke the Windows Terminal’s ability to render colour with Source Code Pro a few months back (apparently iTerm had the same bug earlier?). Older versions of the font included SVG colour tables, but they contained some nonsense. Newer versions have removed them.
There’s another approach for 6.2. You ship the fonts to the GPU as triangles approximating the beziers and use a pixel shader to transform them into antialiased images. You can do the clipping on the GPU so you end up only rasterising the subset of a glyph that fits in the box that you want to put it in. I am not sure if anyone actually does this in practice, but I really liked the paper that described it (from around 2005).
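If I'm thinking of the same paper (Loop and Blinn, 2005), the core trick is that each triangle approximating a quadratic Bezier carries texture coordinates (0,0), (0.5,0), (1,1); the GPU interpolates them to a per-pixel (u, v), and the sign of u² − v says which side of the curve the pixel is on. A CPU-side Python sketch of that pixel-shader idea, with my own names and a single curve for illustration:

```python
import math

def curve_side(u: float, v: float) -> float:
    # Implicit form of the quadratic Bezier under the canonical
    # texture-coordinate assignment: negative on one side of the
    # curve, positive on the other.
    return u * u - v

def coverage(u: float, v: float) -> float:
    # Antialias by dividing f by the magnitude of its gradient (2u, -1),
    # giving an approximate signed distance in pixels, then ramping
    # that distance into a 0..1 coverage value.
    sd = curve_side(u, v) / math.hypot(2.0 * u, 1.0)
    return min(max(0.5 - sd, 0.0), 1.0)

print(curve_side(0.25, 0.25))  # negative: this sample is inside the curve
```

Because the test is evaluated per pixel at render time, the curve stays exact at any zoom level, which is what made the approach resolution-independent.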
I recently came across some fun behaviour in PowerPoint with respect to 6.3. I am giving a talk at BlueHat IL and they sent me a template that was created in Hebrew (fun fact: PowerPoint hides the button that lets you switch between RTL and LTR modes unless you have an RTL locale installed globally in Windows). The home and end keys did really fun things in text fields: home would move the visible cursor to the left side of a text box and the insertion point to the right side. I honestly have no idea how you can design a system where the former’s position isn’t determined by the latter, but I often underestimate the creativity of the Office team.
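For anyone curious why this is so easy to get wrong: in a mixed-direction line, the stored (logical) order and the drawn (visual) order disagree, so "leftmost pixel" and "first character" point at different places. Python's unicodedata shows the direction classes the bidi algorithm works from:

```python
import unicodedata

# "abc" is stored first but the Hebrew letters after it are drawn
# right-to-left, so visual position and string position diverge --
# exactly the gap between "visible cursor" and "insertion point".
s = "abc \u05d0\u05d1\u05d2"  # "abc " followed by alef, bet, gimel
print([unicodedata.bidirectional(c) for c in s])
```

The renderer has to run the full Unicode bidi algorithm over these classes before it knows where anything lands on screen, and cursor-movement code has to invert that mapping.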
Disabling AA is increasingly common as screen resolutions increase. Particularly on mobile devices, pixels are so small that you can’t see them and so the benefit of AA relative to the cost is quite low.
There’s one other aspect that isn’t covered here, which should live in section 5. Antialiasing influences your placement decisions. This is why people moving from Windows to Mac think the text looks blurry, people moving in the opposite direction think the text has weird spaces. Apple’s text engine places the glyphs exactly where they should go in the coordinate system and then antialiases them. Microsoft’s moves the glyphs a small amount (<1 pixel) in the coordinate system to try to align lines on pixel boundaries. This means that the glyphs on Windows need less antialiasing and look crisper, but if two adjacent characters have these nudges in opposite directions they can be one pixel closer together or further apart than they should be. That’s less of an issue on high DPI displays, but on older ones it was very visible.
Korean is nothing but ligatures.
There’s a lot of discussion in https://lobste.rs/s/krune2/font_regional_variants_are_hard
Surely every new character will be a SHA512 of the result of adding the corresponding glyph data on a blockchain?
Previously (3yrs ago):
Text Rendering Hates You
ah my b, I was surprised to see it hadn’t been submitted before :)
That’s fine, it’s ok to re-submit after 6 months anyway, and the domain has changed in the meantime.
I just remembered the title and searched around :D
In fairness, you probably also frequently hate text rendering.
And the (also great) editing companion to this post: https://lord.io/text-editing-hates-you-too/
Why in the bloody hell is my suggest button not visible? Title should be amended with the year.