1. 23
  1.  

  2. 9

    many compilers make only a token effort

    I chuckled.

    1. 6

      In my opinion the root cause of this bug is that when a white anglophone developer sees a terminal they immediately interpret it as this idealized cartesian plane where they can lay out any writing they want in neatly-spaced characters that behave in “predictable” ways (in other words, behave exactly like English).

      I feel dirty even making this complaint, but does the word white need to be in there? I want to eliminate white centering and implicit bias as much as anyone, but I feel anglophone alone conveys all the necessary meaning without going down “a rabbit hole of edge cases and bugs”.

      1. 9

        I feel dirty even making this complaint, but does the word white need to be in there?

        Author here. This is a fair question.

        I added it specifically to highlight the fact that one of the fundamental characteristics of whiteness is to claim “default” status. Anything done by white people is considered “normal” in white-dominated societies and anything done by people of color is considered special or extra. People of color are more likely to be bilingual and recognize that perspectives other than the default exist.

        This is of course a much broader problem that affects a lot more than computing, but to assume that computing systems which were developed in white-dominant societies could somehow avoid these biases is wishful thinking. If you want to read more about this, I recommend the book White Fragility by Robin DiAngelo; I found it to be very insightful.

        1. 6

          Not sure if I should be commenting, but as a native speaker of a relatively tiny language family, living in a white-dominated society of Finnish speakers, I couldn’t really see anyone considering the Finnish language as “default” or “normal” in technology. Rather, most of us recognize that English is the default and we have to adapt. What’s more, since this is a bilingual country, many user interfaces need to be available in three different languages (Finnish, Swedish and English) just to be usable for laypeople.

          I’m privileged in many ways but I and many like me are not pushing an English language hegemony for our own benefit. Rather, we’re working with and in English just to be able to communicate. Yes, we’ve adapted to the bias but it is not our bias.

          1. 4

            That’s true; this dynamic is mostly about how things are in the US, where the technologies in question were originally developed. I could have been more specific about that. Edit: updated the post to reflect this.

            1. 4

              Another Finn here. That’s why, when people talk about “whiteness”, I often realise it’s an American thing, which then trickles down to everywhere else..

            2. 6

              This is racist. I know what you’re pointing at (computing in general is anglocentric, which leads to a lot of confusion and misconceptions), but it has nothing to do with skin colour.

              Your default-argument could be easily turned around with regard to non-white societies assuming their defaults and looking critically at white people’s behaviour within them. I could easily give you numerous examples from Asia and Africa.

              1. 1

                And this is why I hesitated to even bring it up.

              2. 1

                I know exactly what white fragility is, and this was not that.

                1. 2

                  I didn’t say this was white fragility; I said the book titled “White Fragility” contained a good description of this phenomenon.

              3. 1

                Relatedly, I’d like to know whether most programmers whose native language is remotely close to ASCII but explicitly not English… can be lumped together/simplified as anglophone, since they probably learned programming languages with English keywords and/or read a ton of material in English. My main point being… is it only English? Off the top of my head I can’t think of any of these problems (RTL, or the abugida) occurring with the European languages I have at least a cursory knowledge of. German, Italian, French, Spanish… can the Dutch ij be a single-width character, or are ligatures a separate problem here?

                1. 3

                  Per Wikipedia, Dutch ij is generally written as two separate characters in Unicode, and combining them is a ligature that the font may or may not choose to apply. There’s also an ij ligature character, but its use is deprecated. So when you get to what the proper solution is, you have to ask the font what the character metrics are for a particular string, basically asking “how many cells do I have to advance to point to a given character in (let [无为] ...” instead of just being able to count bytes or characters; properly handling Dutch ij would be done the same way. This is complicated (and kind of computationally expensive), and terminals have no real way of doing it. In general, though, the Latin, Cyrillic, and Greek scripts are all pretty easy to shove into a grid. I suspect part of this is because 500 years of printing presses in Europe have slowly favored text styles that are easy to print.
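
                  The ij situation is easy to check from stdlib Python (a quick illustration of the point above, not part of the original comment): the deprecated single-code-point ligature compatibility-decomposes back to the two-letter sequence, so there is no stable one-cell “ij” at the encoding level.

```python
import unicodedata

# The deprecated single-code-point ligature for Dutch "ij".
lig = "\u0133"
print(unicodedata.name(lig))                    # LATIN SMALL LIGATURE IJ
# NFKC normalization decomposes it back to the two letters "i" + "j",
# which is why its use is deprecated: fonts ligate "ij" themselves.
decomposed = unicodedata.normalize("NFKC", lig)
print(decomposed)       # ij
print(len(decomposed))  # 2 code points
```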

                  However, apparently you can take fancier writing systems like Arabic and stick them into a grid if you want, and the result doesn’t even necessarily look terrible! You still have to do the whole “ask the font for character metrics” thing, though. For instance, the example text from that font includes طَّ, which takes up one cell but is three Unicode code points: ط is the base letter, and ّ and َ are also kinda-sorta-letters that attach to it, but it appears that there are no explicit combining characters like you might get in French or Norwegian or whatever. How the letters are placed relative to each other is only encoded in how the font is rendered.
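
                  The structure of that one-cell cluster is visible from stdlib Python (an illustrative aside, not part of the original comment): the base letter has combining class 0, while the two marks carry nonzero combining classes, which is how a renderer knows they attach to the preceding base rather than advancing the cursor.

```python
import unicodedata

# ARABIC LETTER TAH + ARABIC SHADDA + ARABIC FATHA: one visual cell,
# three code points. combining() is 0 for base characters and nonzero
# for marks that stack onto the preceding base.
cluster = "\u0637\u0651\u064e"
for ch in cluster:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<20} ccc={unicodedata.combining(ch)}")
```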

                  Maybe we need a font aware terminal protocol?

                  1. 1

                    Maybe we need a font aware terminal protocol?

                    While not entirely at the stage I want, this was largely the reason why the server end of arcan-shmif sends fonts (and the desired “readable” pt size for proper DPI-aware rasterization) as part of the synchronous “pre-roll” stage when a client first makes a connection. The decoupling ‘terminal replacement’ API similarly provides oracle functions (“does my current font support this glyph?”), and the output packing format does RTL/LTR per line.

                    Do note that for GSUB it is not only the metrics for “advance” that are needed. The whole thing is actually a state machine that can consume arbitrary sequences and spit out font-local indices. This means that you can have a font where feeding it “Windows” translates to indices that spell out “BSD”. How far does this rabbit hole go? Well, “my font is a game” is a thing: https://www.coderelay.io/fontemon.html
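
                    To make the “state machine” point concrete, here is a toy sketch (my own illustration; the table and glyph IDs are made up, and real OpenType GSUB lookups are far richer): a longest-match substitution pass that maps input sequences to font-local glyph indices, which is exactly the mechanism that lets a mischievous font turn the sequence “Windows” into glyphs that render as something else entirely.

```python
# Toy GSUB-style ligature substitution: repeatedly take the longest
# matching input sequence and replace it with a font-local glyph ID.
# The table and IDs are hypothetical, for illustration only.
SUBST = {
    ("f", "f", "i"): 1001,   # ffi ligature glyph
    ("f", "i"): 1002,        # fi ligature glyph
    tuple("Windows"): 2001,  # a font may map a whole word to one glyph
}
MAX_SEQ = max(len(k) for k in SUBST)

def shape(text):
    """Return a list of glyph IDs; unmapped characters pass through as-is."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate sequence first, like a greedy lookup.
        for n in range(min(MAX_SEQ, len(text) - i), 0, -1):
            seq = tuple(text[i:i + n])
            if seq in SUBST:
                out.append(SUBST[seq])
                i += n
                break
        else:
            out.append(text[i])  # no substitution: keep the character
            i += 1
    return out

print(shape("file"))     # the "fi" pair collapses to one ligature glyph
print(shape("Windows"))  # the whole word collapses to a single glyph ID
```

The takeaway for terminals: byte or code-point counts say nothing about how many output glyphs (and hence cells) a string produces once the font has had its say.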

              4. 4

                More insight on some themes in recent posts here: Unicode encoding, grapheme clusters, and rendering text in a monospace grid.

                1. 2

                  Seems like a pretty decent solution.

                  1. 1

                    To give a general overview, given that this is my area of research: the Unicode Consortium has pretty much given up with regard to rendered text width. Unicode Standard Annex #11 (East Asian Width) explicitly states:

                    The East_Asian_Width property is not intended for use by modern terminal emulators without appropriate tailoring on a case-by-case basis. Such terminal emulators need a way to resolve the halfwidth/fullwidth dichotomy that is necessary for such environments, but the East_Asian_Width property does not provide an off-the-shelf solution for all situations. The growing repertoire of the Unicode Standard has long exceeded the bounds of East Asian legacy character encodings, and terminal emulations often need to be customized to support edge cases and for changes in typographical behavior over time.

                    and thus pushes this topic into the domain of font shaping and rendering, which has become a monoculture; the Unicode Consortium pretty much simply follows what HarfBuzz does and bakes it into the standard.

                    This problem is hard to solve and requires deep interaction with the intricacies of font shaping and rendering. The problem is that you don’t have access to this information when you simply output to the terminal. An alternative would be to base the grid on a per-grapheme-cluster basis and “distort” it accordingly. This would be a cool project, but definitely quite a challenge.
                    This problem is hard to solve and requires a deep interaction with the intricacies of font-shaping and -rendering. The problem is that you don’t have access to this information when you simply output to the terminal. An alternative would be to base the grid on a per-grapheme-cluster-basis and “distort” it accordingly. This would be a cool project, but definitely quite a challenge.