Threads for semiotics

  1. 4

    I see that ASN.1 was mentioned, but what about the ASN.1 encoding control notation? That covers the “full control over encoding” gap. As far as I know, it’s expressive enough to handle most wire formats, targeted at serialization, and documentable.

    1. 5

      I really enjoy Mr. Beattie’s talks but I also consider them to be more performance art and edutainment than tech talk per se.

      I think many of us can agree that “plaintext” is any body of unadorned unicode conformant characters with no other markup applied.

      1. 6

        There is a good point that “plain text” in this sense is fundamentally broken in an internationalized world, though. The talk doesn’t go into much detail, even though it’s important to the collation section—but fairly fundamentally, Unicode text needs to be paired with language/locale information to be usable—without something like that you can’t render (thanks to Han unification), give to a screenreader (both Han unification and many other shared scripts including Latin), collate accurately (as demonstrated in the talk, tailorings to UTS #10 provided by e.g. the CLDR are needed), etc.

      1. 41

        This is a nice description of the low level thing, but on the high level, I think there’s value in no longer thinking of utf-8 as describing “characters”, but rather to think of it in terms of commands for some kind of printing virtual machine. This isn’t new to unicode: ascii also works this way.

        Consider the backspace “character”. This is a code point that is encoded by code units which are represented as octets. But what is it really? Well, I’d say it is a command to the printing machine to go back a space.

        When you get into the more complicated grapheme clusters, this mental model continues to work: what’s the meaning of something like the gender or skin tone markers in emojis? They’re just commands to the printer to change what it is doing. They’re the adjectives in the printer command sentence.

        Thinking of strings like this also helps to explain why you might not be able to easily jump around them or reverse them and so on - and again, it is nothing new to unicode!
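
To make the "printing virtual machine" idea concrete, here is a minimal sketch in Python. It is a toy model, not any real terminal or printer: the `render` function and its overwrite semantics are assumptions made for illustration (a real typewriter would overstrike rather than replace).

```python
def render(commands: str) -> str:
    """Interpret a string as commands to a toy one-line printer.

    Assumed semantics (hypothetical, for illustration only):
    backspace moves the cursor left, carriage return moves it to
    column 0, and any other character is written at the cursor,
    replacing whatever was there.
    """
    line = []
    cursor = 0
    for ch in commands:
        if ch == "\b":
            cursor = max(0, cursor - 1)
        elif ch == "\r":
            cursor = 0
        else:
            if cursor < len(line):
                line[cursor] = ch   # overwrite an existing cell
            else:
                line.append(ch)     # extend the line
            cursor += 1
    return "".join(line)


print(render("ab\bc"))        # the "b" is overwritten by "c"
print(render("ab\bc"[::-1]))  # reversing the commands, not the output
```

Reversing the command string does not reverse the rendered line: `"ab\bc"` renders as `ac`, while its reversal `"c\bba"` renders as `ba` rather than `ca`.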

        1. 17

          You may be predicting a future article ;) I want to continue to explore this space, especially how it interacts with other areas of stacks (IDNA, NFKC). Glad you enjoyed this first article!

          1. 2

            Excellent! I’ll look forward to reading that :)

          2. 10

            thinking of utf-8 as describing “characters”, but rather to think of it in terms of commands for some kind of printing virtual machine

            To be pedantic, it is unicode itself that you can think of in that way. UTF-8 is just commands to a tiny little decoding state machine that reconstructs those unicode commands (or a UnicodeDecodeError instead :D)
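
A rough sketch of that "tiny little decoding state machine", in Python. The function name `decode_utf8` is made up for this example, and the sketch is deliberately simplified: it checks start and continuation bytes, but it does not reject overlong three- and four-byte forms, encoded surrogates, or values above U+10FFFF, all of which a conforming decoder must refuse.

```python
def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (simplified sketch)."""
    code_points = []
    needed = 0   # continuation bytes still expected
    value = 0    # code point being assembled
    for byte in data:
        if needed == 0:
            if byte < 0x80:                 # one-byte sequence (ASCII)
                code_points.append(byte)
            elif 0xC2 <= byte <= 0xDF:      # start of a two-byte sequence
                needed, value = 1, byte & 0x1F
            elif 0xE0 <= byte <= 0xEF:      # start of a three-byte sequence
                needed, value = 2, byte & 0x0F
            elif 0xF0 <= byte <= 0xF4:      # start of a four-byte sequence
                needed, value = 3, byte & 0x07
            else:
                raise ValueError(f"invalid start byte 0x{byte:02x}")
        else:
            if byte & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte 0x{byte:02x}")
            value = (value << 6) | (byte & 0x3F)
            needed -= 1
            if needed == 0:
                code_points.append(value)
    if needed:
        raise ValueError("truncated sequence")
    return code_points


text = "a\u00e4\u20ac\U0001F600"
print([f"U+{cp:04X}" for cp in decode_utf8(text.encode("utf-8"))])
```

Python's built-in `bytes.decode("utf-8")` does the full job, and raises `UnicodeDecodeError` (a subclass of `ValueError`) on malformed input.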

            1. 2

              yes, indeed. I’m actually usually less sloppy about these things! (I actually use a decent amount of utf-16 too so it is often on my mind.)

            2. 7

              This isn’t new to unicode: ascii also works this way.

              I would argue that not thinking entirely in this way is what is new to unicode. Traditional character encodings arose out of the command sets used to control printers, which leads to precisely this sort of mindset, and all kinds of terrible hacks, such as encoding characters with double-acute accents as . The category Cc control codes continue to exist because Unicode wanted to be compatible with ASCII; Unicode would really rather have replaced them with a markup language. The Unicode standard doesn’t actually assign meaning to the control codes other than the record separators; they are provided for compatibility:

              The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. […] In general, the use of control codes constitutes a higher-level protocol and is beyond the scope of the Unicode Standard.

              (Unicode 14.0, §23.1, p. 900)

              Things like unicode codepoints for arbitrary text superscripts and subscripts have been proposed, but haven’t been implemented since they’re viewed as really-the-purview-of-markup, not something that Unicode should carry itself.

              Thinking of strings like this also helps to explain why you might not be able to easily jump around them or reverse them and so on - and again, it is nothing new to unicode!

              The really special thing about unicode is that because it’s intended to encode the semantics of text more than it is to encode printer commands, you can do these things! You also need locale data, including the collation files that are the input to UTS#10, because code points are divided at pretty much the “one semantic grapheme that is shared across languages, but may behave differently in each” level (see also: CJK Unified Ideographs…), but a code point plus a locale should give you some meaningful semantics that affect both shaping and collation. (Although there are, of course, exceptions: one particularly common example of insufficient semantics is the unfortunate diaeresis/umlaut situation: two diacritics with distinct meaning but which now have identical appearance & were consequently encoded as a single code point, which can cause certain issues when working with historical texts.)

              By far the majority of Unicode code points don’t really have a simple operational interpretation other than “act as if this thing were here”, which in both shaping and collation can have arbitrarily complex effects on the actual printed text, none of which is indicated or controlled at an operational level by the text in question. “Act as if this thing were here” is not anything like “typeset this character” (for shaping) in complex text layout languages! For example, in Devanagari, consider the tailored grapheme cluster क्षि, which is made up of four code points:

              • 0915 ( क ) DEVANAGARI LETTER KA
              • 094D ( ् ) DEVANAGARI SIGN VIRAMA
              • 0937 ( ष ) DEVANAGARI LETTER SSA
              • 093F ( ि ) DEVANAGARI VOWEL SIGN I

              Even considering only shaping, its shaped layout does not look anything like क्‌षि, which is what you get if you try to simply affix the vowel signs to the characters preceding them: rather, KA/VIRAMA/SSA ligate into a new character (क्ष), with this behaviour driven by quite a lot of smarts in the shaping algorithm, and indicated not at all by the text—which uses an identical sequence of letter and vowel sign characters to encode this as it does to encode क् षि (the latter containing a space between the KA/VIRAMA and SSA/I to make them parts of different words).
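
For anyone who wants to poke at this, a small Python sketch using only the standard library's `unicodedata` module lists the four code points. The variable names are mine, and the strings are written with escapes so that nothing depends on how an editor renders them.

```python
import unicodedata

# KA + VIRAMA + SSA + VOWEL SIGN I, written as escapes.
cluster = "\u0915\u094d\u0937\u093f"

for ch in cluster:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The two-word variant differs only by a space after the VIRAMA;
# the letter and vowel sign code points are the same.
spaced = "\u0915\u094d \u0937\u093f"
print(spaced.replace(" ", "") == cluster)
```

Nothing in either string says "form a ligature here"; that decision is left to the shaping engine and the font.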

              The collation data is much closer to a set of instructions which drive an abstract machine (the UTS#10 collation algorithm), but the code points themselves do not encode this information (and can’t due to the aforementioned unification of grapheme elements across languages: for example, in Slovak, ch is considered a single character (and should sort and affect editing as such), but is encoded as two code points (U+0063 LATIN SMALL LETTER C followed by U+0068 LATIN SMALL LETTER H)).
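
A toy illustration of the Slovak "ch" point, in Python. This is emphatically not the real Unicode collation algorithm or CLDR data: the `ALPHABET` table and `slovak_like_key` function are hypothetical, cover only unaccented lowercase Latin letters, and exist solely to show that the ordering has to come from locale data rather than from the code points.

```python
# Simplified alphabet: plain Latin letters with "ch" as one element
# placed after "h" (accented letters are ignored in this sketch).
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {element: i for i, element in enumerate(ALPHABET)}


def slovak_like_key(word: str) -> list:
    """Build a sort key that treats "ch" as a single collation element."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            element, i = "ch", i + 2
        else:
            element, i = word[i], i + 1
        # Unknown elements sort after everything in the table.
        key.append(RANK.get(element, len(RANK)))
    return key


words = ["chata", "cena", "hora", "dom"]
print(sorted(words))                       # code point order
print(sorted(words, key=slovak_like_key))  # "ch" sorts after "h"
```

Sorting by code point puts "chata" right after "cena"; with the tailored key it moves to the end, after "hora".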

              There are a few exceptions, which do have obvious operational interpretation, probably most famously the bidirectional text control characters. Again, the bidi control characters are often considered deprecated; the Unicode bidirectional algorithm standard (UAX#9) and the W3C both recommend that one use higher-level markup to control nesting/etc of bidirectional text elements if at all possible, rather than attempt to embed it in the text stream via the bidi control characters.

              The various joiner and non-joiner code points are also a bit like instructions to the shaping, collation, and normalizing algorithms about how to determine particular boundaries, but are also usually better viewed with semantic intent.

              1. 2

                You’re probably aware, but this was also codified in some international character set standards - you could use backspace as a combining character to create diacritics.

                e.g. you could use the sequence a<BS>” to mean ä

                I’m struggling to find the standard itself, but it’s referenced here:


                A character that, in some regions, could be combined with a previous character as a diacritic using the backspace character, which may affect glyph choice.

                1. 1

                  Yes, indeed. Makes perfect sense thinking of the old typewriter model! But makes questions like “what should string.length return?” pretty tricky.

                  1. 1

                    Indeed. Already have that problem in spades—do you count code units, code points, graphemes, ?

                    (For something like that, you want ‘measure’, which also accounts for kerning &c.)
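
A quick Python sketch of how far apart those counts can be. The `lengths` helper is made up for this example. Grapheme cluster counting is left out because the standard library has no text segmentation; that would need a third-party implementation of the Unicode segmentation rules.

```python
def lengths(s: str) -> dict:
    """Count a string three different ways."""
    return {
        "utf8_code_units": len(s.encode("utf-8")),
        # utf-16-le avoids the byte order mark; two bytes per code unit.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }


a_umlaut = "a\u0308"                # "a" + COMBINING DIAERESIS
thumbs_up = "\U0001F44D\U0001F3FD"  # thumbs up + skin tone modifier

print(lengths(a_umlaut))
print(lengths(thumbs_up))
```

Both strings are something a reader would call one "character", yet the first is 3, 2 and 2 units long and the second 8, 4 and 2, depending on what is counted.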