1. 3

    1. 1

      I know not what this “QString” thing is. Part of some nonstandard library. Odd thing to use in an example.

      You can write \N{NAME} in C++23 source and it compiles down to a single Unicode character, which becomes however many bytes it needs to be.

      I’m a bit confused by that — cppreference is vague about what encoding is used. It kind of implies that in a regular string literal it’s UTF-8, but it doesn’t say so, it just gives UTF-8 as an example:

      A universal character name in a narrow string literal or a 16-bit string literal may map to more than one code unit, e.g. \U0001f34c is 4 char code units in UTF-8 (\xF0\x9F\x8D\x8C) and 2 char16_t code units in UTF-16 (\xD83C\xDF4C).

      So either it’s always UTF-8, or it’s whatever encoding the source file uses.
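
      One way to check the part the quote does pin down is to count code units at compile time. This is only a sketch: the constant names are mine, and the plain `"..."` case is deliberately left unasserted, because its encoding (the ordinary literal encoding) is implementation-defined and is exactly the thing in question here.

      ```cpp
      #include <cstddef>

      // Code-unit counts for U+1F34C, excluding the terminating null.
      // u8"" literals are always UTF-8 and u"" literals are always UTF-16.
      constexpr std::size_t utf8_units  = sizeof(u8"\U0001F34C") - 1;
      constexpr std::size_t utf16_units = sizeof(u"\U0001F34C") / sizeof(char16_t) - 1;

      // A plain literal uses the implementation-defined ordinary literal
      // encoding, so this count depends on the compiler and its flags.
      constexpr std::size_t narrow_units = sizeof("\U0001F34C") - 1;

      static_assert(utf8_units == 4, "U+1F34C takes four UTF-8 code units");
      static_assert(utf16_units == 2, "U+1F34C takes a UTF-16 surrogate pair");
      ```

      If `narrow_units` also comes out as 4 on your compiler, that tells you its default for ordinary literals is UTF-8, not that the standard requires it.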

      1. 2

        The painful thing is that C (and C++, which inherited this from C) does not mandate a character encoding for source. This was to allow things like ASCII or EBCDIC as inputs. I think C can work with 6-bit encodings. If you provide anything else, the compiler has to do some translation. This leads to fun if you write a character in UTF-8 but the person running the compiler has an 8-bit code page locale set. I ran into some very interesting bugs from non-ASCII characters in string literals when the code was built on a machine with a Thai or Japanese locale. GCC would parse the character in the user’s locale and then emit it as the corresponding code point in some other encoding (UTF-8?). Or, in the opposite direction, they’d type a character that their editor would save according to their locale’s charset and clang would pass it through unchanged. In both cases, the code expecting UTF-8 would then do the wrong thing (sometimes very excitingly if the byte sequence wasn’t even valid UTF-8).

        C++20 finally provided string formats that are guaranteed to output UTF-8 and, as long as your source is also UTF-8, everything is fine. I wish C++26 would just say ‘source code must be UTF-8’. I doubt there’s much code that’s deliberately something else, and it’s a one-time mechanical translation to fix.
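
        To make the "not even valid UTF-8" failure concrete, here is a minimal sketch of a well-formedness check. The function name `is_valid_utf8` is hypothetical and not taken from any library, and the byte strings in the usage note are illustrative stand-ins for what a legacy 8-bit code page would put in a literal, not the actual bytes from the bugs I hit.

        ```cpp
        #include <cstddef>
        #include <string>

        // Returns true if `s` is well-formed UTF-8. Rejects stray continuation
        // bytes, truncated sequences, overlong forms, surrogates, and code
        // points above U+10FFFF.
        bool is_valid_utf8(const std::string& s) {
            // Smallest code point that legitimately needs a sequence of each length.
            static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
            const std::size_t n = s.size();
            std::size_t i = 0;
            while (i < n) {
                const unsigned char b0 = static_cast<unsigned char>(s[i]);
                std::size_t len;
                unsigned long cp;
                if (b0 < 0x80)                { len = 1; cp = b0; }
                else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
                else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
                else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
                else return false;              // stray continuation or invalid lead byte
                if (i + len > n) return false;  // sequence runs off the end
                for (std::size_t k = 1; k < len; ++k) {
                    const unsigned char b = static_cast<unsigned char>(s[i + k]);
                    if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
                    cp = (cp << 6) | (b & 0x3F);
                }
                if (cp < min_cp[len]) return false;               // overlong encoding
                if (cp > 0x10FFFF) return false;                  // beyond Unicode range
                if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
                i += len;
            }
            return true;
        }
        ```

        With this, `is_valid_utf8("caf\xC3\xA9")` accepts the UTF-8 encoding of é, while `is_valid_utf8("caf\xE9")` rejects the single byte that Latin-1 would use for the same character, because 0xE9 looks like the start of a three-byte sequence that never arrives.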

        1. 1

          I think C can work with 6-bit encodings.

          C’s basic character set has 74 characters and each basic character must be encoded as one byte. So it does not quite work for 6-bit character sets, which only have 64 code points. I think historically 6-bit charsets were upper-case only, which would make it tricky to write C.

          Which reminds me that Unix tty drivers had support for upper-case-only terminals - there were oddities like, if you gave login an upper-case-only username, it would switch the terminal into case smashing mode. I have no idea how well this worked in practice, or if it was cripplingly awkward.

          Back to C, I think the most restricted character sets it aimed to support were ISO 646, i.e. the national variants of ASCII that had things like Nordic letters instead of []\

      2. 1

        QString is Qt’s string type and this guy is a Qt / KDE (desktop built on Qt) developer, so I guess that’s some context that is missing. It’s like CString.