1. 33
  1. 9

    I would actually love to see C2x include a new stdio that defines a new type (glyph_t or rune_t, perhaps) which holds a UTF-32/UCS-4 code point. Character literals already have type int, and multi-character literals currently have only an implementation-defined value, so changing the type of a character literal to this new type would not be a significant change (we would just keep the current semantics when assigning to a char, but when assigning to a glyph_t you would get the whole multi-byte value). This should also coincide with changing the type of string literals to const glyph_t [] (which, just like character literals, could keep the current semantics when used to initialize a char * / char [] but would have UCS-4 semantics for the new type).

    Then, this new stdio (stdunicode.h?) can define a few sensible implementations of the stdio functions which operate on glyph_t [] / glyph_t *.

    I know this is unlikely to happen since it would break a fair bit of current code, but it would dramatically improve the native support for unicode in C (which, at the moment, is abysmal).

    I know the article criticizes this, but being able to relatively-reliably iterate over a string can be really helpful, and I would personally love to have the option to bite (no pun intended) 4-bytes-per-character in return for that ability.
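
    For illustration, here is a minimal sketch of what iterating by code point takes today, assuming UTF-8 input. `glyph_t` is just a typedef for `uint32_t` borrowed from the proposal above, and `utf8_to_glyphs` is a made-up name, not an existing or proposed library function:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical type name from the proposal above; not a standard C type. */
    typedef uint32_t glyph_t;

    /* Decode NUL-terminated UTF-8 into an array of code points.
     * Returns the number of code points written, or (size_t)-1 if the
     * input is malformed or `cap` is too small. */
    size_t utf8_to_glyphs(const char *s, glyph_t *out, size_t cap)
    {
        /* Smallest code point that legitimately needs 1, 2, 3, 4 bytes. */
        static const glyph_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
        const unsigned char *p = (const unsigned char *)s;
        size_t n = 0;

        while (*p) {
            glyph_t cp;
            int extra; /* number of continuation bytes expected */

            if (*p < 0x80)                { cp = *p;        extra = 0; }
            else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
            else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
            else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
            else return (size_t)-1; /* stray continuation or invalid lead byte */
            p++;

            for (int i = 0; i < extra; i++, p++) {
                /* Also catches a NUL that truncates the sequence. */
                if ((*p & 0xC0) != 0x80) return (size_t)-1;
                cp = (cp << 6) | (*p & 0x3F);
            }

            /* Reject overlong forms, surrogates, and values past U+10FFFF. */
            if (cp < min_cp[extra] || cp > 0x10FFFF ||
                (cp >= 0xD800 && cp <= 0xDFFF))
                return (size_t)-1;

            if (n == cap) return (size_t)-1;
            out[n++] = cp;
        }
        return n;
    }
    ```

    After a successful call, `out[i]` is the i-th code point, which is the kind of indexing a `glyph_t []` literal would give for free. It still costs four bytes per code point, as noted above.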

    1. 1

      “[…] being able to relatively-reliably iterate over a string can be really helpful, and I would personally love to have the option to bite (no pun intended) 4-bytes-per-character in return for that ability.”

      When is this useful?

    2. 1

      Many of the stated arguments and excuses for using UTF-8 are good arguments for not using Unicode at all.

      1. 10

        What would you suggest as an alternative?

      2. [Comment removed by author]

        1. 13

          “ASCII has something UTF-8 doesn’t.”

          Such as? UTF-8 can be treated as a superset of ASCII.

          1. [Comment removed by author]

            1. 7

              In the places where you need one byte per codepoint, you’re probably in the ASCII character set, which again is inside UTF-8. It’s a strictly-better replacement.

              1. [Comment removed by author]

                1. 11

                  It’s slightly simpler to assume ASCII, but it’s rarely true. If the strings you’re processing come from user input, and you want to process them without mangling the text, you need to bite the bullet and deal with some form of Unicode. (Either that, or deal with the mess that is local, language-specific encodings like Shift-JIS. Thankfully, those are dying or dead.)

                  Once you accept that, it’s a small step to just make your APIs work with UTF-8 everywhere. It turns out that handling it correctly without mangling input is generally not that hard.

                  1. 8

                    I think what rain1 is saying is that there may be situations where you would prefer to garble user input than give up the one-byte-per-codepoint guarantee.

                    While he is technically correct that such a thing is possible, it’s difficult to imagine such a situation that isn’t horribly contrived. Perhaps in situations where the strings all come from internal hard-coded sources or when you never have to accept string input from the user?

              2. 8

                That sounds less like a text format and more like a binary stream. Not a lot of text can be safely encoded in ascii…

                1. 3

                  What does that buy you?

            2. 0

              The problem is when everyone starts using Unicode when they don’t need to, because they are assuming it’s going to be supported. We already see this with web pages using Unicode characters and custom fonts for icons.

              This can also be compared to how sending malformed HTML became standard and accepted. Now everyone has to use gigantic XML/HTML parsing libraries like libxml2 (try and guess how many lines of code and bugs are in that thing) to have any chance at writing web clients. It’s the same with Unicode or any other similar standard.

              It’s not a good idea to just “use it everywhere”. This is the mentality that keeps pushing the ability to control our own computing experience even further out of reach.

              1. 7

                net/html for go is a complete parser in 6700 lines (and that’s in go, a language that sacrifices expressiveness for readability).

                libxml2 is huge because of the batteries-included mindset behind it (it includes an FTP client - I kid you not - to parse XML).

                1. 10

                  “… using unicode when they don’t need to …”

                  The problem is that some people still think that ASCII is enough and nobody would ever use unicode in the XY context. Then, out of the blue, CKAN fails to decode some of our uploads and even fails to tell the user what is wrong with them. UTF-8 everywhere, please.

                  1. 3

                    Simply handling/not mangling Unicode adds very little complexity. At its core, Unicode code points are the same as ASCII, just with more values. UTF-8 is a simple packing scheme for them. Somehow segregating and differentiating between Unicode and non-Unicode text is far more complex than simply handling Unicode everywhere.

                    Doing Unicode text rendering is difficult, but it’s no harder than supporting the myriad of locale-specific encodings that existed before Unicode.
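
                    To make the “simple packing scheme” point concrete, here is a rough sketch of encoding a single code point as UTF-8. The function name `utf8_encode` is made up for this example and is not from any particular library:

                    ```c
                    #include <stddef.h>
                    #include <stdint.h>

                    /* Encode one code point as UTF-8 into `out` (room for 4 bytes).
                     * Returns the number of bytes written, or 0 if `cp` is a
                     * surrogate or lies beyond U+10FFFF. */
                    size_t utf8_encode(uint32_t cp, unsigned char out[4])
                    {
                        if (cp < 0x80) {            /* ASCII: stored unchanged */
                            out[0] = (unsigned char)cp;
                            return 1;
                        }
                        if (cp < 0x800) {           /* 110xxxxx 10xxxxxx */
                            out[0] = (unsigned char)(0xC0 | (cp >> 6));
                            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
                            return 2;
                        }
                        if (cp < 0x10000) {         /* 1110xxxx 10xxxxxx 10xxxxxx */
                            if (cp >= 0xD800 && cp <= 0xDFFF)
                                return 0;           /* surrogates are not encodable */
                            out[0] = (unsigned char)(0xE0 | (cp >> 12));
                            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
                            return 3;
                        }
                        if (cp <= 0x10FFFF) {       /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
                            out[0] = (unsigned char)(0xF0 | (cp >> 18));
                            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
                            return 4;
                        }
                        return 0;
                    }
                    ```

                    The first branch is why ASCII text is already valid UTF-8: any value below 0x80 passes through as the same single byte.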

                  2. 1

                    Isn’t UCS-2 also advantageous because of uniform codepoint size? I mean, sometimes you need it… Right? Maybe not.

                    1. 3

                      String implementations based on UTF-16 or UCS-2 typically return lengths and allow indexing in terms of code units, not code points. It’s important to emphasise that neither code points nor code units correspond to anything an end user might recognise as a “character”; the things users identify as characters may in general consist of a base code point and a sequence of combining characters — Unicode refers to this as a grapheme cluster — and as such, applications dealing with Unicode strings, whatever the encoding, have to cope with the fact that they cannot arbitrarily split and combine strings.

                      via Wikipedia
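
                      A small sketch of the code unit versus code point distinction, assuming a UTF-16 buffer held as `uint16_t`. The name `utf16_code_points` is invented for the example, and unpaired surrogates are counted leniently as one code point each:

                      ```c
                      #include <stddef.h>
                      #include <stdint.h>

                      /* Count code points in a UTF-16 buffer of `len` code units.
                       * A high surrogate followed by a low surrogate counts once;
                       * an unpaired surrogate is counted on its own. */
                      size_t utf16_code_points(const uint16_t *s, size_t len)
                      {
                          size_t count = 0;
                          for (size_t i = 0; i < len; i++) {
                              int high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
                              if (high && i + 1 < len &&
                                  s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                                  i++; /* skip the low half of the pair */
                              count++;
                          }
                          return count;
                      }
                      ```

                      For the units `0x0065, 0x0301, 0xD83D, 0xDE00` (an “e” plus a combining acute accent, then U+1F600) this gives 3 code points from 4 code units, while a user would see two characters. Counting those requires grapheme cluster segmentation, which this sketch does not attempt.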