1. 24
    1. 8

      This is why in Roc we don’t have a “length” function for strings: if the strings are Unicode, it’s easy to have an inaccurate mental model of what a function named “length” is telling you about the string.

      Instead we have countGraphemes:

      https://www.roc-lang.org/builtins/Str#countGraphemes

      1. 5

        Have you/the Roc team done some research into whether counting graphemes is on average more useful than counting the clusters? I’m considering this question for Cara - so far I plan to make the String API work on clusters instead.

        Btw your documentation says “Counts the number of extended grapheme clusters in the string.”, which seems to be wrong to me if it gives 4 for “👩‍👩‍👦‍👦”

        (Swift counts clusters and gives 1 for “👩‍👩‍👦‍👦”)
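
        For reference, here’s what that emoji actually decomposes into (a quick Python sketch; the grapheme count uses the third-party regex module’s \X, which follows the extended grapheme cluster rules and agrees with Swift’s count of 1):

        import regex  # third-party module; \X matches an extended grapheme cluster

        family = "👩\u200D👩\u200D👦\u200D👦"        # woman, ZWJ, woman, ZWJ, boy, ZWJ, boy
        print(len(family))                         # 7 code points (4 emoji + 3 joiners)
        print(len(regex.findall(r"\X", family)))   # 1 extended grapheme cluster, like Swift

        So a result of 4 looks more like “emoji, ignoring the joiners” than extended grapheme clusters.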

    2. 2

      Hm I didn’t realize “Scalar” means “Code Point but not any of the surrogate values”

      Nice link to the Unicode glossary!

      I wonder what the cleanest/shortest JSON string (with surrogates) -> UTF-8 encoded text decoder looks like?

      I’m about to write one of those :)
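
      A rough sketch of one in Python (a hypothetical helper, not a full JSON parser: it only handles the \uXXXX escapes and assumes the rest of the string has already been unescaped):

      import re

      _U = re.compile(r"\\u([0-9a-fA-F]{4})\\u([0-9a-fA-F]{4})|\\u([0-9a-fA-F]{4})")

      def json_escapes_to_utf8(s: str) -> bytes:
          def repl(m):
              if m.group(3) is not None:                      # a lone \uXXXX escape
                  return chr(int(m.group(3), 16))
              hi, lo = int(m.group(1), 16), int(m.group(2), 16)
              if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
                  # combine the surrogate pair into a single scalar value
                  return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
              return chr(hi) + chr(lo)                        # two unrelated BMP escapes
          # an unpaired surrogate makes the final encode() raise, which seems reasonable
          return _U.sub(repl, s).encode("utf-8")

      print(json_escapes_to_utf8(r"\ud83d\ude00"))            # b'\xf0\x9f\x98\x80', i.e. 😀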

      1. 2

        This linked article is also great:

        Summary:

        We’ve seen four different lengths so far:

        • 17 UTF-8 code units - Python 2, Rust, and Go
        • 7 UTF-16 code units - JavaScript
        • 5 UTF-32 code units / Unicode Scalars - Python 3, bash
        • 1 extended grapheme cluster, which doesn’t have a fixed definition – Swift

        So for https://www.oilshell.org, ${#s} evaluates to 5 following bash, and I agree that’s problematic / not useful.

        But len(s) is just 17, following Python 2, Rust, Go. And then you can have libraries to do the other calculations.
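
        For reference, all four counts are easy to reproduce from Python (the grapheme count uses the third-party regex module’s \X, so that last number depends on the Unicode tables it ships with):

        import regex  # third-party; \X matches an extended grapheme cluster

        s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # 🤦🏼‍♂️
        print(len(s.encode("utf-8")))                  # 17 UTF-8 code units (bytes)
        print(len(s.encode("utf-16-le")) // 2)         # 7 UTF-16 code units
        print(len(s))                                  # 5 code points / scalar values
        print(len(regex.findall(r"\X", s)))            # 1 extended grapheme cluster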

        1. 8

          I’ve had a draft post sitting in a folder for years that’s titled “yes it is wrong” and makes the argument that for many people who just need to work with text, code units are the least useful possible abstraction to use.

          • They come with at least all the downsides of code points (you can accidentally cut in the middle of a grapheme cluster)
          • Plus the downside of being able to cut in the middle of a code point and produce something that’s no longer legal Unicode
          • Plus if it’s UTF-16 code units, as in quite a few languages, it’s coupled to the weird historical quirks of that encoding, including things like surrogate pairs

          And in general, I think high-level languages should not be exposing bytes or bytes-oriented abstractions for text to begin with. There are cases for doing it in low-level languages where converting to a more useful abstraction might have too much overhead for the desired performance characteristics, but for high-level languages it should be code points or graphemes, period.
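
          To make the first two bullets concrete (a small Python illustration, with byte strings standing in for code-unit-level access):

          b = "naïve".encode("utf-8")                     # 'ï' is two bytes in UTF-8
          print(b[:3])                                    # b'na\xc3' -- cut inside a code point
          print(b[:3].decode("utf-8", errors="replace"))  # 'na\ufffd' -- no longer well-formed text

          s = "e\u0301"                                   # 'e' + combining acute accent: 'é'
          print(s[:1])                                    # 'e' -- code point slicing still split the cluster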

          And don’t get me started on the people who think operations like length and iteration should just be outright forbidden. Those operations are too necessary and useful for huge numbers of real-world tasks (like, say, basically all input validation in every web app ever, since all data arriving to a web application is initially in string form).

          (and one final note – I think 5 is a perfectly cromulent length to return for that emoji, because it makes people learn how emoji work, just as a flag emoji should return a length > 1 to show people that it’s really composed from a country code, etc.)

          1. 4

            basically all input validation in every web app ever, since all data arriving to a web application is initially in string form

            Why does input validation care about grapheme length? It seems like the most irrelevant type of length.

            1. 6

              Graphemes are the closest thing to how humans (or at least the majority of humans who are not Unicode experts) normally/“naturally”/informally think about text. So in an ideal world, the most humane user interface would state “max length n” and mean n grapheme clusters.

              For some Western European scripts, code points also get kinda-sorta-close-ish to that, and are already the atomic units of Unicode, so also are an acceptable way to do things (and in fact some web standards are defined in terms of code point length and code point indexing, while others unfortunately have JavaScript baggage and are instead defined in terms of UTF-16 code units).
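
              For example, a “max 10 characters” rule gives three different answers depending on the unit you validate with (Python sketch; the grapheme count again relies on the third-party regex module):

              import regex

              name = "👩\u200D👩\u200D👦\u200D👦 family"   # something a user might actually type
              print(len(regex.findall(r"\X", name)))      # 8 graphemes     -> passes "max 10"
              print(len(name))                            # 14 code points  -> fails
              print(len(name.encode("utf-16-le")) // 2)   # 18 UTF-16 units -> fails badly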

              1. 3

                But why do you have a max length validation that is close to anything people would count themselves at all? In my experience, there are two kinds of max length requirements: 1) dictated by outside factors, where you do what they do regardless of what you think about it, or 2) to prevent people from, like, uploading movies as their username, in which case you can set it to something unreasonably high for the context (e.g. 1 kilobyte for a name) and then the exact details aren’t super important anyway.

                1. 1

                  But why do you have a max length validation that is close to anything people would count themselves at all?

                  Ask Twitter, which in the old days had a surprisingly complex way to account for the meaning of “140 characters”.

                  But more seriously: there are tons of data types out there in the wild with validation rules that involve length, indexing, etc., and it is more humane – when possible – to perform that validation using the same way of thinking about text as the user.

                  If you can’t see why “enter up to 10 characters” and then erroring and saying “haha actually we meant this weird tech-geek thing you’ve never heard of” is a bad experience for the user, I just don’t know what to say. And doing byte-oriented or code-unit-oriented validation leads to that experience much more often.

                  The real go-to example is the old meme of “deleting” the family members one by one when backspacing on an emoji. That shows the disconnect and also shows really bad UX.

          2. 3

            I completely agree. To me, it’s about providing a string abstraction with a narrow waist. If you expose code units, then you are favouring a single encoding for no good reason. I want the string interface to make it easy for things above and below it.

            Implementations of the interface should be able to optimise for storage space, for seek speed, for insertion speed, or any other constraint (or combination of constraints) that applies. Nothing using the interface should see anything other than performance changes when you change this representation. Exposing Unicode code points is a good idea here because every encoding has a simple transformation to and from code points. If you expose something like UTF-16 code units then a UTF-8 encoding has to expand and then recompress, which is more work.

            Iteration interfaces should not assume contiguous storage (or even that consumers have raw access to the underlying storage), but they should permit fast paths where this is possible. ICU’s UText is a good example of doing this well (and would be better with a type system that allowed monomorphisation): it gives access to contiguous buffers, which might be storage provided by the caller (and can be on the stack), plus callbacks to update the buffer. If your data is a single contiguous chunk, it’s very fast; if it isn’t, it gracefully decays.

            Similarly, it should be possible to build abstractions like grapheme clusters, Unicode word breaking, and so on, on top without these implementations knowing anything about the underlying storage. If you need them, they should be available even for your custom string type.
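
            Something like this, perhaps (a toy Python sketch with hypothetical names, just to show the shape of the interface): consumers only ever see code points, so the storage encoding can change freely underneath.

            from abc import ABC, abstractmethod
            from typing import Iterator

            class CodePointString(ABC):
                """The narrow waist: iteration over Unicode code points, nothing else."""
                @abstractmethod
                def code_points(self) -> Iterator[int]: ...

            class Utf8String(CodePointString):
                def __init__(self, data: bytes):
                    self._data = data                       # stored as UTF-8
                def code_points(self) -> Iterator[int]:
                    return (ord(c) for c in self._data.decode("utf-8"))

            class Utf16String(CodePointString):
                def __init__(self, data: bytes):
                    self._data = data                       # stored as UTF-16-LE
                def code_points(self) -> Iterator[int]:
                    return (ord(c) for c in self._data.decode("utf-16-le"))

            # Two different storage encodings, indistinguishable through the interface:
            for s in (Utf8String("héllo".encode("utf-8")),
                      Utf16String("héllo".encode("utf-16-le"))):
                assert list(s.code_points()) == [ord(c) for c in "héllo"]

            Grapheme segmentation, word breaking, and so on can then be written once against code_points() without knowing anything about the storage.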

            1. 4

              UTF-8 is the only encoding that will give you a narrow waist :) That was one of the points of this blog post and its hourglass diagrams:

              https://www.oilshell.org/blog/2022/02/diagrams.html#bytes-flat-files

              Just to clarify since I quoted that article, I would say that for a shell, it IS wrong that len(facepalm) is 7, but it’s ALSO wrong that len(facepalm) is 5.

              For a shell, it should be 17, and of course you can have other functions on top of UTF-8 to compute what you want. You can transform UTF-8 to an array of code points.


              One main reason is that the Unix file system and APIs traffic in bytes, and other encodings are not single encodings – they have a notion of byte order.

              There is no room for a BOM in most of the APIs.

              There is also no room for encoding errors. For example, in bash the string length ${#s} is a non-monotonic function of byte length because it gives nonsense values for invalid Unicode. Basically you can have a string of 5 chars, and add 3 invalid bytes to it, and get 8. Then when you add the 4th valid byte, the length becomes 6.
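
              The mechanism is easy to reproduce outside of bash (Python sketch; each invalid byte is counted as one “character”, which is effectively what a locale-aware ${#s} ends up doing):

              prefix = "hello".encode("utf-8")               # 5 valid characters
              partial = b"\xf0\x9f\xa4"                      # first 3 bytes of the 4-byte U+1F926

              def char_count(b: bytes) -> int:
                  # one "character" per undecodable byte, plus one per decoded code point
                  return len(b.decode("utf-8", errors="surrogateescape"))

              print(char_count(prefix + partial))            # 8: 5 chars + 3 bytes of garbage
              print(char_count(prefix + partial + b"\xa6"))  # 6: the 4th byte completes the code point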

              If your language is meant to deal with HTTP which has a notion of default encoding and encoding metadata, a different encoding may be better. But Unix simply lacks that metadata, and thus all implementations that attempt to be encoding-aware are full of bugs.


              The other huge reason is that UTF-8 is backward compatible with ASCII, and UTF-16, UTF-32 aren’t. It gracefully degrades.


              A perfect example of this is that a Japanese developer of Oil just hit a bug in Oil’s Python 2 scripts, with LANG=*utf8, because Python 2’s default encoding is ASCII. The underlying cause was GNU $(date) producing a unicode character in Japanese.

              Python 3 doesn’t fix these things – it took 6 years to choose a default file system encoding for Python 3 on all platforms, which is because it’s inherently an incoherent concept. You can have a single file system with 2 files with 2 different encodings, or 10 files with 10 different encodings.

              So the default behavior should be to pass through bytes, not guess an encoding and then encode/decode EVERY string. That’s also slow for many applications, like a build system.


              tl;dr A narrow waist provides interoperability, and UTF-8 is more interoperable than UTF-16 or UTF-32. Unix has no space for encoding metadata. Go-style Unicode works better for a shell than Python style (either Python 2 or 3).

              1. 3

                UTF-8 is the only encoding that will give you a narrow waist

                I think you’re missing my point. If the encoding is exposed, you’ve built a leaky abstraction. For a shell, you are constrained by whatever other programs produce (and that will typically be controlled by the current locale, though these days usually UTF-8, at least for folks using alphabetic languages), but that’s a special case.

                1. 2

                  There is no abstraction, so it’s not a leaky one. Shell is not really about building “high” abstractions, but seamless composition and interoperability.

                  The leaky abstraction is the array of code points! The leaks occur because you have the problems of “where do I declare the encoding?” and “what do I do with decode errors?”, when 99% of shell programs simply pass through data like this:

                  ls *.py
                  

                  The shell reads directly from glob() and sends it to execve(). No notion of encoding is necessary. It’s also pointless to encode and decode there – it would simply introduce bugs where there were none.
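
                  The same byte passthrough is possible even from Python if you stay out of the str type (a sketch; Unix only, since bytes arguments to subprocess are a POSIX-only feature):

                  import os, subprocess

                  names = [n for n in os.listdir(b".") if n.endswith(b".py")]
                  subprocess.run([b"ls", b"--", *names])   # the bytes reach execve() undecoded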

                  Are you arguing for a Windows style or Python style where the file system returns unicode characters? I’d be very surprised by that, but I don’t have time to get into it here

                  Maybe in a blog post – the two above illuminated the issues very thoroughly.

          3. 1

            The article explains why it’s not practical to expose something that’s not bytes, UTF-16 code units, or UTF-32 code units / scalars:

            Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program

            So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!

            Basically because it’s not really a property of the language, but of the operating system.

            So I’ll flip it around and say: instead of publishing why “it is wrong”, publish what’s right and we’ll critique it :-)

            1. 4

              I already said:

              for high-level languages it should be code points or graphemes, period.

              (and of the two I prefer code points as the base abstraction, but would not object to exposing a grapheme-oriented API on top of that)

              1. 1

                I strongly disagree; read this series of 7 blog posts by Victor Stinner to see what problems this choice caused Python:

                https://vstinner.github.io/painful-history-python-filesystem-encoding.html

                Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python filesystem encoding changed multiple times. It took 6 years to choose the best Python filesystem encoding on each platform.

                (This is because the file system encoding is an incoherent concept, at least on Unix. libc’s LANG=/LC_CTYPE= mechanism is also an incoherent design.)

                https://vstinner.github.io/python37-new-utf8-mode.html:

                This article tells the story of my PEP 540: Add a new UTF-8 Mode which adds an opt-in option to “use UTF-8” everywhere. Moreover, the UTF-8 Mode is enabled by the POSIX locale: Python 3.7 now uses UTF-8 for the POSIX locale.

                Stinner is a hero for this …

                The only reasonable argument for Python’s design is “Windows”, and historically that made sense, but I think even Windows is moving to UTF-8.

                The number of bugs produced is simply staggering, not to mention the entire Python 2 to 3 transition.

                A Go-style UTF-8 design is not just more efficient, but has fewer bugs. Based on my experience, the kinds of bugs you listed are theoretical, while there are dozens of real bugs caused by the leaky abstraction of an array of code points, which real applications essentially never use. LIBRARIES (for things like case folding) use code points; applications generally don’t. Display libraries that talk to the OS use graphemes; applications generally don’t.

                1. 2

                  It is difficult to reconcile your praise for “Go’s UTF-8 design” with your leaning so heavily on the specific criticism of handling filesystem encoding.

                  Assuming that everything will be, or can be treated as, UTF-8 is incredibly dangerous and effectively invites huge numbers of bugs and broken behaviors. This is why Rust – which otherwise is a “just use UTF-8” language – has a completely separate type for representing system-native “strings” – like filesystem paths – which might well turn out to be undecodable junk.

                  Also, I know of you primarily as someone who works on a shell, and so to you Python 3 probably did feel like a big step backwards, since I’m sure Python 2 felt a lot easier to you.

                  But to me the old Python 2 way was the bad old broken Unix way. Which is to say, completely and utterly unsafe and unfit for any purpose other than maybe some types of small toy scripts that will only ever exist in the closed ecosystem of their developer’s machine(s).

                  Basically everybody in the world outside of the niche of Unix-y scripting, as the very first thing they would do on picking up Python 2, had to go laboriously re-invent proper text handling and encoding/decoding (the “Unicode sandwich” model, as we used to call it), and live in fear that they might have made a mistake somewhere and would have a pager go off at 2AM because a byte “string” had accidentally gotten through all the defenses.

                  This was, if my tone hasn’t made it abundantly clear already, not a pleasant way to have to do things, and Python 2 only got to be pleasant (or somewhat pleasant) for many users because smart and patient people did all the tedious work of dealing with its busted broken approach to strings.

                  Python 3 in comparison is a massive step forward and a huge breath of fresh air. Does it probably make life seem more complex for you, personally, and for other people like you? Sure, though I’d argue that it only ever felt simpler in the past because you were operating in a niche where the complexity didn’t rise up and bite you as often or as obviously as the way it did me and people working in other domains, in large part due to historical Unix traditions mostly avoiding doing things with text beyond ASCII or, in a few favored places, perhaps latin-1.

                  But even if it did just make your life outright more difficult, I think it would be a worthwhile tradeoff; you’re dealing with a domain that really is complex and difficult, and it should be on you to find solutions for that complexity and difficulty, not on everyone else to deal with a language and an approach to strings that makes our lives vastly more difficult in order to pretend to make yours a bit easier. The old way, at best, kinda sorta swept the complexity (for you, not for me) under the rug.

                  1. 2

                    No, the idea is not assuming everything is UTF-8. If you get an HTTP request that declares UTF-16, then you convert it to UTF-8, which is trivial with a library.

                    When you have a channel that doesn’t declare an encoding, you can treat it as bytes, and UTF-8 text will work transparently with operations on bytes. You can search for a / or a : within bytes or UTF-8; it doesn’t matter. You don’t need to know the encoding.
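
                    That property is worth spelling out: UTF-8 never reuses ASCII byte values inside a multi-byte sequence, so searching for an ASCII delimiter in the raw bytes is safe. A quick Python check (illustrative only):

                    path = "répertoire/фото.jpg".encode("utf-8")
                    print(path.rindex(b"/"))   # byte offset of the last '/', found without decoding
                    print(path.split(b"/"))    # no false hits inside the multi-byte characters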

                    So the idea is that UTF-8 is the native representation, and you can either convert or not convert at the edges. The conversion logic belongs in the app / framework / libraries, not in the language and standard I/O itself.

                    That is how UTF-8 is designed; I don’t think you’re aware of that.

                    We’ll just have to leave this alone, but I don’t believe you’re speaking from experience. Everything you’ve brought up is theoretical (“this is how humans think, what humans want”), while I’ve brought up real bugs.

                    You’re also mistaking the user of the program with the programmer. UTF-8 is easy to understand for programmers, and easy to write libraries for – it’s not dangerous. The idea that UTF-8 is “incredibly dangerous” is ignorant, and you didn’t show any examples of “huge numbers of bugs and broken behaviors”.

                    The reality is the exact opposite – see Stinner’s series of 7 posts: https://vstinner.github.io/python37-new-utf8-mode.html

                    1. 2

                      When you have a channel that doesn’t declare an encoding, you can treat it as bytes, and UTF-8 text will work transparently with operations on bytes. You can search for a / or a : within bytes or UTF-8; it doesn’t matter. You don’t need to know the encoding.

                      Except UTF-8 does not “work transparently” like this, because at best there’s an extremely limited subset of operations you can do that won’t blow up. And that sort of sinks the entire idea that this thing is actually a “string” or “text” – if you’re going to call it that, people will expect to be able to do more operations than just “did this byte value occur somewhere in there”.

                      So the idea is that UTF-8 is the native representation, and you can either convert or not convert at the edges. The conversion logic belongs in the app / framework / libraries, not in the language and standard I/O itself.

                      Inside a program in a high-level language, the number of cases in which you should be working directly with a byte array or code-unit array and calling it “text” or “string” rounds to zero. At most, you could argue that Python should have gone the Rust route with a separate and explicitly not-text type for filesystem paths. But that still wouldn’t be an argument for having the string type be something that’s not a string (and byte arrays are not strings and code-unit arrays are not strings, no matter how much someone might want them to be).

                      That is how UTF-8 is designed; I don’t think you’re aware of that.

                      We’ll just have to leave this alone, but I don’t believe you’re speaking from experience. Everything you’ve brought up is theoretical (“this is how humans think, what humans want”), while I’ve brought up real bugs.

                      This is uncharitable to such an extreme degree that it is effectively just an insult.

                      But: Python 2’s “strings” were awful. They were unfit for purpose. They were a constant source of actual real genuine bugs. Claiming that they weren’t, or that Python 3 didn’t solve huge swathes of that all in one go by introducing a distinction between actual real strings and mere byte sequences, makes no sense to me because I and many others lived through that awfulness.

                      And recall that Python 3 – like Python 2 before it – initially did adopt code units as the atomic unit of its Unicode strings. And that still was a source of messiness and bugs, since it depended on compile-time flags to the interpreter and meant some code simply wasn’t portable between installations of Python, even of the exact same major.minor.patch version. It wasn’t until Python 3.3 that this changed and the Python string abstraction actually became a sequence of code points.

                      So if anyone needs to go re-read some Python history it’s you, since you seem to be thinking that something which wasn’t the case until Python 3.3 (string as sequence of code points) was responsible for trouble that dated back to prior versions (like I said, your argument really is not about the atomic unit of the string abstraction, it’s about whether filesystem paths ought to be handled as strings).

        2. 2

          5 (code points) is sensible in exactly one case: communicating offsets for string replacement/annotation/formatting in a language-, Unicode-version-, and encoding-agnostic way.

        3. 1

          Although if you convert “🤦🏼‍♂️” to []rune in Go, the length is 1 (which makes sense to me.)

          1. 2

            Did you try it? It should be 5 runes because a rune is a codepoint.

            1. 1

              I did try it but now I’ve tried it again, it’s 5 as you say. Which makes me wonder what I did wrong in the first attempt!

              1. 2

                Maybe you had a plain farmer and not a black, male farmer.

        4. 1

          I’m away from a computer, but I wonder what Perl’s take on this is? Perl has always had great support for Unicode.

    3. 2

      Nobody calls them “scalars”; everyone calls them code points!

      1. 1

        Scalar means a code point that’s not in the surrogate range – i.e. it’s an actual valid character

        So the distinction is useful

        I learned that from the Unicode glossary link in this post
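
        For example (a quick Python check), U+D800 is a code point but not a scalar value, so it can’t be encoded as well-formed UTF-8:

        lone_surrogate = chr(0xD800)        # a code point, but not a Unicode scalar value
        try:
            lone_surrogate.encode("utf-8")
        except UnicodeEncodeError as err:
            print(err)                      # ... surrogates not allowed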

        1. 0

          Surrogates were a mistake and I won’t personally amend the already complicated Unicode terminology to suit this historical baggage.