1. 26
    1. 11

      Stringref is an extremely thoughtful proposal for strings in WebAssembly. It’s surprising, in a way, how thoughtful one need be about strings.

      Here is an aside; I promise it’ll be relevant. I once visited Gerry Sussman in his office. He was very busy preparing for a class, and I was surprised to see that he was preparing his slides on old-school overhead projector transparencies. “It’s because I hate computers,” he said, and complained about how he could design a computer from top to bottom, including all its operating system components, but found any program that wasn’t Emacs or a terminal frustrating, difficult, and unintuitive to use (picking up and dropping his mouse to dramatic effect).

      And he said another thing, with a sigh, which has stuck with me: “Strings aren’t strings anymore.”

      If you lived through the Python 2 to Python 3 transition, you’ll recognize the motivation to redesign strings as a very thoughtful thing, separate from “bytestrings”, as Python 3 did. That goes double if you lived through the world of using Python 2, where most of the applications you worked with were (with an anglophone-centric bias) probably just using ASCII, and then suddenly hit Unicode errors all the time as you built internationally viable applications. Python 2 to Python 3 may have been a painful transition, but dealing with text in Python 3 is mountains better than it was before.

      The WebAssembly world has not, as a whole, learned this lesson yet. This will probably start to change soon as more and more higher level languages start to enter the world thanks to WASM GC landing, but for right now the thinking about strings for most of the world is very C-brained, very Python 2. Stringref recognizes that if WASM is going to be the universal VM it hopes to be, strings are one of the things that need to be designed very thoughtfully, both for the future we want and for the present we have to live in (ugh, all that UTF-16 surrogate pair pain!). Perhaps it is too early or too beautiful for this world. I hope it gets a good chance.

      1. 4

        But Python 3 also got strings wrong, and this article explicitly agrees!! Perhaps counterintuitively, a UTF-8 internal representation is more efficient and universal than the array of code points.

        Basically, modern languages like Go and Rust got it right; Java and JS are still infected by Windows; and Python 3 and Guile carry a large amount of complexity they don’t need, which lowers performance. The article mentions moving Guile toward UTF-8.

        This is what I’ve been saying across these comment threads, and my own blog post, but there’s a surprising amount of pushback:

        https://lobste.rs/s/ql8goe/how_create_utf_16_surrogate_pair_by_hand

        https://lobste.rs/s/gqh9tt/why_does_farmer_emoji_have_length_7

        I didn’t see PyPy mentioned in this post - unlike CPython, it has a UTF-8 representation that implements the array-of-code-points API in a compatible way.

        So I think the obvious implementation of stringref is to do that, but with two optional side tables - UTF-16 and UTF-32. I will write a bit more about this later!

        And btw, I agree this is an excellent post. A bit dense, but all the facts and conclusions are on point.

        Also related to the recent discussion of why “universal VM” is a difficult/flawed concept - https://lobste.rs/s/v5ferx/scheme_browser_hoot_tale#c_n87bzw

        I do think some sort of universal string type is inevitable for performance, if WASM wants to support the common languages that people seem to want it to

        1. 5

          Yeah, I think if I were designing a language from scratch strings would be immutable byte sequences with methods that return different views of the byte slice as a sequence of bytes, codepoints, or grapheme clusters. I think Swift either does this or is close. 99% of the time, a view into bytes is all you need for parsing and whatnot, and you can just treat natural language text as opaque blobs.
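
          A rough Rust sketch of that layered-views shape, assuming the third-party unicode-segmentation crate for the grapheme layer (the counts in the comment are for this particular string):

          use unicode_segmentation::UnicodeSegmentation; // third-party crate
          fn main() {
              // Farmer emoji (person + ZWJ + sheaf of rice) followed by '!'.
              let s = "\u{1F9D1}\u{200D}\u{1F33E}!";
              let bytes = s.as_bytes();                               // byte view
              let codepoints: Vec<char> = s.chars().collect();        // code point view
              let graphemes: Vec<&str> = s.graphemes(true).collect(); // grapheme cluster view
              // Prints "12 bytes, 4 code points, 2 grapheme clusters".
              println!("{} bytes, {} code points, {} grapheme clusters",
                       bytes.len(), codepoints.len(), graphemes.len());
          }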

          1. 2

            You’re right that Python 3 also got strings wrong. But it’s an experience people are very familiar with, and still carries the point of my post, even if it’s true that they didn’t get it “right” :)

          2. 2

            My biggest complaint about Unicode is that it doesn’t allow full round-trip conversion of binary data between the various transformation formats, and instead requires implementations to corrupt the data they are given with replacement characters, or to reject the data and refuse to operate. As a result, Python and Rust (and others I am less familiar with) have had to come up with various nonstandard contortions to cope with situations where it’s reasonable to expect UTF-8 but not guaranteed.

            In practice there is full round-trip conversion between WTF-16 (16-bit binary data) and WTF-8, but there is no 8-bit equivalent of bare surrogates that would allow a program to accept non-UTF-8 bytes and losslessly process them within the Unicode framework. If such a thing did exist then it would be easier to do the kind of shape-shifting string representation that Andy Wingo is talking about - the change of representation can happen quietly under the hood, with no need to report failure or risk of data loss. But note that Python’s surrogateescape is not the right way to do this, because it conflicts with WTF-8 and WTF-16; the right solution needs 128 newly allocated codepoints.
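
            To make the “corrupt with replacement characters or reject” options above concrete, here is a small Rust sketch (the byte values are made up):

            fn main() {
                // Mostly text, but with one byte that is not valid UTF-8.
                let raw: &[u8] = b"log:\xFEdata";
                // Option 1: reject.
                assert!(std::str::from_utf8(raw).is_err());
                // Option 2: corrupt. The 0xFE byte becomes U+FFFD and its original
                // value is gone, so converting back to bytes does not round-trip.
                let lossy = String::from_utf8_lossy(raw);
                assert_eq!(lossy, "log:\u{FFFD}data");
                assert_ne!(lossy.as_bytes(), raw);
            }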

            1. 8

              Why do people keep insisting that you have to be able to pass arbitrary binary data through the channels explicitly designed for correctly-encoded text?

              1. 3

                Because they want to be able to treat data as text when it comes from channels that do not guarantee the encoding.

                Even in channels with explicit charset labels, mojibake exists. Even in systems that were supposed to be UTF-16, WTF exists. It’s wishful thinking to design systems that can’t cope with the way data is used in practice.

                1. 5

                  But the reality of systems is exactly why the separation exists! You want to encode facts like “this is a valid Unicode string” into the type system.

                  I maintain a library, camino, which provides path types that carry a type-level assertion that they’re valid Unicode (as most, but not all, paths in practice are). The Path/Utf8Path separation is very valuable when you’re thinking about which kinds of paths your program should support.
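
                  For concreteness, here is roughly what that type-level split looks like with camino (a sketch, assuming its Utf8PathBuf::from_path_buf constructor):

                  use camino::Utf8PathBuf;
                  use std::path::PathBuf;
                  fn load(raw: PathBuf) {
                      // The fallible conversion is the type-level assertion: past this
                      // point the path is statically known to be valid UTF-8.
                      match Utf8PathBuf::from_path_buf(raw) {
                          Ok(utf8_path) => println!("loading {utf8_path}"),
                          Err(non_utf8) => eprintln!("refusing non-UTF-8 path: {}", non_utf8.display()),
                      }
                  }

                  The Err branch hands back the original PathBuf untouched, so a program that later decides to support non-UTF-8 paths hasn’t lost any data.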

                  1. 5

                    Consider this shell program:

                    ls *.py
                    

                    The shell has to make these calls:

                    • glob('*.py')
                    • execve() on the results of glob().

                    Note that the program is correct no matter what the encoding of the filenames is. Shells do not do any decoding or encoding of filenames – they are opaque bytes.

                    In other words, it doesn’t matter what the encoding is. And you can’t know – there’s literally no way to tell.

                    You can have one filename encoded in UTF-8 and another encoded in UTF-16 on the same file system, or even in the same directory.

                    So if your programming language is forcing encoding/decoding at all boundaries, then it’s creating errors in programs where there were none.
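
                    For example, here is a Unix-only Rust sketch of the glob-then-exec flow above, with the filenames kept as opaque bytes the whole way and no decoding step anywhere:

                    use std::ffi::OsString;
                    use std::os::unix::ffi::OsStrExt;
                    use std::process::Command;
                    use std::{fs, io};
                    fn main() -> io::Result<()> {
                        let mut matches: Vec<OsString> = Vec::new();
                        for entry in fs::read_dir(".")? {
                            let name = entry?.file_name();          // OsString: never validated as UTF-8
                            if name.as_bytes().ends_with(b".py") {  // suffix match on raw bytes
                                matches.push(name);
                            }
                        }
                        // Hand the untouched names straight to the child process, like a shell does.
                        Command::new("ls").args(&matches).status()?;
                        Ok(())
                    }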


                    It’s also bad for performance to do all the round-tripping. This isn’t theoretical – the JVM string support is a limiting factor in the performance of Bazel.

                    Also I think some of the Rust build systems (Buck2?) bypass Rust’s string type for this reason.

                    ripgrep apparently has its own string type which may or may not be valid UTF-8, also bypassing the default string types.


                    It’s definitely good for programming languages and APIs to help with data encoding problems. The problem is when they force you to go through their APIs.

                    Then they’re taking responsibility for a problem that they cannot solve correctly. It’s a problem that depends on the program being written.

                    1. 6

                      But Path, the type representing a file path in the std, is not UTF-8 in Rust. It is OS-specific: e.g. WTF-16 on Windows and “raw bytes” on Linux.

                      Applications that handle paths that may or may not be valid UTF-8 are expected to use the Path type, so as to respect the native OS representation.

                      Camino is for paths that are controlled by the application, and so known to be UTF-8, and that want the added convenience of transparent conversion to native Rust strings.

                      1. 3

                        Sure. If it’s expected from the source that some data isn’t text, then don’t treat it as text. Bytes are bytes. Some bytes may look like text to a user, but that is a concern for the presentation layer.

                        1. 2

                          Rust does not force encoding and decoding at boundaries—if you stay in OsString land, you won’t go through any further validation.

                          You have to write non-UTF8-path supporting code carefully, but that’s always been the case (see Python 2 to 3 conversion). Rust just brings these considerations to the forefront.

                          Some applications need their own path and string types. That’s fine too.

                  2. 3

                    I don’t think I would call what Rust does contorting. The solution in Rust is to present the data as CString or OsString.

                    A CString is simply a Vec<u8> with no interior NULs and a NUL at the end; there is no guarantee about it being Unicode or UTF-8 or anything specific. It COULD be a UTF-8 string, but there is no way to know.

                    The OsString is closer to the complaint of the article: it is a textual string in the native platform representation. If you were running on a JavaScript platform, this would be UTF-16 with unpaired surrogates (WTF-16); on Windows it’s UTF-16. But Rust won’t let you play with them as ordinary strings without first validating that they’re valid Unicode.

                    OsString being its own type means that if the data is just passing through, there is no need to check whether it’s valid UTF-8. I.e., you ask for a directory listing on Windows and then open a file; there is no requirement to verify anything, because Windows gave you a filename it considered valid, so you can open it.

                    You can even do some basic operations (joining OsStrings and indexing into them), which basically eliminates quite a few cases where, in other languages, you’d need to convert to the internal string format first.

                    This essentially solves the roundtrip problem.
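
                    A small Rust sketch of that pass-through property (a made-up backup example):

                    use std::{fs, io};
                    fn back_up_everything(dir: &str) -> io::Result<()> {
                        for entry in fs::read_dir(dir)? {
                            let entry = entry?;
                            let mut backup_name = entry.file_name(); // OsString straight from the OS
                            backup_name.push(".bak");                // joined without any re-encoding
                            // Source and destination stay in the platform representation; nothing
                            // here ever asks whether the bytes are valid Unicode.
                            fs::copy(entry.path(), entry.path().with_file_name(&backup_name))?;
                        }
                        Ok(())
                    }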

                    1. 3

                      You missed out Path and friends.

                      The contortion is that Rust has five times the number of string types that it needs (or maybe three times), with fallible conversions that would be infallible if Unicode did not require lossage.

                      There are places where Rust fails to round-trip, most obviously if you need to report an error involving an OsString or Path.

                      So I think if Unicode were fixed, Rust could be much simpler and would have fewer nasty little corners where there are cracks in its string handling design that are “essentially” papered over.

                      1. 3

                        I think it is very valuable to have a type-level assertion that a string is valid Unicode (not arbitrary binary data, but actual real Unicode). I’m not sure how that can be done without a String/OsString/Vec like separation.

                        1. 1

                          The type-level assertion in Rust is about UTF-8 rather than Unicode as a whole. What problems would that type-level assertion help you with if the definition of UTF-8 were extended so there was no such thing as invalid UTF-8?

                          1. 1

                            Being able to distinguish invalid/corrupted data from valid data (even if it’s not something you can/care to interpret) is pretty useful. Lots of “can’t happen” scenarios in my life get turned up when something fails to parse, whether it’s a UTF-8 string or a network protobuf message or whatever. Otherwise you’d just get corrupted data and not be able to tell where in the chain it happens or what it might be intended to be.

                            UTF-8 is just a nice way to squeeze 21-bit numbers down into fewer bits when most of the numbers are small. If you have no such thing as invalid UTF-8, then you need some other nice compression scheme.
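
                            For instance, a quick Rust sketch of that squeezing:

                            fn main() {
                                let mut buf = [0u8; 4];
                                // ASCII code points stay at one byte each...
                                assert_eq!('a'.encode_utf8(&mut buf).as_bytes(), [0x61]);
                                // ...while a bigger number like U+20AC costs three bytes.
                                assert_eq!('€'.encode_utf8(&mut buf).as_bytes(), [0xE2, 0x82, 0xAC]);
                            }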

                            1. 1

                              I think a standard like UTF-8 should be designed to work well at the lowest layers where data transparency is often needed. Of course higher-level code will need to parse the data so it can do more specific unicode processing, but the cost shouldn’t be imposed where it isn’t needed.

                              What I want does not change the interpretation of anything that is currently valid UTF-8. It only changes the interpretation of invalid bytes, and 128 newly-special codepoints. https://dotat.at/@/2011-04-12-unicode-and-binary-data.html

                              1. 1

                                I think there’s space for something like that, but honestly that’s too much of a niche case and shouldn’t be the default encoding for strings.

                        2. 2

                          Sorry to interrupt the rant, but I’m trying to figure out precisely what the rant is about. Could someone explain using small words what it would mean to “fix” Unicode?

                          To keep things concrete: as things are, an OsString, like a Windows filename, considers more byte sequences to be valid than UTF-8 does. For example:

                          the byte 0xfe is invalid UTF-8 (but a valid OsString)
                          bytes with a null byte in the middle are invalid UTF-8 and an invalid OsString
                          

                          (https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt)

                          The complaint seems to be that you can’t always convert an OsString into (say) UTF-8 and then back. Which makes sense to me, because in that case the OsString doesn’t have UTF-8 in it; in fact it likely doesn’t contain text at all. So some questions:

                          1. Most importantly, what would it mean to “fix” Unicode? Is the idea that UTF-8 should accept all byte sequences as valid? Or some new format? Are we supposed to tell the Unicode Consortium that not only do they have to create a format for all textual communication between humans in all languages no matter how stupid those languages’ writing rules are, but that it must also, to satisfy some programmers, accept all byte strings as valid?
                          2. What’s the use case for converting OsString -> UTF-8 -> OsString? Does this use case assume that the OsString contains text? If not, why is it being converted to a specific textual format in the middle, why not keep it as an OsString?
                          3. How about null bytes in the middle of a byte string? That’s invalid OsString as well as invalid UTF-8. Should that also be accepted, to allow even more byte strings to “round trip”?
                          1. 1

                            If a string isn’t UTF-8 it might just be really old.

                            Most importantly, what would it mean to “fix” Unicode? Is the idea that UTF-8 should accept all byte sequences as valid?

                            Right, like WTF-16 does for u16 sequences. WTF-16 is much more common in practice than UTF-16, because none of the systems that adopted UCS2 (Java, EcmaScript, Windows) enforced the requirement that surrogates must be paired when they moved to UTF-16. So if that’s OK, something similar for UTF-8 should be OK too. As I said, it only needs 128 codepoints and a tweak to UTF-8.
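
                            For example, here is what an unpaired surrogate does to the strict and lossy conversions in Rust (a sketch; the u16 values are arbitrary):

                            fn main() {
                                // 'h', an unpaired high surrogate, 'i': valid WTF-16, invalid UTF-16.
                                let wtf16: &[u16] = &[0x0068, 0xD800, 0x0069];
                                // The strict conversion refuses it...
                                assert!(String::from_utf16(wtf16).is_err());
                                // ...and the lossy one corrupts the surrogate into U+FFFD, so the
                                // original 16-bit value cannot be recovered afterwards.
                                assert_eq!(String::from_utf16_lossy(wtf16), "h\u{FFFD}i");
                            }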

                            What’s the use case for converting OsString -> UTF-8 -> OsString?

                            The usual problem is something like printing an OsString as an error message. Full round-tripping between u8 bytes, u16 words, and unicode codepoints might not be something many programs do, but it is something that will happen in the ecosystem as a whole, and without it there will still be problems of loss of functionality and data.

                            How about null bytes in the middle of a byte string?

                            Unicode doesn’t require them to be corrupted or rejected, so they are outside the scope of my beef. I am happy if converting a String to an OsString is fallible, so long as the reverse is infallible.

                            1. 2

                              Or… we could try not putting non-textual garbage in our strings? Just for once? Please?

                              1. 3

                                I promise I’ll give up the idea of fixing UTF-8 when WTF-16 and WTF-8 have been eliminated.

                      2. 1

                        If I am not mistaken, Emacs solves the round-tripping problem by using 32 bits per character and stuffing invalid bytes into the high bits somehow (see https://github.com/emacs-mirror/emacs/blob/master/src/character.h). I think that’s the right solution for that specific application, since a text editor has to cope with arbitrary files, potentially containing text in multiple encodings at once, or even binary files.

                    2. 3

                      When I read things like this it can kind of get depressing because it feels like we are stuck with our mistakes forever. WASM was supposed to be a fresh start where we could finally get everything right this time, but it turns out that’s naive and unrealistic; the legacy of UTF-16 still haunts us.

                      One thing I don’t understand about this article is that it strongly implies that one of the things that makes it difficult to move off UTF-16 is that supposedly you can do O(1) indexing of strings. But my understanding is that’s only true of the older UCS-2 standard and not UTF-16, since nowadays not every character can fit into 2 bytes.

                      What’s going on here? Is the implication that you get O(1) random access for all strings except the ones that contain higher-plane characters? It seems the “guarantee” of fast random access isn’t actually a guarantee at all, so why is it so important to maintain?

                      1. 2

                        What it’s saying is that existing code treats strings as arrays of u16 code units (note: not 21 bit code points!) so indexing will give you bare surrogates instead of astral plane characters.
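
                        For example (a Rust sketch, though the same arithmetic applies to any UTF-16 API):

                        fn main() {
                            let s = "\u{1F9D1}"; // one code point outside the Basic Multilingual Plane
                            let units: Vec<u16> = s.encode_utf16().collect();
                            // One code point, but two UTF-16 code units: indexing by code unit
                            // hands back a bare surrogate rather than a character.
                            assert_eq!(s.chars().count(), 1);
                            assert_eq!(units, [0xD83E, 0xDDD1]);
                        }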

                        1. 2

                          Sure, I guess what I’m getting at is … why is this considered a useful property? What’s the point of random access by code point when code points don’t correspond to characters? People seem to be bending over backwards to support it and ruling out much simpler technically-superior alternatives because of this one factor which seems useless at first glance.

                          1. 4

                            Basically because that “wrong” API was entrenched in the most popular languages in the world, before we knew better:

                            • JavaScript, Java: random access by UTF-16 code unit
                            • Python: random access by code point
                            • C/C++: a similar mess with wchar_t and more … and it even depends on mutable global variables ($LANG)

                            Yeah it sucks, but if you want WASM to run JS, Java, and Python, then it’s pretty much required. (At least if you want zero-copy inter-op – it’s arguable whether that’s required, but I can see why it’s a goal)

                            1. 3

                              I understand this is the current behavior of said languages.

                              But it sounds like there’s a bunch of people out there saying “oh, sure, you have a new VM for me to run my JS/Java/Python programs, but I’m not going to use it because when I make String Mistakes I need them to be made very fast”?

                              That’s … surprising to me. I’ve been using the JVM as my primary runtime since 2009 or so, and I can’t think of a single situation where I’ve ever wanted to look up the contents of a very large string other than by a regex.

                              Is this a language spec thing? Where you can’t claim to be “100% pure Java” unless you jump thru their bizarre hoops?

                              1. 2

                                But it sounds like there’s a bunch of people out there saying “oh, sure, you have a new VM for me to run my JS/Java/Python programs, but I’m not going to use it because when I make String Mistakes I need them to be made very fast”?

                                ??? Programmers have to use the API they’re given. There are correct programs that rely on random code point access. In fact it’s probably approximately ALL Python 3 programs, given how pervasive strings are.

                                Not sure what you’re suggesting … if Python is going to run on WASM, then the string semantics have to be exactly the same. Nobody is going to port the entire ecosystem over to a new string type – we already learned how that went with Python 2->3 :-)

                                It’s all economic – a big coordination problem, an example of path dependence, etc. It’s cheaper for the change to be in the platform VM than in every Python library and program in existence, etc.

                                1. 3

                                  Edit: I agree that you almost never need random code point access in the algorithmic sense.

                                  But “accidental” examples like this have been discussed, which means that most Python programs probably depend on it:

                                  i = s.find('/')  # linear search
                                  if i != -1:
                                    ch = s[i]  # random access
                                  

                                  That’s a relatively common idiom. Even though you don’t really need random access for the algorithm, the code needs it. So you would have to play some tricks, like PyPy does, to make it fast.

                                  The code is correct though – it’s just using the suboptimal API that Python gives you.

                                  1. 1

                                    OK, I see; the example is helpful. I can see why you might be frustrated if you pulled in some 3rd-party code which did this, since even tho it’s not the ideal way to do it, there’s currently no downside to it in mainline Python.

                            2. 2

                              Note this random access is code units (u16 words) not code points (unicode characters). And the answer is, it wasn’t considered, it’s a historical hangover from when they thought unicode could fit in 16 bits.

                        2. 2

                          Mostly triggered by the sub-linked article about string lengths: I’d argue that, from an API design perspective, we shouldn’t offer any string length API at all. Whatever we provide is too easy to inadvertently misuse.

                          We should provide:

                          • serialization to various encodings for storage and communication (UTF-8, possibly UTF-16, possibly others like Latin-9 with control over what to do with unrepresentable text)
                          • various collation functions defined by Unicode, usually implemented in ICU
                          • iteration of grapheme clusters
                          • string width

                          By “string width” I mean “how much space on the screen does it take to render this string”. This means that it needs to take some context because rendering into a terminal you care about columns and rendering in a GUI you care about pixels or subpixels, and you need to think about fonts. Even in terminals some grapheme clusters are rendered wider than others, so it’s kind of complicated, but you need to get it right.
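
                          Here is a sketch of what that surface could look like in Rust, with grapheme iteration and terminal width coming from the third-party unicode-segmentation and unicode-width crates (both assumed as dependencies):

                          use unicode_segmentation::UnicodeSegmentation; // grapheme clusters
                          use unicode_width::UnicodeWidthStr;            // terminal column width
                          fn main() {
                              let s = "naïve \u{1F9D1}\u{200D}\u{1F33E}"; // "naïve " plus the farmer emoji
                              // Serialization to concrete encodings for storage and communication.
                              let utf8: &[u8] = s.as_bytes();
                              let utf16: Vec<u16> = s.encode_utf16().collect();
                              // Iteration of grapheme clusters, instead of a single "length".
                              let clusters: Vec<&str> = s.graphemes(true).collect();
                              // Terminal column width; a GUI would ask the font/layout engine instead.
                              let columns = s.width();
                              println!("{} UTF-8 bytes, {} UTF-16 units, {} clusters, {} columns",
                                       utf8.len(), utf16.len(), clusters.len(), columns);
                          }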

                          Of course in practice there are millions of programs written in thousands of languages, most of which are either buggy or useless to most of humanity, and which do rely on getting string lengths and indexing into strings to implement their bugs.

                          And if we were to build a popular system with my constraints, all these damn programmers would end up just using Vec<u8> to represent text, because we’re all very fond of rewriting the same bugs again and again. I know I am.