I always cringe a bit when I read things like:

However, the most recent major update of text changed its internal string representation from UTF-16 to UTF-8.
One of the biggest mistakes that a language can make is to have a string representation. Objective-C / OpenStep managed to get this right and I’ve seen large-scale systems doubling their transaction throughput rate by having different string representations for different purposes.
This is particularly odd for a language such as Haskell, which excels at building abstract data types. This post is odd in that it demonstrates an example of the benefits of choosing a string representation for your workload (most of their data is ASCII, stored as UTF-8 to handle the cases where some bits aren’t), yet the entire post is about moving from one global representation to another.
For their use, if most of their data is ASCII, then they could likely get some big performance boosts from having two string representations (see the sketch after this list):
A unicode string stored as UTF-8, with a small (lazily-built - this is Haskell, after all) look-aside structure to identify code points that span multiple code units.
A unicode string stored as ASCII, where every code point is exactly one byte.
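A minimal sketch of that two-representation idea, assuming text ≥ 2.0 and bytestring. SmartText, fromBytes and charAt are hypothetical names rather than an existing library, and the lazily-built look-aside structure for multi-byte code points is left out:

```haskell
{-# LANGUAGE LambdaCase #-}
-- Hypothetical two-representation string: ASCII gets O(1) indexing,
-- everything else falls back to UTF-8 Text. Not a real library.
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

data SmartText
  = Ascii B.ByteString   -- every code point is exactly one byte
  | Utf8  T.Text         -- general case

-- Choose the representation once, when the data enters the system.
fromBytes :: B.ByteString -> Either UnicodeException SmartText
fromBytes bs
  | B.all (< 0x80) bs = Right (Ascii bs)
  | otherwise         = Utf8 <$> TE.decodeUtf8' bs

-- Same operation, very different cost per representation.
charAt :: Int -> SmartText -> Maybe Char
charAt i = \case
  Ascii bs | i >= 0 && i < B.length bs -> Just (toEnum (fromIntegral (B.index bs i)))
  Utf8  t  | i >= 0 && i < T.length t  -> Just (T.index t i)  -- walks code units in text >= 2.0
  _                                    -> Nothing
```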
One of the biggest mistakes that a language can make is to have a string representation.
By this optic, we are in luck! Haskell has ~6 commonly used string types: String, Text, lazy Text, ByteString, lazy ByteString, ShortByteString, and multiple commonly used string builders! /i
I am very happy with the text transition to UTF-8. Conversions from ByteString are now just a UTF-8 validity check and buffer copy and in the other direction a zero-copy wrapper change.
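For what it's worth, the two directions look like this in code. decodeUtf8' and encodeUtf8 are the real Data.Text.Encoding functions; whether a given direction copies or merely re-wraps the buffer is an internal detail of the library, not something this snippet guarantees:

```haskell
-- Round-tripping bytes <-> Text with text >= 2.0: validate on the way in,
-- encode (to UTF-8, which is now also the internal form) on the way out.
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

roundTrip :: B.ByteString -> Either String B.ByteString
roundTrip bytes =
  case TE.decodeUtf8' bytes of                         -- UTF-8 validity check
    Left err  -> Left (show err)
    Right txt -> Right (TE.encodeUtf8 (T.toUpper txt)) -- back to bytes after some work
```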
I think what David is saying is that ObjC has one string type (NSString/NSMutableString) with several underlying storage representations, including ones that pack short strings into pointers. That fact does not bubble up into several types at the surface layer.
Exactly as @idrougge says: a good string API decouples the abstract data type of a string (a sequence of unicode code points) from the representation of a string and allows you to write efficient code that operates over the abstraction.
NSString (OpenStep’s immutable string type) requires you to implement two methods:
length returns the number of UTF-16 code units in the string (this is a bit unfortunate, but OpenStep was standardised just before UCS-2 stopped being able to store all of unicode. This was originally the number of unicode characters.)
characterAtIndex: returns the UTF-16 code unit at a specific index (again, designing this now, it would be the unicode character).
There is also an optional -copyCharacters:inRange:, which amortises Objective-C’s dynamic dispatch and bounds-checking costs by performing a batched sequence of -characterAtIndex: calls. You don’t have to provide this, but things are a lot faster if you do (the default implementation calls -characterAtIndex: in a loop). You can also provide custom implementations of various other generic methods if you can do them more efficiently in your implementation (for example, searching may be more efficient if you convert the needle to your internal encoding and then search).
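Transposed into Haskell, that minimal interface might look something like this. CodeUnitString and its methods are invented names, just to show the shape: two required primitives plus an overridable batched accessor whose default is the slow loop:

```haskell
-- Hypothetical sketch of an NSString-style abstract interface in Haskell.
-- CodeUnitString, codeUnitLength, codeUnitAt and copyCodeUnits are made-up names.
import Data.Word (Word16)

class CodeUnitString s where
  -- the two required primitives
  codeUnitLength :: s -> Int
  codeUnitAt     :: s -> Int -> Word16

  -- optional batched accessor; the default just loops over codeUnitAt,
  -- but a concrete representation can override it with something much faster
  copyCodeUnits :: s -> Int -> Int -> [Word16]
  copyCodeUnits s start len = [ codeUnitAt s i | i <- [start .. start + len - 1] ]
```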
There are a couple of lessons that ICU learned from this when it introduced UText. The most important is that it’s often useful to be able to elide a copy. The ICU version (and, indeed, the Objective-C fast enumeration protocol, which sadly doesn’t work on strings) provides a buffer: when asked for a particular range, you can either copy characters into that buffer or hand back a pointer to internal storage, and you may return fewer characters than were asked for. If your internal representation is a linked list (or skip list, or tree, or whatever) of arrays of unicode characters then you can return each buffer in turn while iterating over the string.
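Haskell's lazy Text is already close to this shape: it is a list of contiguous chunks, and Data.Text.Lazy exposes them directly, so a consumer can process one buffer at a time instead of demanding a single flat array. For example:

```haskell
-- Consuming a lazy Text one underlying buffer at a time, roughly in the spirit
-- of UText's "hand me the next chunk" interface. foldrChunks is a real
-- Data.Text.Lazy function; chunkSizes is just an illustrative consumer.
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

chunkSizes :: TL.Text -> [Int]
chunkSizes = TL.foldrChunks (\chunk rest -> T.length chunk : rest) []
```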
The amount of performance that most languages leave on the floor by mandating that text be stored in contiguous memory (or else that users write their entire set of text-manipulation routines without being able to take advantage of any optimised algorithms in the standard library) is quite staggering.
a good string API decouples the abstract data type of a string (a sequence of unicode code points) from the representation of a string and allows you to write efficient code that operates over the abstraction.
How, when different abstractions have different tradeoffs? ASCII is single-byte, UTF-8 and UTF-16 are not, and so indexing into them at random character boundaries is O(1) vs. O(n). The only solution to that I know of is to… write all your code as if it were a variable-length string encoding, at which point your abstract data type can’t do as well as a specialized data type in certain cases.
Tangentially, you can find the start of the next (or previous) valid codepoint from a byte index into a UTF-8 or UTF-16 string with O(1) work. In UTF-8, look for the next byte that doesn’t start with “0b10” in the upper two bits. In a known valid UTF-8 string it’ll occur within at most 6 bytes. :)
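A sketch of that scan over a strict ByteString assumed to hold valid UTF-8 (nextBoundary is a made-up helper name):

```haskell
-- Find the start of the next code point at or after byte index i.
-- Continuation bytes are exactly those whose upper two bits are 0b10.
import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word8)

isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

nextBoundary :: B.ByteString -> Int -> Int
nextBoundary bs i = go (max 0 i)
  where
    go j
      | j >= B.length bs              = B.length bs
      | isContinuation (B.index bs j) = go (j + 1)
      | otherwise                     = j
```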
(Indexing into a unicode string at random codepoint indices is not a great thing to do because it’s blind to grapheme cluster boundaries.)
Serious question, have you ever actually indexed randomly into ASCII strings as opposed to consuming them with a parser? I can’t personally think of any cases in my career where fixed-width ASCII formats have come up.
I have, yes, but only once for arbitrary strings. I was writing a simple mostly-greedy line-breaking algorithm for fixed-width fonts, which started at character {line length} and then walked forwards and backwards to find word breaks and to find a hyphenation point. Doing this properly with the dynamic programming algorithm from TeX, in contrast, requires iterating over the string, finding potential hyphenation points, assigning a cost to each one, and finally walking the matrix to find the minimal cost for the entire paragraph.
I’ve also worked with serialised formats that used fixed-width text records. For these, you want to split each line on fixed character boundaries. These are far less common today, now that using something like JSON adds only a small amount of size (too much in the ’80s, negligible today) and a lot more flexibility.
For parallel searching, it’s quite useful to be able to jump to approximately half (/ quarter / eighth / …) of the way along a string, but that can be fuzzy: you don’t need to hit the exact middle. If you can ask for an iterator about half way along, the implementation can pick a point half way along and then scan forwards to find a character boundary.
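As a sketch, again over a ByteString assumed to hold valid UTF-8 and with a hypothetical splitNearMiddle helper: pick the byte midpoint, nudge forward past any continuation bytes, and split there.

```haskell
-- Split roughly in half for divide-and-conquer work, adjusting the split
-- point forward so that neither half starts in the middle of a code point.
import qualified Data.ByteString as B
import Data.Bits ((.&.))

splitNearMiddle :: B.ByteString -> (B.ByteString, B.ByteString)
splitNearMiddle bs = B.splitAt (boundary (B.length bs `div` 2)) bs
  where
    boundary j
      | j >= B.length bs              = B.length bs
      | B.index bs j .&. 0xC0 == 0x80 = boundary (j + 1)  -- continuation byte
      | otherwise                     = j
```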
More commonly, I’ve done ‘random access’ into a string because integers were the representation that the string exposed for iterators. It’s very common to iterate over a string, and then want to backtrack to some previous point. The TeX line breaking case is an example of this: for every possible hyphenation point, you capture a location in the string when you do the forward scan. You then need to jump to those points later on. For printed output, you probably then do a linear scan to convert the code points to glyphs and display them, so you can just use an integer (and insert the hyphen / line break when you reach it), but if you’re displaying on the screen then you want to lay out the whole paragraph and then skip to the start of the first line that is partially visible.
ICU’s UText abstraction is probably the best abstract type that I’ve seen for abstracting over text storage representations. It even differentiates between ‘native’ offsets and code unit offsets, so that you can cache the right thing. The one thing I think NSString does better is to have a notion of the cheapest encoding to access. I’d drop support for anything except the unicode serialisations in this, but allow 7-bit ASCII (in 8-bit integers), UTF-8, UTF-16, UTF-32 (and, in a language that has native U24 support, raw unicode code points in 24-bit integers) so that it’s easy to specialise your algorithm for a small number of cases that should cover any vaguely modern data and just impose a conversion penalty on people bringing data in from legacy encodings. There are good reasons to prefer three of the encodings from that list:
ASCII covers most text from English-speaking countries and is fixed-width, so cheap to index.
UTF-8 is the densest encoding for any alphabetic language (important for cache usage).
UTF-16 is the densest encoding for CJK languages (important for cache usage).
UTF-32 and U24 unicode characters are both fixed-width encodings (where accessing a 32-bit integer may be very slightly cheaper than a 24-bit one on modern hardware), though it’s still something of an open question to me why you’d want to be able to jump to a specific unicode code point in a string, even though it might be in the middle of a grapheme cluster.
Apple’s NSString implementation has a 6-bit encoding for values stored in a single pointer, which is an index into a tiny table of the 64 most commonly used characters based on some large profiling thing that they’ve run. That gives you a dense fixed-width encoding for a large number of strings. When I added support for hiding small (7-bit ASCII) strings in pointers, I reduced the number of heap allocations in the desktop apps I profiled by over 10% (over 20% of string allocations); I imagine that Apple’s version does even better.
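In Haskell you can play the same game at the value level rather than with tagged pointers. This toy Small type mirrors the 7-bit-ASCII-in-a-word idea: up to seven ASCII bytes plus a length packed into one Word64, anything larger boxed on the heap. It is entirely hypothetical, just to show the shape of the trick:

```haskell
-- Toy "small string in a machine word": byte 0 holds the length,
-- bytes 1..7 hold up to seven ASCII characters; anything else is boxed.
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.|.), (.&.))
import Data.Word (Word64)

data Small
  = Inline !Word64
  | Boxed  B.ByteString

pack :: B.ByteString -> Small
pack bs
  | B.length bs <= 7 && B.all (< 0x80) bs = Inline packed
  | otherwise                             = Boxed bs
  where
    packed = foldr step (fromIntegral (B.length bs)) (zip [1 ..] (B.unpack bs))
    step (i, c) acc = acc .|. (fromIntegral c `shiftL` (8 * i))

unpackSmall :: Small -> B.ByteString
unpackSmall (Boxed bs) = bs
unpackSmall (Inline w) =
  let len = fromIntegral (w .&. 0xFF)
  in B.pack [ fromIntegral ((w `shiftR` (8 * i)) .&. 0xFF) | i <- [1 .. len] ]
```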
I’ve written code in Julia that uses the generic string functions and then passed in an ASCIIStr instead of a normal (UTF-8) string and got speedups for free (i.e. without changing my original code).
Obviously if your algorithm’s performance critically depends on e.g. constant time random character access then you’re not going to be able to just ignore the string type, but lots of the time you can.
indexing into them at random character boundaries is O(1) vs. O(n).
Raku creates synthetic codepoints for any grapheme that’s represented by multiple codepoints, and so has O(1) indexing. So that’s another option/tradeoff.
Julia similarly allows O(1) indexing into its UTF-8 strings, but will throw an error if you give an index that is not the start of a codepoint.

But that’s just UTF-8 code units, i.e. bytes; you can do that with C “strings”. :)

Not grapheme clusters, not graphemes, not even code points, and not what a human would consider a character.
If you have the string "þú getur slegið inn leitarorð eða hakað við ákveðinn valmöguleika" and want to get the [42]nd letter, ð, indexing into bytes isn’t that helpful.
Oh, I see I misunderstood. So Raku is storing vectors of graphemes with multi-codepoint graphemes treated as a codepoint. Do you know how it does that? A vector of 32bit codepoints with the non-codepoint numbers given over to graphemes + maybe an atlas of synthetic codepoint to grapheme string?
How, when different abstractions have different tradeoffs? ASCII is single-byte, UTF-8 and UTF-16 are not, and so indexing into them at random character boundaries is O(1) vs. O(n).
Assuming that your data structure is an array, true. For non-trivial uses, that’s rarely the optimal storage format. If you are using an algorithm that wants to do random indexing (rather than small displacements from an iterator), you can build an indexing table. I’ve seen string representations that store a small skip list so that they can rapidly get within a cache line of the boundary and then can do a linear scan (bounded to 64 bytes, so O(1)) to find the indexing point.
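A sketch of that kind of look-aside index over a UTF-8 ByteString, using an unboxed Vector for the offset table. Indexed, mkIndexed and byteOffsetOf are invented names; a production version would build the table lazily and keep it far more compact:

```haskell
-- Remember the byte offset of every 64th code point; "give me code point n"
-- is then one table lookup plus a linear scan of at most 64 code points.
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as VU
import Data.Bits ((.&.))
import Data.Word (Word8)

stride :: Int
stride = 64

isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Byte offset of the code point that follows the one starting at j.
nextCp :: B.ByteString -> Int -> Int
nextCp bs j = go (j + 1)
  where go k | k < B.length bs && isCont (B.index bs k) = go (k + 1)
             | otherwise                                = k

data Indexed = Indexed
  { ixBytes   :: B.ByteString
  , ixOffsets :: VU.Vector Int   -- byte offsets of code points 0, 64, 128, ...
  }

mkIndexed :: B.ByteString -> Indexed
mkIndexed bs = Indexed bs (VU.fromList (walk 0 0))
  where
    walk byteIx cpIx
      | byteIx >= B.length bs  = []
      | cpIx `mod` stride == 0 = byteIx : walk (nextCp bs byteIx) (cpIx + 1)
      | otherwise              = walk (nextCp bs byteIx) (cpIx + 1)

-- Byte offset of code point n: jump to the nearest stored offset, then scan.
byteOffsetOf :: Indexed -> Int -> Int
byteOffsetOf (Indexed bs offs) n =
  let start = offs VU.! (n `div` stride)
  in iterate (nextCp bs) start !! (n `mod` stride)
```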
If you want to be able to handle insertion into the string then a contiguous array is one of the worst data structures because inserting a single character is an O(n) operation in the length of the string. It’s usually better to provide a tree of bounded-length contiguous ranges and split them on insert. This also makes random indexing O(log(n)) because you’re walking down a tree, rather than doing a linear scan.
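A toy rope in that tree-of-chunks style, over Text chunks: inserting in the middle only rebuilds a path of the tree instead of copying the whole string. Rope and insertAt are hypothetical, and real ropes also rebalance, which this skips:

```haskell
import qualified Data.Text as T

maxChunk :: Int
maxChunk = 128

data Rope
  = Leaf T.Text
  | Node Int Rope Rope           -- cached length of the left subtree

ropeLength :: Rope -> Int
ropeLength (Leaf t)     = T.length t
ropeLength (Node l _ r) = l + ropeLength r

insertAt :: Int -> T.Text -> Rope -> Rope
insertAt i new (Node l left right)
  | i <= l    = Node (l + T.length new) (insertAt i new left) right
  | otherwise = Node l left (insertAt (i - l) new right)
insertAt i new (Leaf t)
  | T.length t + T.length new <= maxChunk =
      let (a, b) = T.splitAt i t in Leaf (T.concat [a, new, b])
  | otherwise =                                   -- chunk too big: split it
      let (a, b) = T.splitAt i t
      in Node (T.length a + T.length new) (Leaf (a <> new)) (Leaf b)
```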
I really miss working in the NS* world.

ByteString isn’t a string type though, it’s a binary sequence type. You should never use it for text.

ByteString is the type you read UTF-8 encoded data into, then validate it is properly encoded before converting into a Text - it is widely used in places where people use “Strings” in other languages, like IO, because it is the intermediate representation of specific bytes. It fits very well with the now common Haskell mantra of [parse, don’t validate](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/) - we know we have some data, and we need a type to represent it; we parse it into a Text which we then know is definitely valid (which these days is just a zero-copy validation from a UTF-8 encoded ByteString). It’s all semantics, but we’re quite happy talking about ByteStrings as one of the string types, because it represents a point in the process of dealing with textual data. Not all ByteStrings are text, but all texts can be ByteStrings.
This comment reads very much like you’re quite ignorant of the actual state of strings in Haskell, particularly given how many people complain that we have too many representations.
Also, this article is specifically about code which relies on internal details of a type, so I’m not sure how your suggestions help at all - this algorithm would need to be written for the specific representations actually used to be efficient.
One thing I have wanted to do for a while is add succinct structures to UTF-8 strings which allow actual O(1) indexing into the data, but that’s something that can be built on top of both the Text and ByteString types.
This comment reads very much like you’re quite ignorant of the actual state of strings in Haskell, particularly given how many people complain that we have too many representations.
I don’t use Haskell but the complaints that I hear from folks that do are nothing to do with the number of representations, they are to do with the number of abstract data types that you have for strings and the fact that each one is tied to a specific representation.
Whether text is stored as a contiguous array of UTF-{8,16,32} or ASCII characters, as a tree of runs of characters in some encoding, embedded in an integer, or in some custom representation specifically tailored to a specific use should affect performance but not semantics of any of the algorithms that are built on top. You can then specialise some of the algorithms for a specific concrete representation if you determine that they are a performance bottleneck in your program.
One thing I have wanted to do for a while is add succinct structures to UTF-8 strings which allow actual O(1) indexing into the data, but that’s something that can be built on top of both the Text and ByteString types.
It’s something that can be built on top of any string abstract data type but cannot be easily retrofitted to a concrete type that exposes the implementation details without affecting the callers.
number of abstract data types that you have for strings and the fact that each one is tied to a specific representation
The types are the representations.
You can write algorithms that would work with any of String and Text and Lazy.Text in Haskell using the mono-traversable package.
However, that whole bunch of complexity is only justified if you’re writing a library of complex reusable text algorithms without any advanced perf optimizations. Otherwise in practice there just doesn’t seem to be that much demand for indirection over string representations. Usually a manual rewrite of an algorithm for another string type is faster than adding that whole package.
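A small illustration of that approach: countSpaces is a made-up example function, while MonoFoldable, Element and ofoldr are the real classes and functions from Data.MonoTraversable, so the one definition accepts String, strict Text and lazy Text alike.

```haskell
{-# LANGUAGE TypeFamilies #-}
-- One definition that works for String, strict Text and lazy Text,
-- via mono-traversable's MonoFoldable class.
import Data.MonoTraversable (Element, MonoFoldable, ofoldr)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

countSpaces :: (MonoFoldable mono, Element mono ~ Char) => mono -> Int
countSpaces = ofoldr (\c n -> if c == ' ' then n + 1 else n) 0

demo :: (Int, Int, Int)
demo = ( countSpaces ("one two three" :: String)
       , countSpaces (T.pack "one two three")
       , countSpaces (TL.pack "one two three")
       )
```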