In case anyone’s wondering why Rust has both ToString and the seemingly redundant Display, instead of just .to_string() on Display itself: that’s because Display is defined in core, which doesn’t use any memory allocator. String is heap-allocated, so it couldn’t have been used in Display’s definition. Rust’s core/std split creates arbitrary fragmentation like that. OTOH io::Write mentions Vec, so core can’t include it even for its non-allocating methods.
Great tip! I updated the article and mentioned this!
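To make the blanket-impl point concrete, here is a minimal sketch (Greeting is just a made-up example type): implementing Display is enough, because the allocating layer of the standard library provides a blanket ToString impl covering every Display type.
use std::fmt;

struct Greeting;

impl fmt::Display for Greeting {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "hello")
    }
}

fn main() {
    // `to_string()` is not a method of `Display` itself; it comes from the
    // blanket `ToString` impl for all `Display` types, which lives in the
    // allocating layer of the standard library rather than in `core`.
    assert_eq!(Greeting.to_string(), "hello");
}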
This isn’t really the essential part of your article, but seeing that string manipulation to get your asterisks halfway through the article made me wonder if we can do better
&self
    .secret
    .chars()
    .enumerate()
    .map(|(i, c)| if i < 4 { c } else { '*' })
    .collect::<String>(),
In Python I would probably do
secret[:3] + '*' * len(secret[3:])
I wonder what the cleanest way to do this in Rust would look like…
Implicit in burntsushi’s response, but to make it clear: It’s done that way in Rust because Rust strings are UTF-8 encoded, so you can’t slice at arbitrary locations. There’s no way to know where the character boundaries are besides iterating over the string, unless you know your string only contains 1-byte characters. I would like to see the equivalent code in Go; I expect it looks like Rust.
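As a rough sketch of what “iterate to find the boundary” looks like in Rust (byte_offset_of_nth_char is a made-up helper name, not something from the original discussion):
fn byte_offset_of_nth_char(s: &str, n: usize) -> Option<usize> {
    // Walk the string; char_indices yields (byte_offset, char) pairs.
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    let s = "héllo"; // 'é' takes two bytes in UTF-8
    assert_eq!(byte_offset_of_nth_char(s, 2), Some(3)); // first 'l' starts at byte 3
    assert_eq!(byte_offset_of_nth_char(s, 99), None);
}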
Python unicode strings are all UTF-32, or whatever you call it when each character (codepoint? whatever) is 32 bits long, so you can index on word boundaries and always get something semi-sensible-ish out.
Python unicode strings are all UTF-32, or whatever you call it when each character (codepoint? whatever) is 32 bits long, so you can index on word boundaries and always get something semi-sensible-ish out.
This is incorrect.
Prior to Python 3.3, internal Unicode storage was a compile-time flag and set the storage for every Unicode object that interpreter would ever construct: “narrow” builds of Python used two-byte code units and surrogate pairs, while “wide” builds used four-byte code units. This had the disadvantage of wasting memory on strings containing only low code points, and the misfeature that on “narrow” builds surrogate pairs leaked to the programmer when certain operations – iteration, indexing, etc. – were performed.
Python 3.3 implemented PEP 393, and since then the internal storage of Unicode has been chosen on a per-string basis. When constructing a str object, Python determines the highest code point in the string, and then selects the narrowest encoding capable of handling that code point in a single code unit: latin-1, UCS-2, or UCS-4.
As a result, Python’s internal storage for any given string is fixed-width, which is nice for APIs, without wasting tons of space by storing, say, a string containing only ASCII code points in a multi-byte encoding. This also sometimes beats a pure UTF-8 approach, because UTF-8 would need two-byte code units in some strings that Python can do entirely with one-byte units.
Interesting, thanks. I believe the salient point here though is that Python string indexing/slicing is done, logically, by codepoint, regardless of the internal representation. While @icefox might have gotten the representation details wrong, do you agree with the central point?
Python splits by code point; the relevant bit is in Objects/unicodeobject.c:PyUnicode_Substring. So yeah, you end up splitting on codepoints.
Rust chars splits on Unicode Scalar Values. According to the Unicode glossary, Unicode Scalar Values are code points except for surrogates.
I’m not super well versed in Unicode stuff, but my gut reading is that Rust is effectively getting the same results as Python in practice here? Except Python decides to pay a bit of a representation cost for strings that require it by opting for UCS-4 when you have multi-byte stuff, in exchange for easier slicing semantics. In any case Python’s solution seems to be “just don’t use variable-byte sequences”. Both Python and Rust end up with some definition of character anyways.
I think that Rust’s values on this stuff are different from Python’s. I think Python tries to do “the right thing” for the most stuff possible, and is fine with inconsistent performance to get that working, whereas Rust will opt for APIs with consistent performance characteristics (see also how std::String doesn’t opt for small string optimization). Which is fine, of course.
This is all moot if Unicode Scalar Values in fact do a bunch of character collapsing and get you to a “real” definition of chars, but looking at the implementation etc, I don’t feel like Rust solves the “what is a character, even?” question for Unicode in a way that makes chars smarter than a Python character index operation. I really want a set of falsifiable operations for these discussions (the way you have example calculations for floats/doubles to illustrate points better…)
The main difference, at a high level, between Rust and Python strings is that the representation of a Rust string is an API guarantee: it is UTF-8. Consequently, there are various APIs that make use of this guarantee. For example, in Rust, string[i] is not defined, but string[m..n] is, where m and n are byte offsets into the UTF-8 encoded bytes that make up string. If m or n fall on an invalid UTF-8 boundary or are otherwise out of bounds, then the operation panics. (In Rust-speak, that’s an “infallible” API, since the caller is responsible for ensuring the inputs are correct. But there are also less-ergonomic “fallible” APIs that don’t panic.)
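A small illustration of that byte-offset slicing behavior (illustrative values only):
fn main() {
    let s = "héllo"; // 'é' occupies bytes 1..3 in UTF-8

    // Slicing on valid UTF-8 boundaries works:
    assert_eq!(&s[0..1], "h");
    assert_eq!(&s[1..3], "é");

    // This would panic at runtime, because byte offset 2 falls inside 'é':
    // let _ = &s[0..2];
}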
In this particular case, you’re correct, I believe you wind up with the same thing. Because the OP isn’t slicing by byte offset, but rather, is specifically accumulating the codepoints from the string. Importantly, Rust doesn’t expose any APIs for indexing or slicing by codepoint. That is, Python’s indexing/slicing operation simply does not exist in Rust land. This is somewhat a consequence of fixing to a UTF-8 representation, but is also somewhat philosophical: a codepoint doesn’t have a ton of semantic value on its own. Codepoints are more like an implementation detail of Unicode specifications. It’s an atom on which Unicode is built. But codepoints are only incidentally “visual characters” or graphemes in some cases (admittedly, many). So if you write code that assumes “one visual character” is “one codepoint,” well, then, that code is going to wind up being incorrect. For example, in the OP’s code (and yours), you might split a grapheme in part and end up with text that is rendered in a garbled form.
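As a hedged illustration of that last point, take “café” written in decomposed form, where the final visual character is two codepoints (‘e’ plus a combining accent):
fn main() {
    // "café" in decomposed (NFD) form: the last visual character is the
    // codepoint 'e' followed by U+0301 COMBINING ACUTE ACCENT.
    let s = "cafe\u{0301}";

    // Keep the first four *codepoints*, as the masking code above does:
    let first_four: String = s.chars().take(4).collect();

    // The combining accent was dropped, so the text now renders as "cafe",
    // not "café": the grapheme was split.
    assert_eq!(first_four, "cafe");
}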
What I’m circuitously getting at here is that Python makes the wrong thing easy, but also, to be fair, it might take you a while to notice that it’s the wrong thing. And it depends on what you’re doing. For example, if secrets are really always ASCII (even if never validated as such), then you’ll never see any ill effects: the OP’s and your code will always produce the desired results. It will even do so if the text contains non-ASCII text, but otherwise has no multi-codepoint graphemes (perhaps it’s in NFC form, although that just decreases the likelihood of observing multi-codepoint graphemes).
This issue is related: https://github.com/BurntSushi/aho-corasick/issues/72
(Apologies if you know some of this stuff already. I got a little verbose, mostly because I have a lot of experience explaining this to people, and I get a lot of blank stares. It took a long time for this stuff to click for me as well, and largely that happened only after I started implementing some of the Unicode spec.)
What I’m circuitously getting at here is that Python makes the wrong thing easy, but also, to be fair, it might take you a while to notice that it’s the wrong thing.
Amusingly, I might say the same about Rust, and anything else which is fundamentally built around a “string” type that’s actually a sequence-of-encoded-bytes.
In Python you may slice in the middle of a grapheme. But you will never accidentally slice in the middle of a code point and end up with something that isn’t Unicode – the string APIs Python exposes simply don’t allow you to do that (the bytes type would, of course, let you do that, but the whole point is that bytes is not str).
The absolute ideal, of course, is a language where the string abstraction is a sequence of graphemes. Swift does that. But the second-best in my opinion is a language where the string abstraction is a sequence of code points. Far, far, far to the rear of that in last place is a language where the “string” isn’t a string at all but a sequence of bytes. It’s hard enough already to handle the complexity of Unicode; having to do it while also keeping track of the raw bytes and the way they map to higher-level Unicode abstractions is just too much to pile on.
I know Rust doesn’t take this view (and probably 95% of my complaints would go away if Rust just called it UTF8Bytes instead of using string-y names), but the more I see of how people mess up just Unicode itself, the less I want to give them yet other layers to mess up while they’re at it.
But you will never accidentally slice in the middle of a code point and end up with something that isn’t Unicode
You don’t in Rust either. As I said, if your slice offsets split a codepoint, then the slicing operation panics. (And there are fallible APIs for when you don’t want a panic.)
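Concretely, the fallible counterpart is str::get, which returns an Option instead of panicking; a minimal sketch:
fn main() {
    let s = "héllo";

    // Byte offset 2 splits the two-byte 'é', so the fallible API says "no":
    assert_eq!(s.get(0..2), None);

    // A valid boundary gives you the slice back:
    assert_eq!(s.get(0..3), Some("hé"));
}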
Far, far, far to the rear of that in last place is a language where the “string” isn’t a string at all but a sequence of bytes. It’s hard enough already to handle the complexity of Unicode; having to do it while also keeping track of the raw bytes and the way they map to higher-level Unicode abstractions is just too much to pile on.
To be honest, I don’t know what you’re talking about. You don’t have to keep track of anything. I’ve implemented many of Unicode’s algorithms in Rust, which in turn are used to provide nice higher level APIs.
It is plausible that a sequence of graphemes is the nicest representation in theory, but one of the nice properties of using UTF-8 directly is that there is very little cost associated with it. And at least with respect to the codepoint representation, you can write code that deals with exactly that encoding and only that encoding. (If Python strings use one of three possible representations and you need to implement some lower level primitive on it, then it seems to me like you either need to make it work over all of its possible representations or you need to pay to force the string into a particular representation.)
I know Rust doesn’t take this view (and probably 95% of my complaints would go away if Rust just called it UTF8Bytes instead of using string-y names), but the more I see of how people mess up just Unicode itself, the less I want to give them yet other layers to mess up while they’re at it.
I cannot really fathom how most of your complaints would go away by just changing the name. A string by any other name is, still, just a string.
My view comes down to the simple observation that bytes are not strings. Forcing the “string” type to actually be a byte sequence comes with a whole pile of issues. For example, you mention that:
in Rust, string[i] is not defined, but string[m..n] is, where m and n are byte offsets into the UTF-8 encoded bytes that make up string. If m or n fall on an invalid UTF-8 boundary or are otherwise out of bounds, then the operation panics.
This only happens when the “string” type is actually a sequence of things that aren’t some atomic unit of Unicode, such as when the “string” is really a sequence of encoded bytes.
Python used to do this – the old Python 2 str type was actually a byte sequence, and even up to 3.3 it was more often a sequence of UTF-16 code units, and that was a miserable approach for writing many types of applications, because all that does is add yet another layer of complexity on top of what you already need to know just to successfully work with Unicode. For example:
I’ve implemented many of Unicode’s algorithms in Rust, which in turn are used to provide nice higher level APIs.
In my ideal string implementation, the language or at the very least its standard library would provide sufficient implementation of Unicode, its abstractions, and its algorithms that implementing yourself outside of that core support wouldn’t be an idea that would occur to anyone, let alone something you actually might have to do.
And again it comes back to the fact that a byte sequence is not a string. A byte sequence doesn’t know the semantics of a string. And a byte sequence is very difficult to bolt the semantics onto, and even more difficult to bolt ergonomic, natural-feeling string APIs onto.
I know Rust has decided to go with byte sequences anyway. I know it’s done for reasons that, to the folks behind Rust, seem perfectly reasonable and sync up with their goals for the language. But in the world where I work (primarily web apps, a world Rust seemingly would like to expand into at some point), we spent most of this century so far demonstrating at length why byte sequences are not strings and are a poor substitute for them.
I cannot really fathom how most of your complaints would go away by just changing the name. A string by any other name is, still, just a string.
Many of my complaints would go away in the same sense that many of my complaints about Python’s byte-sequence type went away when it stopped being called str and started being called bytes, and stopped carrying expectations that it could take the place of a real string type.
If Rust just said “we have UTF8Bytes, and implementing string operations and APIs is an exercise left to the reader, or to the first person to do a good crate for it”, I’d be less down on it. But Rust insists on calling it a string type when it very plainly isn’t.
My view comes down to the simple observation that bytes are not strings.
Very strongly disagree. You can be prescriptive about this, but it’s descriptively not true. Many programming languages define “string” as some variant of a bundle of bytes. Sometimes it’s “arbitrary non-NUL bytes.” Sometimes it’s, “arbitrary bytes, but conventionally UTF-8.” Sometimes it’s, “UTF-8 validity required.” But the point is that this representation is exposed, and these things are called strings. You saying they aren’t just makes things confusing.
You could make an argument that they shouldn’t be strings, and I would say that I have no interest in that argument because it’s just chasing tails around the maypole at this point. But to say that any of the above things factually are not strings is just in contradiction with common vernacular.
This only happens when the “string” type is actually a sequence of things that aren’t some atomic unit of Unicode, such as when the “string” is really a sequence of encoded bytes.
So?
Python used to do this – the old Python 2 str type was actually a byte sequence, and even up to 3.3 it was more often a sequence of UTF-16 code units, and that was a miserable approach for writing many types of applications, because all that does is add yet another layer of complexity on top of what you already need to know just to successfully work with Unicode.
There are many things that contributed to Python 2 strings being miserable. Python 3 strings have been pretty miserable for me too. But if I’m being honest, that just might be because Python is unityped. I don’t know. And I haven’t used Python in anger since it started getting type hints. I don’t know how effective they are. I probably won’t ever know.
In my ideal string implementation, the language or at the very least its standard library would provide sufficient implementation of Unicode, its abstractions, and its algorithms that implementing yourself outside of that core support wouldn’t be an idea that would occur to anyone, let alone something you actually might have to do.
I am one of the people doing that work. I maintain Rust’s regex library. I maintain crates that define string types. I maintain the csv, aho-corasick, and ucd-generate crates, and more.
And this is shifting the goalposts anyway. You can’t be all things to all people. Even if you provide standard abstractions to various Unicode-y things, lower level libraries are still likely to want to build their own to optimize certain access patterns.
And again it comes back to the fact that a byte sequence is not a string. A byte sequence doesn’t know the semantics of a string. And byte sequence is very difficult to bolt the semantics onto, and even more difficult to bolt ergonomic, natural-feeling string APIs onto.
Go does it quite successfully. As does my bstr crate.
That doesn’t match my experience. When I ship a program that reads files on millions of developer systems, my program doesn’t know the encoding of that file. Discovering the encoding is a non-option because of performance. So what do I do? I treat the file as a single contiguous string of bytes. Doing anything else is a non-starter.
Despite this, my tooling (ripgrep) has very good Unicode support. Certainly better than GNU grep and most other grep tools out there.
we spent most of this century so far demonstrating at length why byte sequences are not strings and are a poor substitute for them.
I don’t think you have. Actually, I haven’t seen one good reason from you yet why they are bad.
Many of my complaints would go away in the same sense that many of my complaints about Python’s byte-sequence type went away when it stopped being called str and started being called bytes, and stopped carrying expectations that it could take the place of a real string type.
But Python 3 did a lot more than just rename some types. And the naming wasn’t really a matter of “incorrectly” calling byte sequences “strings,” but rather, having both a str and a unicode type. Now that’s weird. So fixing the names wasn’t just about moving off of byte sequences, but just making them holistically sensible. Moreover, Python 3 has made a lot of other changes with respect to string types in order to make it work. In this conversation, you’re acting as if all the problems boiled down to calling byte sequences strings, and once that was fixed, all was good. But that’s a pretty narrow and incomplete view of what happened.
If Rust just said “we have UTF8Bytes, and implementing string operations and APIs is an exercise left to the reader, or to the first person to do a good crate for it”, I’d be less down on it. But Rust insists on calling it a string type when it very plainly isn’t.
Again, you’re conflating is and ought. You’re more than welcome to advocate for (many) programming languages to stop using the word “string” unless it meets your very specific representational requirements. But I don’t think you’ll get very far.
My experience is that the representation doesn’t make a whole lot of difference to whether something can be effectively used as a textual type or not. You’ve given literally zero examples of something useful that a Python string can do that a Rust string cannot. The only possible answer to that is that a Python string has constant time index/slicing by codepoint, but as I’ve covered already, this is almost never what you want and makes it easy to split grapheme clusters.
Thanks for the detailed reply. For context on my end, I work on a multi-lingual Python app, and Py2 -> Py3 really helped to cement my belief that Python is “right” on this front. I recently also found an issue in an OCaml application that was basically linked to its (in my view, faulty) treatment of text as “a string of bytes”, so I do believe pretty strongly in a structure having some sort of guarantee about textual validity.
In Python you have str (that I end up calling text in my head) and bytes. Your text is “textual input”, for the common understanding of that. It abstracts away the encoding details etc (a common source of bugs), and yeah, you’re not working with bytes! You want bytes? You gotta explicitly encode it. Thanks to that, any “text”-y operations tend to require you to deal with encodings seriously.
I feel like Rust tries this somewhat with String’s “this is always valid utf-8” stuff. After all, if String were just a bag of bytes, then you wouldn’t say something like this. But it ends up exposing byte representations through slicing anyways! And yeah, you get fallible APIs, but Python isn’t being inconsistent. It has slicing semantics for text that give text. Rust doesn’t have slicing semantics for text, it has byte slicing.
In my ideal world (and, y’know, just me, with my specific wants and things I do/don’t care about), we would have ByteSequence and Text, so you could get String-like APIs (I get that String is a Vec wrapper, but that’s also the problem IMO), but you could differentiate between “I actually don’t care what’s in this data/I want the raw bytes” vs “This is meant to be consumed as text” (for various definitions). That way you would be forced to establish where text-ness is relevant.
I think that Rust’s current str/String semantics try too hard to support emojis out of the box (glib for “nice multilingual text support” :) ), while also trying to be a mostly-drop-in replacement for char*. If we were splitting things up into separate types (with the right impl coverage to “cover bases” as needed) I think there would be a lot less confusion, without having to have the philosophical argument about string slicing over and over and over again. Like if you want to work with bytes, work with bytes! (The one dilemma would be that Text would either have to do something like Python’s fixed-width normalization, or it would have O(n) slicing…. but at least when I gotta break out decode/encode in Python, it’s explicit and I don’t have much trouble answering what I need)
EDIT: just another note on the “wrong thing” of splitting graphemes… I think you’re not wrong, and I’ve generally not had issues (despite dealing with text that does have potential grapheme splitting issues), and I think a big part of it is very good text APIs for splitting etc, so I’m not doing index math in the first place. Still would want a non-String alternative but it’s a bit hard for me to say that splitting graphemes is “right”.
I will say that I feel like this conversation has gotten twisted into something I did not intend. I guess I should have left out the bit about “Python making the wrong thing easy.” I guess to me, it seems like a pretty obvious fact: it exposes natural codepoint indexing and slicing APIs, and when you use those without awareness of graphemes, those APIs will let you split a grapheme and wind up with garbled text.
I guess to me, saying that felt like a fact. But based on the replies I got from you and @ubernostrum, I think it came off as a more general claim about which string is “better” and thus a debate ensues.
My previous comment was meant more for educational purposes, to fill in gaps I saw. Not to spawn off a “my language’s string type is better!” debate.
You also have to realize that we have some really intense biases at play here, even aside from language preference. For example, I work on low level text primitives, and have even built a string library that specifically works on raw bytes, without the UTF-8 invariant. I also work on CLI tools that use these primitives and their primary goal is to be fast. In these contexts, I love that Rust’s string type is just a bag of bytes, because it means there’s no extra steps or indirection required for the implementation of the lower level primitives. In other words, the way Rust’s string is defined makes it easy to not only write fast code, but correct code.
But that’s only one angle at which to look at string APIs. There are other angles and use cases and domains.
(The one dilemma would be that Text would either have to do something like Python’s fixed-width normalization, or it would have O(n) slicing…. but at least when I gotta break out decode/encode in Python, it’s explicit and I don’t have much trouble answering what I need)
So maybe we just have a more fundamental disagreement. It sounds to me like you specifically want “index/slice by codepoint” APIs, whereas I think we shouldn’t.
But yes, exactly, you’re getting at the heart of the matter with the performance implications here.
Sorry, don’t mean for this to be a flame war (talking about this helps me think about it too), I think like @ubernostrum if we had multiple string types I’d have less gripes.
But ultimately it’s not the worst thing (at least it’s not a bag of ASCII, right?) and I like the language! And… yeah, I think “fast and correct” is definitely a Rust philosophy. And hey, I can use both languages for various things!
Thanks for the info! My knowledge was over-simplified then.
A very literal formulation:
let input = "this_is_my_input";
let mut output = String::new();
output.push_str(&input[..4]);
output.push_str(&"*".repeat(input.len()-4));
assert!(output == "this************");
But this is prone to error. What if the input only has three characters? What if there’s a multi-byte Unicode character spanning the 4th and 5th bytes? We’ll get a runtime panic. Since the slice is over the bytes, do we want multi-byte characters to become one * or multiple *?
I much prefer the original formulation in so far as it concisely expresses the intent of the transformation as a map on the characters in the string. Every operation feels essential and cannot fail.
I mean none of the operations you listed can fail either, when you have a slicing semantic that accepts “running off the edge”. Just a different way of looking at stuff.
I guess I’m not a fan of “treat strings like a bunch of characters to walk over” in this specific case. It feels too much like busywork of trying to solve for the most efficient implementation of something (that potentially doesn’t matter in the first place). Just my opinion tho
I know it’s not 100% on topic, but it’s still a great conversation point. I am not an expert Rustacean (more of a wannabe at this stage), so I also would love to know if there is any terser way to do that same string manipulation…
If you know your secret is ASCII (and you might actually know that in this case), then you can validate it as such and then make ASCII encoding assumptions later in your code. An example of an ASCII assumption would be that every character is exactly one byte, and thus, you could use &self.secret[..3] to get the first three characters.
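A sketch of that ASCII route (obfuscate is a hypothetical helper; the minimum-length check is made up for illustration):
fn obfuscate(secret: &str) -> Option<String> {
    // Validate the ASCII assumption up front; bail out otherwise.
    if !secret.is_ascii() || secret.len() < 3 {
        return None;
    }
    // Every ASCII character is exactly one byte, so byte slicing is safe here.
    Some(format!("{}{}", &secret[..3], "*".repeat(secret.len() - 3)))
}

fn main() {
    let masked = obfuscate("this_is_my_input").unwrap();
    assert_eq!(masked.len(), 16);
    assert!(masked.starts_with("thi"));
    assert!(masked[3..].chars().all(|c| c == '*'));
    assert_eq!(obfuscate("héllo"), None); // not ASCII
}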
If you cannot make an ASCII assumption and secret really can be any arbitrary UTF-8 encoded text, then chars() is pretty good, but it would probably be even better to use graphemes.
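A sketch of the grapheme version, assuming the third-party unicode-segmentation crate (mask_graphemes is just an illustrative name):
// Requires the third-party `unicode-segmentation` crate.
use unicode_segmentation::UnicodeSegmentation;

fn mask_graphemes(secret: &str) -> String {
    secret
        .graphemes(true) // true = extended grapheme clusters
        .enumerate()
        .map(|(i, g)| if i < 4 { g } else { "*" })
        .collect()
}

fn main() {
    // The decomposed "é" (e + combining accent) survives as a single unit.
    let masked = mask_graphemes("cafe\u{0301}_secret");
    assert!(masked.starts_with("cafe\u{0301}"));
    assert_eq!(masked.matches('*').count(), "_secret".len());
}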
The other interesting bit here is that @rtpg’s solution will throw an exception if len(secret) <= 2, while your code will not, but will instead display the entirety of the secret. My guess is that len(secret) <= 2 can’t actually happen.
My suspicion is that in practice, there is a particular format that you might expect secret to be. (Perhaps a certain length, a minimum length, all hex digits, etc etc.) If that’s true, then I would make construction of Credentials fallible, and return an error when secret (and perhaps also api_key) doesn’t satisfy your assumptions. Later on, you can then use those assumptions in various ways to make code simpler or faster (not that the latter is needed here).
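As a rough sketch of that idea (the Credentials, api_key, and secret names come from the article; the specific rules here, minimum length and ASCII-only, are made up for illustration):
#[derive(Debug)]
struct Credentials {
    api_key: String,
    secret: String,
}

#[derive(Debug)]
enum CredentialsError {
    SecretTooShort,
    SecretNotAscii,
}

impl Credentials {
    fn new(api_key: String, secret: String) -> Result<Self, CredentialsError> {
        // Hypothetical validation rules; the real ones depend on the secret format.
        if secret.len() < 8 {
            return Err(CredentialsError::SecretTooShort);
        }
        if !secret.is_ascii() {
            return Err(CredentialsError::SecretNotAscii);
        }
        Ok(Credentials { api_key, secret })
    }
}

fn main() {
    assert!(Credentials::new("key".into(), "short".into()).is_err());
    assert!(Credentials::new("key".into(), "this_is_my_input".into()).is_ok());
}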
With all that said, I think the way you wrote it is just fine. Secrets are unlikely to be short, and secrets are unlikely to contain multi-codepoint graphemes. And if they do, the worst that will happen is that your obfuscated output will become slightly more obfuscated.
In Python this doesn’t throw an exception, because “string * negative number” gives the empty string (and slicing allows for out of bounds stuff). It’s the semantics that “just work” for common string munging use cases
I agree with your general sentiment about a defined credential shape. Not to mention that in practice you probably wouldn’t even need to have more than the first three characters (does the number of stars really need to align with the underlying secret?) This is definitely a problem where you would find something decent if you had more context.
Oh interesting. I’ve been writing Python for over a decade and I just assumed that providing an invalid end bound to a slice would fail. e.g., I assumed that 'ab'[:3] would throw an exception. TIL.
(Although, I have not written much Python in the past few years, so perhaps I knew this at one point and my knowledge has atrophied.)
Hey, thanks to both for the great suggestions (I didn’t know about graphemes and it’s great learning!).
For the sake of the article, I just wanted to provide a simple example, so I didn’t spend too much time thinking about the shape of the data and the validation constraints. I think I’ll keep the example as it is, but nonetheless, I loved seeing how you reasoned about the data to determine constraints and how those would affect the code.
Well, thanks again :)
Yeah totally, makes sense! It’s always fun to jump into the well of complexity that is strings.