This isn’t really the essential part of your article, but seeing that string manipulation to get your asterisks halfway through the article made me wonder if we can do better:
&self
    .secret
    .chars()
    .enumerate()
    .map(|(i, c)| if i < 4 { c } else { '*' })
    .collect::<String>(),
In Python I would probably do
secret[:3] + '*' * len(secret[3:])
I wonder what the cleanest way to do this in Rust would look like…
Implicit in burntsushi’s response, but to make it clear: It’s done that way in Rust because Rust strings are UTF-8 encoded, so you can’t slice at arbitrary locations. There’s no way to know where the character boundaries are besides iterating over the string, unless you know your string only contains 1-byte characters. I would like to see the equivalent code in Go; I expect it looks like Rust.
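For instance, a minimal sketch of what that boundary rule looks like in practice:

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1 and 2 in UTF-8
    assert!(!s.is_char_boundary(2)); // byte offset 2 lands inside 'é'
    // &s[..2] would panic here, because 2 is not a character boundary.
    // Finding the first two characters means walking the string:
    let first_two: String = s.chars().take(2).collect();
    assert_eq!(first_two, "hé");
}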
Python unicode strings are all UTF-32, or whatever you call it when each character (codepoint? whatever) is 32 bits long, so you can index on word boundaries and always get something semi-sensible-ish out.
Python unicode strings are all UTF-32, or whatever you call it when each character (codepoint? whatever) is 32 bits long, so you can index on word boundaries and always get something semi-sensible-ish out.
This is incorrect.
Prior to Python 3.3, internal Unicode storage was a compile-time flag and set the storage for every Unicode object that the interpreter would ever construct: “narrow” builds of Python used two-byte code units and surrogate pairs, while “wide” builds used four-byte code units. This had the disadvantage of wasting memory on strings containing only low code points, and the misfeature that on “narrow” builds surrogate pairs leaked to the programmer when certain operations – iteration, indexing, etc. – were performed.
Python 3.3 implemented PEP 393, and since then the internal storage of Unicode has been chosen on a per-string basis. When constructing a str object, Python determines the highest code point in the string, and then selects the narrowest encoding capable of handling that code point in a single code unit: latin-1, UCS-2, or UCS-4.
As a result, Python’s internal storage for any given string is fixed-width, which is nice for APIs, but without wasting tons of space by storing, say, a string containing only ASCII code points in a multi-byte encoding. This also sometimes beats a pure UTF-8 approach, because UTF-8 would need two-byte code units for some strings that Python can store entirely with one-byte units.
Interesting, thanks. I believe the salient point here though is that Python string indexing/slicing is done, logically, by codepoint, regardless of the internal representation. While @icefox might have gotten the representation details wrong, do you agree with the central point?
Python splits by code point. In Objects/unicodeobject.c:PyUnicode_Substring, the relevant bit is
kind = PyUnicode_KIND(self);
data = PyUnicode_1BYTE_DATA(self);
return PyUnicode_FromKindAndData(kind,
                                 data + kind * start,
                                 length);
So yeah, you end up splitting on codepoints.
Rust chars splits on Unicode Scalar Values. According to this Unicode glossary, Unicode Scalar Values are code points except for surrogates.
I’m not super well versed in Unicode stuff, but my gut reading is that Rust is effectively getting the same results as Python in practice here? Except Python decides to pay a bit of a representation cost for strings that require it, by opting for UCS-4 when you have multi-byte stuff, in exchange for easier slicing semantics. In any case, Python’s solution seems to be “just don’t use variable-byte sequences”. Both Python and Rust end up with some definition of character anyways.
I think that Rust’s values on this stuff are different from Python’s. I think Python tries to do “the right thing” for the most stuff possible, and is fine with inconsistent performance to get that working, whereas Rust will opt for APIs with consistent performance characteristics (see also how std::String doesn’t opt for small string optimization). Which is fine, of course.
This is all moot if Unicode Scalar Values in fact do a bunch of character collapsing and get you to a “real” definition of chars, but looking at the implementation etc., I don’t feel like Rust solves the “what is a character, even?” question for Unicode in a way that makes chars smarter than a Python character index operation. I really want a set of falsifiable operations for these discussions (like you have example calculations for floats/doubles to illustrate points better…)
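One such falsifiable check, as a sketch (the grapheme side assumes the unicode-segmentation crate as a dependency): a decomposed “é” is two scalar values to chars, and two items to a Python-style codepoint index, but one grapheme.

fn main() {
    use unicode_segmentation::UnicodeSegmentation;

    let s = "e\u{0301}"; // 'é' written as 'e' plus a combining acute accent
    assert_eq!(s.chars().count(), 2);         // two Unicode Scalar Values
    assert_eq!(s.graphemes(true).count(), 1); // one grapheme cluster
    // Keeping only the first "character" by scalar value drops the accent:
    assert_eq!(s.chars().next(), Some('e'));
}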
The main difference, at a high level, between Rust and Python strings is that the representation of a Rust string is an API guarantee: it is UTF-8. Consequently, there are various APIs that make use of this guarantee. For example, in Rust, string[i] is not defined, but string[m..n] is, where m and n are byte offsets into the UTF-8 encoded bytes that make up string. If m or n fall on an invalid UTF-8 boundary or are otherwise out of bounds, then the operation panics. (In Rust-speak, that’s an “infallible” API, since the caller is responsible for ensuring the inputs are correct. But there are also less-ergonomic “fallible” APIs that don’t panic.)
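A small sketch of that distinction (not from the article):

fn main() {
    let s = "héllo";
    // Infallible: the caller promises the byte offsets are valid; otherwise it panics.
    assert_eq!(&s[0..1], "h");
    // Fallible: get() returns None instead of panicking on a bad boundary.
    assert_eq!(s.get(0..2), None); // byte 2 would split the two-byte 'é'
    assert_eq!(s.get(0..3), Some("hé"));
}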
In this particular case, you’re correct, I believe you wind up with the same thing. Because the OP isn’t slicing by byte offset, but rather, is specifically accumulating the codepoints from the string. Importantly, Rust doesn’t expose any APIs for indexing or slicing by codepoint. That is, Python’s indexing/slicing operation simply does not exist in Rust land. This is somewhat a consequence of fixing to a UTF-8 representation, but is also somewhat philosophical: a codepoint doesn’t have a ton of semantic value on its own. Codepoints are more like an implementation detail of Unicode specifications. It’s an atom on which Unicode is built. But codepoints are only incidentally “visual characters” or graphemes in some cases (admittedly, many). So if you write code that assumes “one visual character” is “one codepoint,” well, then, that code is going to wind up being incorrect. For example, in the OP’s code (and yours), you might split a grapheme apart and end up with text that is rendered in a garbled form.
What I’m circuitously getting at here is that Python makes the wrong thing easy, but also, to be fair, it might take you a while to notice that it’s the wrong thing. And it depends on what you’re doing. For example, if secrets are really always ASCII (even if never validated as such), then you’ll never see any ill effects: the OP’s and your code will always produce the desired results. It will even do so if the text contains non-ASCII text, but otherwise has no multi-codepoint graphemes (perhaps it’s in NFC form, although that just decreases the likelihood of observing multi-codepoint graphemes).
This issue is related: https://github.com/BurntSushi/aho-corasick/issues/72
(Apologies if you know some of this stuff already. I got a little verbose, mostly because I have a lot of experience explaining this to people, and I get a lot of blank stares. It took a long time for this stuff to click for me as well, and largely that happened only after I started implementing some of the Unicode spec.)
What I’m circuitously getting at here is that Python makes the wrong thing easy, but also, to be fair, it might take you a while to notice that it’s the wrong thing.
Amusingly, I might say the same about Rust, and anything else which is fundamentally built around a “string” type that’s actually a sequence-of-encoded-bytes.
In Python you may slice in the middle of a grapheme. But you will never accidentally slice in the middle of a code point and end up with something that isn’t Unicode – the string APIs Python exposes simply don’t allow you to do that (the bytes type would, of course, let you do that, but the whole point is that bytes is not str).
The absolute ideal, of course, is a language where the string abstraction is a sequence of graphemes. Swift does that. But the second-best in my opinion is a language where the string abstraction is a sequence of code points. Far, far, far to the rear of that in last place is a language where the “string” isn’t a string at all but a sequence of bytes. It’s hard enough already to handle the complexity of Unicode; having to do it while also keeping track of the raw bytes and the way they map to higher-level Unicode abstractions is just too much to pile on.
I know Rust doesn’t take this view (and probably 95% of my complaints would go away if Rust just called it UTF8Bytes instead of using string-y names), but the more I see of how people mess up just Unicode itself, the less I want to give them yet other layers to mess up while they’re at it.
But you will never accidentally slice in the middle of a code point and end up with something that isn’t Unicode
You don’t in Rust either. As I said, if your slice offsets split a codepoint, then the slicing operation panics. (And there are fallible APIs for when you don’t want a panic.)
Far, far, far to the rear of that in last place is a language where the “string” isn’t a string at all but a sequence of bytes. It’s hard enough already to handle the complexity of Unicode; having to do it while also keeping track of the raw bytes and the way they map to higher-level Unicode abstractions is just too much to pile on.
To be honest, I don’t know what you’re talking about. You don’t have to keep track of anything. I’ve implemented many of Unicode’s algorithms in Rust, which in turn are used to provide nice higher level APIs.
It is plausible that a sequence of graphemes is the nicest representation in theory, but one of the nice properties about using UTF-8 directly is that there is very little cost associated with it. And at least with respect to the codepoint representation, you can write code that deals with exactly that encoding and only that encoding. (If Python strings use one of three possible representations and you need to implement some lower level primitive on it, then it seems to me like you either need to make it work over all of its possible representations or you need to pay to force the string into a particular representation.)
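As a rough illustration of the “very little cost” point (a sketch, not from the thread): because the representation is guaranteed UTF-8, a low-level routine can work directly on the bytes, and any byte offset of an ASCII delimiter is automatically a valid place to slice, since ASCII bytes never occur inside a multi-byte UTF-8 sequence.

fn main() {
    let s = "naïve café";
    // Search the raw bytes for an ASCII space; no decoding step required.
    let pos = s.as_bytes().iter().position(|&b| b == b' ').unwrap();
    // The offset found on the bytes is guaranteed to be a char boundary.
    assert_eq!(&s[..pos], "naïve");
}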
I know Rust doesn’t take this view (and probably 95% of my complaints would go away if Rust just called it UTF8Bytes instead of using string-y names), but the more I see of how people mess up just Unicode itself, the less I want to give them yet other layers to mess up while they’re at it.
I cannot really fathom how most of your complaints would go away by just changing the name. A string by any other name is, still, just a string.
My view comes down to the simple observation that bytes are not strings. Forcing the “string” type to actually be a byte sequence comes with a whole pile of issues. For example, you mention that:
in Rust, string[i] is not defined, but string[m..n] is, where m and n are byte offsets into the UTF-8 encoded bytes that make up string. If m or n fall on an invalid UTF-8 boundary or are otherwise out of bounds, then the operation panics.
This only happens when the “string” type is actually a sequence of things that aren’t some atomic unit of Unicode, such as when the “string” is really a sequence of encoded bytes.
Python used to do this – the old Python 2 str type was actually a byte sequence, and even up to 3.3 it was more often a sequence of UTF-16 code units, and that was a miserable approach for writing many types of applications, because all that does is add yet another layer of complexity on top of what you already need to know just to successfully work with Unicode. For example:
I’ve implemented many of Unicode’s algorithms in Rust, which in turn are used to provide nice higher level APIs.
In my ideal string implementation, the language or at the very least its standard library would provide sufficient implementation of Unicode, its abstractions, and its algorithms that implementing yourself outside of that core support wouldn’t be an idea that would occur to anyone, let alone something you actually might have to do.
And again it comes back to the fact that a byte sequence is not a string. A byte sequence doesn’t know the semantics of a string. And a byte sequence is very difficult to bolt the semantics onto, and even more difficult to bolt ergonomic, natural-feeling string APIs onto.
I know Rust has decided to go with byte sequences anyway. I know it’s done for reasons that, to the folks behind Rust, seem perfectly reasonable and sync up with their goals for the language. But in the world where I work (primarily web apps, a world Rust seemingly would like to expand into at some point), we spent most of this century so far demonstrating at length why byte sequences are not strings and are a poor substitute for them.
I cannot really fathom how most of your complaints would go away by just changing the name. A string by any other name is, still, just a string.
Many of my complaints would go away in the same sense that many of my complaints about Python’s byte-sequence type went away when it stopped being called str and started being called bytes, and stopped carrying expectations that it could take the place of a real string type.
If Rust just said “we have UTF8Bytes, and implementing string operations and APIs is an exercise left to the reader, or to the first person to do a good crate for it”, I’d be less down on it. But Rust insists on calling it a string type when it very plainly isn’t.
My view comes down to the simple observation that bytes are not strings.
Very strongly disagree. You can be prescriptive about this, but it’s descriptively not true. Many programming languages define “string” as some variant of a bundle of bytes. Sometimes it’s “arbitrary non-NUL bytes.” Sometimes it’s “arbitrary bytes, but conventionally UTF-8.” Sometimes it’s “UTF-8 validity required.” But the point is that this representation is exposed, and these things are called strings. You saying they aren’t just makes things confusing.
You could make an argument that they shouldn’t be strings, and I would say that I have no interest in that argument because it’s just chasing tails around the maypole at this point. But to say that any of the above things factually are not strings is just in contradiction with common vernacular.
This only happens when the “string” type is actually a sequence of things that aren’t some atomic unit of Unicode, such as when the “string” is really a sequence of encoded bytes.
So?
Python used to do this – the old Python 2 str type was actually a byte sequence, and even up to 3.3 it was more often a sequence of UTF-16 code units, and that was a miserable approach for writing many types of applications, because all that does is add yet another layer of complexity on top of what you already need to know just to successfully work with Unicode.
There are many things that contributed to Python 2 strings being miserable. Python 3 strings have been pretty miserable for me too. But if I’m being honest, that just might be because Python is unityped. I don’t know. And I haven’t used Python in anger since it started getting type hints. I don’t know how effective they are. I probably won’t ever know.
In my ideal string implementation, the language or at the very least its standard library would provide sufficient implementation of Unicode, its abstractions, and its algorithms that implementing yourself outside of that core support wouldn’t be an idea that would occur to anyone, let alone something you actually might have to do.
I am one of the people doing that work. I maintain Rust’s regex library. I maintain crates that define string types. I maintain csv, aho-corasick, ucd-generate, and more.
And this is shifting the goalposts anyway. You can’t be all things to all people. Even if you provide standard abstractions to various Unicode-y things, lower level libraries are still likely to want to build their own to optimize certain access patterns.
And again it comes back to the fact that a byte sequence is not a string. A byte sequence doesn’t know the semantics of a string. And a byte sequence is very difficult to bolt the semantics onto, and even more difficult to bolt ergonomic, natural-feeling string APIs onto.
Go does it quite successfully. As does my bstr crate.
That doesn’t match my experience. When I ship a program that reads files on millions of developer systems, my program doesn’t know the encoding of that file. Discovering the encoding is a non-option because of performance. So what do I do? I treat the file as a single contiguous string of bytes. Doing anything else is a non-starter.
Despite this, my tooling (ripgrep) has very good Unicode support. Certainly better than GNU grep and most other grep tools out there.
we spent most of this century so far demonstrating at length why byte sequences are not strings and are a poor substitute for them.
I don’t think you have. Actually, I haven’t seen one good reason from you yet why they are bad.
Many of my complaints would go away in the same sense that many of my complaints about Python’s byte-sequence type went away when it stopped being called str and started being called bytes, and stopped carrying expectations that it could take the place of a real string type.
But Python 3 did a lot more than just rename some types. And the naming wasn’t really a matter of “incorrectly” calling byte sequences “strings,” but rather, having both a str and a unicode type. Now that’s weird. So fixing the names wasn’t just about moving off of byte sequences, but just making them holistically sensible. Moreover, Python 3 has made a lot of other changes with respect to string types in order to make it work. In this conversation, you’re acting as if all the problems boiled down to calling byte sequences strings, and once that was fixed, all was good. But that’s a pretty narrow and incomplete view of what happened.
If Rust just said “we have UTF8Bytes, and implementing string operations and APIs is an exercise left to the reader, or to the first person to do a good crate for it”, I’d be less down on it. But Rust insists on calling it a string type when it very plainly isn’t.
Again, you’re conflating is and ought. You’re more than welcome to advocate for (many) programming languages to stop using the word “string” unless it meets your very specific representational requirements. But I don’t think you’ll get very far.
My experience is that the representation doesn’t make a whole lot of difference to whether something can be effectively used as a textual type or not. You’ve given literally zero examples of something useful that a Python string can do that a Rust string cannot. The only possible answer to that is that a Python string has constant time index/slicing by codepoint, but as I’ve covered already, this is almost never what you want and makes it easy to split grapheme clusters.
Thanks for the detailed reply. For context on my end, I work on a multi-lingual Python app, and Py2 -> Py3 really helped to cement my belief that Python is “right” on this front. I recently also found an issue in an OCaml application that was basically linked to its (in my view, faulty) treatment of text as “a string of bytes”, so I believe pretty strongly in a structure having some sort of guarantee about textual validity.
In Python you have str (that I end up calling text in my head) and bytes. Your text is “textual input”, for the common understanding of that. It abstracts away the encoding details etc. (a common source of bugs), and yeah, you’re not working with bytes! You want bytes? You gotta explicitly encode it. Thanks to that, any “text”-y operations tend to require you to deal with encodings seriously.
I feel like Rust tries this somewhat with String’s “this is always valid utf-8” stuff. After all, if String were just a bag of bytes, then you wouldn’t say something like this. But it ends up exposing byte representations through slicing anyways! And yeah, you get fallible APIs, but Python isn’t being inconsistent. It has slicing semantics for text that give text. Rust doesn’t have slicing semantics for text, it has byte slicing.
In my ideal world (and, y’know, just me, with my specific wants and things I do/don’t care about), we would have ByteSequence and Text, so you could get String-like APIs (I get that String is a Vec wrapper, but that’s also the problem IMO), but you could differentiate between “I actually don’t care what’s in this data/I want the raw bytes” vs “This is meant to be consumed as text” (for various definitions). That way you would be forced to establish where text-ness is relevant.
I think that Rust’s current str/String semantics try too hard to support emojis out of the box (glib for “nice multilingual text support” :) ), while also trying to be a mostly-drop-in replacement for char*. If we were splitting things up into separate types (with the right impl coverage to “cover bases” as needed) I think there would be a lot less confusion, without having to have the philosophical argument about string slicing over and over and over again. Like if you want to work with bytes, work with bytes! (The one dilemma would be that Text would either have to do something like Python’s fixed-width normalization, or it would have O(n) slicing… but at least when I gotta break out decode/encode in Python, it’s explicit and I don’t have much trouble answering what I need)
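To make that trade-off concrete, a hypothetical sketch (the name and exact semantics are made up, not a real API): Python-style codepoint slicing over a UTF-8 str has to walk the string to find byte offsets.

// Hypothetical helper: slice by codepoint positions, clamping out-of-range
// positions the way Python slices do. Finding the byte offsets is O(n).
// This sketch assumes start <= end.
fn slice_codepoints(s: &str, start: usize, end: usize) -> &str {
    let pos = |n: usize| s.char_indices().nth(n).map(|(i, _)| i).unwrap_or(s.len());
    &s[pos(start)..pos(end)]
}

fn main() {
    let secret = "héllo_wörld";
    assert_eq!(slice_codepoints(secret, 0, 3), "hél");
    assert_eq!(slice_codepoints(secret, 3, 100), "lo_wörld"); // clamps, like Python
}

Every call pays a scan of the string up to the slice point, which is exactly the cost Python avoids by fixing the width of its code units per string.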
EDIT: just another note on the “wrong thing” of splitting graphemes… I think you’re not wrong, and I’ve generally not had issues (despite dealing with text that does have potential grapheme-splitting issues), and I think a big part of it is very good text APIs for splitting etc., so I’m not doing index math in the first place. I’d still want a non-String alternative, but it’s a bit hard for me to say that split graphemes are “right”.
I will say that I feel like this conversation has gotten twisted into something I did not intend. I guess I should have left out the bit about “Python making the wrong thing easy.” I guess to me, it seems like a pretty obvious fact: it exposes natural codepoint indexing and slicing APIs, and when you use those without awareness of graphemes, those APIs will let you split a grapheme and wind up with garbled text.
I guess to me, saying that felt like a fact. But based on the replies I got from you and @ubernostrum, I think it came off as a more general claim about which string is “better” and thus a debate ensues.
My previous comment was meant more for educational means, to fill in gaps I saw. Not to spawn off a “my language’s string type is better!” debate.
You also have to realize that we have some really intense biases at play here, even aside from language preference. For example, I work on low level text primitives, and have even built a string library that specifically works on raw bytes, without the UTF-8 invariant. I also work on CLI tools that use these primitives and their primary goal is to be fast. In these contexts, I love that Rust’s string type is just a bag of bytes, because it means there’s no extra steps or indirection required for the implementation of the lower level primitives. In other words, the way Rust’s string is defined makes it easy to not only write fast code, but correct code.
But that’s only one angle at which to look at string APIs. There are other angles and use cases and domains.
(The one dilemma would be that Text would either have to do something like Python’s fixed-width normalization, or it would have O(n) slicing… but at least when I gotta break out decode/encode in Python, it’s explicit and I don’t have much trouble answering what I need)
So maybe we just have a more fundamental disagreement. It sounds to me like you specifically want “index/slice by codepoint” APIs, whereas I think we shouldn’t.
But yes, exactly, you’re getting at the heart of the matter with the performance implications here.
Sorry, don’t mean for this to be a flame war (talking about this helps me think about it too), I think like @ubernostrum if we had multiple string types I’d have less gripes.
But ultimately it’s not the worst thing (at least it’s not bag of ASCII right?) and I like the language! And… yeah, I think “fast and correct” is definitely a rust philosophy. And hey, I can use both languages for various things!
A very literal formulation:
let input = "this_is_my_input";
let mut output = String::new();
output.push_str(&input[..4]);
output.push_str(&"*".repeat(input.len()-4));
assert!(output == "this************");
But this is prone to error. What if the input only has three characters? What if there’s a multi-byte unicode character spanning the 4th and 5th bytes? We’ll get a runtime panic. Since the slice is over the bytes, do we want multi-byte characters to become one * or multiple *?
I much prefer the original formulation in so far as it concisely expresses the intent of the transformation as a map on the characters in the string. Every operation feels essential and cannot fail.
I mean, none of the operations you listed can fail either, when you have slicing semantics that accept “running off the edge”. Just a different way of looking at stuff.
I guess I’m not a fan of “treat strings like a bunch of characters to walk over” in this specific case. It feels too much like the busywork of trying to solve for the most efficient implementation of something (that potentially doesn’t matter in the first place). Just my opinion tho.
I know it’s not 100% on topic, but it’s still a great conversation point. I am not an expert Rustacean (more of a wannabe at this stage), so I also would love to know if there is any terser way to do that same string manipulation…
If you know your secret is ASCII—and you might actually in this case—then you can validate it as such and then make ASCII encoding assumptions later in your code. An example of an ASCII assumption would be that every character is exactly one byte, and thus, you could use &self.secret[..3] to get the first three characters.
If you cannot make an ASCII assumption and secret really can be any arbitrary UTF-8 encoded text, then chars() is pretty good, but it would probably be even better to use graphemes.
The other interesting bit here is that @rtpg’s solution will throw an exception if len(secret) <= 2, while your code will not, but will instead display the entirety of the secret. My guess is that len(secret) <= 2 can’t actually happen.
My suspicion is that in practice, there is a particular format that you might expect secret to be in. (Perhaps a certain length, a minimum length, all hex digits, etc etc.) If that’s true, then I would make construction of Credentials fallible, and return an error when secret (and perhaps also api_key) doesn’t satisfy your assumptions. Later on, you can then use those assumptions in various ways to make code simpler or faster (not that the latter is needed here).
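A rough sketch of what that fallible construction could look like (the field names, error cases, and the minimum length here are assumptions for illustration, not something from the article):

struct Credentials {
    api_key: String,
    secret: String,
}

#[derive(Debug)]
enum CredentialsError {
    SecretTooShort,
    SecretNotAscii,
}

impl Credentials {
    // Hypothetical constructor: validate the assumed format up front so the
    // rest of the code can rely on it (e.g. byte slicing into `secret`).
    fn new(api_key: String, secret: String) -> Result<Self, CredentialsError> {
        if secret.len() < 8 {
            return Err(CredentialsError::SecretTooShort);
        }
        if !secret.is_ascii() {
            return Err(CredentialsError::SecretNotAscii);
        }
        Ok(Credentials { api_key, secret })
    }

    fn obfuscated_secret(&self) -> String {
        // Safe because `new` guaranteed ASCII and a minimum length.
        format!("{}{}", &self.secret[..4], "*".repeat(self.secret.len() - 4))
    }
}

fn main() {
    let creds = Credentials::new("key".into(), "s3cr3tvalue".into()).unwrap();
    assert_eq!(creds.obfuscated_secret(), "s3cr*******");
    let _ = &creds.api_key; // silence the unused-field warning in this sketch
}

With the constraints enforced at construction time, the display code no longer needs to defend against short or non-ASCII secrets.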
With all that said, I think the way you wrote it is just fine. Secrets are unlikely to be short, and secrets are unlikely to contain multi-codepoint graphemes. And if they do, the worst that will happen is that your obfuscated output will become slightly more obfuscated.
In Python this doesn’t throw an exception, because “string * negative number” gives the empty string (and slicing allows for out of bounds stuff). It’s the semantics that “just work” for common string munging use cases
I agree with your general sentiment about a defined credential shape. Not to mention that in practice you probably wouldn’t even need to have more than the first three characters (does the number of stars really need to align with the underlying secret?) This is definitely a problem where you would find something decent if you had more context.
Oh interesting. I’ve been writing Python for over a decade and I just assumed that providing an invalid end bound to a slice would fail. e.g., I assumed that 'ab'[:3] would throw an exception. TIL.
(Although, I have not written much Python in the past few years, so perhaps I knew this at one point and my knowledge has atrophied.)
Hey, thanks to both for the great suggestions (I didn’t know about graphemes and it’s great learning!).
For the sake of the article, I just wanted to provide a simple example, so I didn’t spend too much time thinking about the shape of the data and the validation constraints. I think I’ll keep the example as it is, but nonetheless, I loved seeing how you reasoned about the data to determine constraints and how those would affect the code.
Well, thanks again :)
Am I the only one running IPsec/L2TP?
I do so for three reasons: the server software comes preinstalled on my gateway (Mikrotik RouterOS), client software is included with iOS/macOS/Android/Windows, and AFAIK it is secure (please let me know if not).
I’ve looked into Wireguard and I want to try it, but I don’t like my VPN server running on a host inside the network itself, which is much more likely to go offline and lock me out of the network, as opposed to it running on the very gateway to the network.
Any thoughts? I don’t have strong opinions regarding VPNs. Keep in mind I use them both for traffic encryption and for access to my network’s internal services.
Your setup may or may not be secure. No one can really say without looking at it in detail, because the configuration for IPsec is pretty complex. Worse, the protocol complexity induces complex client/server software which is prone to hard-to-spot implementation mistakes.
This is one of the main reasons I try to push people to Wireguard, where there are no security-relevant config options and the code base is very small. IIRC, Wireguard is about 4,000 lines of code vs ~400,000 for an IPsec implementation.
As a quick example, CVE-2017-6297 was a bug in MikroTik’s L2TP client where IPsec encryption was disabled after a reboot. In general, I am quite sceptical of the security of dedicated devices like routers. They have fewer ‘good’ eyes on them due to the relative difficulty of pulling apart their hardware/firmware/closed-source software, and yet their uniformity makes them an attractive target for well-resourced attackers.
L2TP/IPsec can be problematic with hotel wifi and other braindead networks. Not even NAT-T and IKEv2 always help. OpenVPN will cheerfully work even with double (or quadruple) NAT. Nothing against Wireguard, but I didn’t find it nearly as easy to manage and unproblematic as OpenVPN, especially when performance is not a big concern.
I wonder if the future is self-hosted VDI rather than VPN. It’s convenient for use on the road (just reconnect to a session), and much harder to ban, regulate, or persecute people for in countries with censorship.
The first important question to think about is whether the country you live in might one day persecute individuals who use VPNs / censorship avoidance software. If this is a risk, none of the solutions above are effective*, nor are Wireguard/OpenVPN/IPSec. The only potentially viable option is Tor, configured with a bridge and a good pluggable transport and even that is not without risk.
If persecution is not a risk, I would highly recommend Wireguard. Its entry in your table would be: Yes, Android/iOS, Everything, UDP. Additionally, it is exceedingly simple to configure (as easy as SSH, if not easier), it has first-class security, and it is extremely performant.
*All of these solutions either transmit a protocol identifier in the clear (e.g. a magic number) or can easily be detected from basic statistical analysis when used as a VPN (SSH).
Former VPN engineer here…
Tor invests a lot into obfuscation as one prong of its circumvention around censorship. I highly recommend checking into obfsproxy and in particular obfs4.
That said, recommending a circumvention technology is also impossible without knowing context. @nalzok, given your Github username, are you in China?
If I’m correct, I’m surprised dsvpn is working well for you. Most ISPs have switched to ratelimiting encrypted traffic to any IP outside the firewall; the last time I was there, only a few landbased fiber ISPs hadn’t updated their tech yet. Furthermore, 10/1 was a huge firestorm for VPN blocks.
Please be careful!
A steganographic VPN could possibly work too; I’ve seen some on GitHub before, but can’t find any right now. Using a Tor bridge uses some steganography, but I don’t know how advanced it is. IIRC it just tunnels through HTTPS.
In general, I would say this sort of content is not the kind of content that makes it to books. You’ll probably have more luck by:
Well, actually this is what I have done so far.
I was looking for something more solid to “verify” and improve my current design and fix obvious issues that my ignorance hides from me. Indeed, I found Sharp’s idea of using CSP to formally define and verify protocol designs very interesting (my first thought was “I guess that nobody ever tried to verify HTTP! :-D”, but actually I don’t know if any of the mainstream protocols I cited have been formally verified).
Nevertheless, please share any specific paper (or insightful rant) that belongs to these categories.
If you are interested in formal verification of network protocols, I’d definitely recommend checking out the Tamarin Prover. It’s been used to formally verify protocols like TLSv1.3, DNP3 (used in SCADA systems) etc. David Wong did a video introduction if you prefer a guided tour.
One of the authors of Tamarin also co-authored a book on the formal verification of network protocols. It’s focused more on the theory than on engineering, and gives a reasonably comprehensive overview of the area.
Bias disclosure: I’m doing a PhD under the supervision of one of the authors.
It helps to go where prior efforts went, to leverage whatever they can teach you. TLA+ is great for protocols, but most work on it is about model-checking. That’s only partial verification, up to a certain number of interactions or states. Other tools are used more often for full verification. What you can do is start with TLA+ to see the benefits without investing tons of time, before the bigger investment of learning formal proof. For the same reason, one can use TLA+ on a protocol design first to get a quick check on some properties before proving it. There’s enough data in by now that I recommend specification-based testing, model-checking, and fuzzing of anything before proving it. Those quickly catch lots of problems, with proof catching the deepest ones, where the state space would’ve just never let you get to them with other methods.
The sad truth is that all I’m currently able to do is to integrate the protocol in my kernel and write some server for it.
Unfortunately, while writing a server is relatively simple, integrating the protocol into the kernel is a pretty complex task, probably the most complex refactoring I have faced so far, because while the 9P2000 protocol handling is pretty well isolated in a Plan 9 kernel, the underlying model is strongly tied into… everything.
One could write an alternative test client that is not a kernel (and that’s probably what I will do, actually), but… at some point I’ll have to integrate it into Jehanne’s kernel anyway, and I’m a bit afraid of what issues a testing client could hide.
That’s why I’d like to somehow “check” the design before diving into such a C-refactoring nightmare…
Look into state machine modeling and Design-by-Contract. Put in pre-conditions and optional post-condition checks on each of the modules. You do that on protocols, too. Feed them data generated specifically to test those checks, and data from fuzzers. Any errors are going to be inputs causing bad state or interactions. That’s the easy approach. DbC is mostly assertions, btw, if you don’t want to learn state-based modelling.
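A tiny sketch of that assertion-flavored style in Rust, on a made-up message type (not the 9P2000 wire format): check pre-conditions on the way in, assert post-conditions on the way out, and let fuzzed inputs trip them.

// Hypothetical message, only to illustrate Design-by-Contract-style checks.
struct Message {
    tag: u16,
    payload: Vec<u8>,
}

const MAX_PAYLOAD: usize = 8192; // assumed limit for this sketch

fn encode(msg: &Message) -> Vec<u8> {
    // Pre-condition: the payload respects the assumed protocol limit.
    assert!(msg.payload.len() <= MAX_PAYLOAD, "payload too large");

    let mut out = Vec::with_capacity(2 + 4 + msg.payload.len());
    out.extend_from_slice(&msg.tag.to_le_bytes());
    out.extend_from_slice(&(msg.payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&msg.payload);

    // Post-condition: the encoded length matches what the header implies.
    debug_assert_eq!(out.len(), 2 + 4 + msg.payload.len());
    out
}

fn main() {
    let msg = Message { tag: 1, payload: b"hello".to_vec() };
    assert_eq!(encode(&msg).len(), 11);
}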
I like the idea, although the field “Disclosure: Full/Partial/None” is too open to interpretation to be useful, I think.
Another issue is that whilst the RFC requests that the PGP key be served over HTTPS, it doesn’t specify that security.txt itself also be served over HTTPS. This makes serving the PGP key over HTTPS irrelevant, as an attacker could MiTM and specify a different location for the file.
HTTPS is useless if the attacker can MITM the server, since the attacker can just obtain a LetsEncrypt certificate.
If you are worried about an attacker MITMing you, then you should try from multiple locations.
SEO shitbags rank with email spammers as the absolute lowest pigshit dirtfuck dregs of humanity.
Is this really the standard of article we want to see here?
The author seems pretty ill informed as well:
If people don’t want to see my site with random trash inserted into it, they can choose not to access it through broken and/or compromised networks.
Earlier you recommended letsencrypt, and now suddenly you want me to pick a competent certificate authority
The author seems pretty ill informed as well:
Reposting the “ill informed” opinions without refutation or explanation doesn’t really have much value.
Is this really the standard of article we want to see here?
You’re new…maybe wait a bit and contribute more before hand-wringing. :)
Reposting the “ill informed” opinions without refutation or explanation doesn’t really have much value.
I included two quotes from the article to explain my point and I think they speak for themselves. Misquoting someone has negative value.
Great article!
In our eternal quest to fix the long tail of security issues, I think the next step after WebAuthn is to get more eyeballs on the TLS certificate racket (since it sounds like WebAuthn depends on it working). Which I want to rant about:
A bunch of old companies who were early movers got their root certs in lots of devices, and now we depend on this opaque list of dozens of random companies. And now OSes and browsers are making it difficult or impossible to change that list or even use self-signed certs.
For example, the vast majority of Python applications use certifi as the root store, which last I checked pulls all the certs over HTTPS from Firefox’s Mercurial instance, through an ancient Ubuntu box, into billions of clients!
https://github.com/certifi/python-certifi/blob/939a28ffc57b1613770f572b584745c7b6d43e7d/Makefile#L2
https://github.com/Lukasa/mkcert/blob/5e1e522b8dcafc3829fda1b8a1a09332b48b8798/main.go#L30
To clarify I don’t think certifi’s process is necessarily broken/vulnerable, just overly complex and fragile given the criticality. And really my main point is that this is the root cert handling process that we do see. Imagine the shitshow going on inside some of these ancient certificate authorities.
Oh that’s a fascinating piece of crucial internet security infrastructure I was unaware of.
Which bit is the “ancient Ubuntu box” - is that hg.mozilla.org?
Mozilla takes the security of its source code pretty seriously, e.g. they recently launched cargo-vet. I guess the Ubuntu box is https://mkcert.org/, which looks like it’s volunteer-operated.
Wow, the perfect https://xkcd.com/2347/
The machine hosting mkcert.org is the Ubuntu box, at least last I checked like a year ago. It’s a digitalocean IP and I found some other metadata (can’t remember where) that revealed an older LTS version. That may have changed though.