I maintain one github action, and testing it is a nightmare for the reasons outlined in “debugging like I’m 15 again”. I can’t run it locally. My action queries github to get state of various repos, and I can’t mock that either, so testing is pretty much manual with eyeballing what could be wrong. The whole ecosystem could be made so much better…
Yeah, but the golden path includes stuff written in Go because cross compilation is GOOS=linux GOARCH=arm64 go build. If I make my own linux implementation like Gokrazy, it’s gonna support cross compilation of C/C++ stuff with Zig.
Is there a tool for detecting if this affects you? I don’t write Go, but I imagine that I’d want to immediately apply this to all my code, but also it’s too hard to verify that it won’t cause a subtle yet significant bug somewhere in 30K lines of code.
The best Go projects use a metalinter to run 100s of various code checks. Mine will fail the test suite if there’s a lint error and I’m guessing the loopclosure lint is in there somewhere.
Is there a tool for detecting if this affects you?
from the article:
Tools have been written to identify these mistakes, but it is hard to analyze whether references to a variable outlive its iteration or not. These tools must choose between false negatives and false positives. The loopclosure analyzer used by go vet and gopls opts for false negatives, only reporting when it is sure there is a problem but missing others. Other checkers opt for false positives, accusing correct code of being incorrect. We ran an analysis of commits adding x := x lines in open-source Go code, expecting to find bug fixes. Instead we found many unnecessary lines being added, suggesting instead that popular checkers have significant false positive rates, but developers add the lines anyway to keep the checkers happy.
I maintain a daily job at work that builds go from tip and runs all unit tests against the latest go head. Today I added GOEXPERIMENT=loopvar to it; thankfully(?) no tests failed. We did have a host of these capture problems, some bugs, some not really of consequence, at some point that were discovered by the recently added vet lint.
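To make the capture problem concrete, here is a minimal sketch of the classic bug (my own example, not taken from the article or from any of the test suites mentioned above):

    package main

    import (
    	"fmt"
    	"sync"
    )

    func main() {
    	var wg sync.WaitGroup
    	for _, v := range []string{"a", "b", "c"} {
    		// v := v // the classic workaround: shadow the loop variable per iteration
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			// With the old semantics all three goroutines share one v,
    			// so this frequently prints the last element three times.
    			fmt.Println(v)
    		}()
    	}
    	wg.Wait()
    }

Under GOEXPERIMENT=loopvar each iteration gets its own v, so the shadowing line becomes unnecessary – which is exactly the x := x cleanup the article's analysis is about.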
OT: I have recently fallen in love with gnuplot + kitty. I am still figuring out if and how to wrap my config with a Python module to allow for simpler notebook-style coding. I kinda dislike matplotlib interface, but I have successfully convinced it to plot to kitty as well. Anyway, it’s pretty magical to see crispy plots inside a teletype emulator.
“kitty” is a hard term to google but I take it you mean the terminal. Does look interesting, unlike many of the gpu-accelerated terminals which sell on speed (gnome-terminal seems fast enough to me…) this one also has some interesting feature ideas. Nice.
I dislike matplotlib’s interface too. I find it very idiosyncratic and unintuitive. I also dislike the fact that the image buffer is implicit global state. The data mungers at my workplace (a group of which I am not a part) prefer plotly and Tableau (which is an Enterprise thing). I do tend to use either matplotlib or gnuplot, depending on how complicated the graph needs to be.
It sounded like a “common usage”, and I go to urban dictionary for common terms. TBH I’ve never heard of any of the usages that other people posted here (except the place in LA, which it clearly isn’t referring to).
What’s clear is that the definition is not clear to anyone. And for that, including it here was a mistake given the unknown size of the audience and the fact that no one agrees what it means.
Urban Dictionary is full of bogus definitions. Look up a random noun (e.g. “candle”) and chances are the top definition is some sexual slang that no one has ever heard of, let alone would snigger at when someone uses it in the common sense.
Tar-pits around the world are great dig sites, because animals get stuck in them and die, but are preserved by the tar. The one in LA is one of the most (if not the most) significant such sites.
Was there a period when paleontology was writ larger in the collective consciousness? I feel like it was a big thing when I was a kid around 1995.
Jurassic Park was a big thing back then and got people interested in that kinda stuff. It’s actually really interesting how you can map pop culture stuff to the kinds of careers people want to get into later in life.
So many gotchas in this area. Just for starters people are frequently surprised by the fact that adding a single null to a series of ints turns it into a series of floats (with a NaN). And that really is just the first of umpteen surprises that lie in store in pandas’ type system.
And sometimes we can add a single null inadvertently, like with shift or diff functions. There are so many footguns with pandas, that I always have to double-check the results.
Gogit doesn’t even support the index (staging area), so instead of gogit add, you just gogit commit with the list of paths you want to commit each time. As pygit’s code shows, dealing with the index is messy. It’s also unnecessary, and I wanted gogit to be an exercise in minimalism.
What a feature! Finally a stateless commit command! Believe it or not, I have been wanting this for a decade. Thank you for making this, so I wouldn’t have to!
I keep saying that the staging area is a stumbling block that shouldn’t be obligatory to use. It is trivial to squash commits, so it’s not like anyone really needs it to assemble a big commit (modulo self-imposed commit hooks). Rather, this hidden state is a constant source of surprise for us all, since what you commit is not what you see in git diff – if you added, moved, removed or checked out a file yesterday, and didn’t immediately commit or unstage it before you forgot about it, it’s now hidden state, and tends to sneak into the commit that you amend today.
It is trivial to squash commits, so it’s not like anyone really needs it to assemble a big commit
I actually end up using it mostly to assemble small commits: committing parts of my current state at a time when they’re ready, but without having to commit e.g. testing code I still need locally or other changes elsewhere in the file.
I usually abort the git add -p to modify a file slightly before committing, so I don’t do it in one step. I also like to git diff --staged before doing the commit, rather than committing and then amending several times to get it right.
It sounds like what you like about the staging area is that it gives you a notion of draft/private vs. final/public, is that right?
I use that distinction a bit less, or rather more: instead of one staging area in flux, I feel free to edit any commit that hasn’t been pushed/merged yet.
The draft/public distinction doesn’t require an index/staging area: Mercurial attaches it to commits, instead of to exactly one not-a-commit.
Specifically, Mercurial lets you mark commits as being in phase secret (won’t be pushed/pulled), draft (unpushed), or public (has been pushed/pulled).
[Edit because I submitted too soon:]
As for git diff --staged: See how the index concept forces Git to have a separate command? There is git diff for the working directory (but staged changes aren’t really in the working dir); and git log -p for any commit (but the index is not a real commit), so now you need a third invocation. If you sculpt the latest commit instead of an index, git log -p can run on HEAD no problem. (And in my daily work I do exactly this, many times per hour.)
For me, the difference is typically in when I want to remove some diff from my staged changes. It’s pretty easy to git reset the file and redo the git add. I’m not aware of a comparable “remove this file from the commit” command, so if I really need to do that, what I have to do is revert the commit, uncommit the revert-commit to get a negative diff of the commit, remove everything but the change I want to remove, then amend the commit with that change.
Splitting a commit is indeed a missing feature of git rebase, but git-revise has it, and there is a PR to support specifying paths while doing so (rather than answering yes/no to every hunk).
Before that, I also used to do the double-revert dance.
Tip: You can git commit --patch, just like git add --patch. Even with multiple paths, if you have several. Is that what you mean? There’s not much git add can do that git commit can’t, and it saves a step. Only one thing – committing previously untracked files – which Gogit has now solved.
I also use --amend a lot, or commit many small commits and then squash the related ones. Or if I have no idea to begin with, I just make one big commit at the end and iteratively split out pieces of it with git revise. Which is just as easy as staging out the same small commit from the same set of unstaged changes, with the benefit that it also works with commits that aren’t the last.
I don’t find it that big of an issue. Perhaps it is because I learned to do git status to orient myself, much like sysadmins habitually type ls like muscle memory.
(well, I have alias gs="git status", so it’s not as much typing as you’d think)
Most certainly. I never (intentionally) use the staging area, which is why I have no use for git status. Other than to assert that nothing is staged before I commit. I do that if I’m unsure, but it’s not a habit, which is how it surprises me again and again.
The ls thing is a reason that you shouldn’t actually use the command line for managing novel filespaces. It’s only good for filespaces you already roughly know.
What would be the benefit of BlazingMQ over Kafka? I don’t have much experience with either, but in casual use of Kafka at my company, it seems rather fast.
I don’t know specifics, but having friends in the investment world, my guess is Bloomberg wanted more specific guarantees of stability and speed for the Bloomberg Terminal. I have no knowledge that says that is where this is going to be used, but reading the description immediately made me think of that product (which they label under their “Give our clients an edge” section here):
We provide unparalleled 24/7 service to Terminal users all over the world, across multiple industries. We ensure our users have a seamless experience from the moment they install the Terminal, to learning how to leverage the power of the incredible Bloomberg tool on their desk every day.
Edit: And actually, they have a page devoted to that topic here. Again, they don’t specifically call out where they might use it, but again as a nod to the Terminal application, they list these places it will shine:
High fan-out ratio with high data volume (sounds like the Terminal)
Mixed workload: different delivery guarantees for different data streams (which explicitly calls out the different needs between financial trade processing and other data)
This links to a two-message-long “discussion” on Discourse, which links to a repository with an almost empty README file. Can you at least give some tl;dr or example? Otherwise, it’s a contentless post that I have to flag.
Sorry about the indirection, the main file to look at is flake.nix. There is not much documentation since this is just an experiment about Nix flakes, mostly for fun
Nix shell is IMO the only solution which comes close to sane push-button development environments. One command to install Nix itself (answer “yes” to all the prompts), then one command, nix-shell, to get a working environment with all the software, no matter what else is already installed on your machine (examples). As a bonus, since nix-shell --run COMMAND installs and runs in a single command, you can usually just run the CI build/test/install commands verbatim on your local machine, and expect them to work.
/rant The Python tools themselves change too often and too randomly to ever learn something that will work across projects and is still valid in five years. Last I checked pip still couldn’t create lockfiles with signatures, it’s got no way to tell which lockfile entries depend on which other ones (leading to ugly stuff like orphan dependencies over time and a majority of Dependabot/Renovate PRs breaking the build), and is just generally unusable.
I remember having so many frustrations with having a consistent Python environment on both my work-supplied MacBook and on a Linux server and none of the aforementioned tools met my needs, and then I replaced everything with a Nix shell and had no issues after that point. It was a minor pain to set up but entirely worth it.
Most documentation coming out now seems to be flake-centric and it’s really not that messy. I find it hard to invest in a thing that’s already clearly not going to be the thing.
a thing that’s already clearly not going to be the thing.
At least some parts of the community (other than myself) have a very different view of the situation. I’m not going to name names, but check out the public community forums.
I’m tracking things relatively closely. Where would I see the alternative viewpoint/proposal?
Isn’t it fairly clear already if you see that many people publishing documentation around Nix are taking flakes as the starting point? They don’t seem to be aware that there is an alternative or they don’t think it’s a particularly good alternative.
I somehow find comfort when I read similar articles and they contain a section where the author outlines when they reached out to the company, how the company treated their submission, and concludes that the fix has been applied. This article is missing that.
“… but the servers have now been patched to make sure both seats don’t have the same account and JWT for matchmaking games” From this I gather they were responsive and, I assume, appreciative.
I initially thought this is mach-nix resurfaced, then realized this is only for python versions, not pypi packages. What is the perceived benefit of updating this repo every hour as opposed to every day?
Thanks. I have working IPv6 here but the v6 version doesn’t work for me - I get connection refused errors. It looks as if you have a valid AAAA record (and no A record), so my browser is trying to connect via v6, but failing.
It does look like something might not be configured right..
9:18:25 brd@m:~> curl -I https://cyrnel.net/solarwinds-hack-lessons-learned/
curl: (7) Failed to connect to cyrnel.net port 443 after 72 ms: Connection refused

9:18:27 brd@m:~> ping -6 cyrnel.net
PING6(56=40+8+8 bytes) 2602:b8:xxxxxx --> 2603:6081:ae40:160::abcd
16 bytes from 2603:6081:ae40:160::abcd, icmp_seq=0 hlim=52 time=69.674 ms
16 bytes from 2603:6081:ae40:160::abcd, icmp_seq=1 hlim=52 time=72.403 ms
16 bytes from 2603:6081:ae40:160::abcd, icmp_seq=2 hlim=52 time=72.432 ms
16 bytes from 2603:6081:ae40:160::abcd, icmp_seq=3 hlim=52 time=71.435 ms

HTH, HAND!
I’ve had a draft post sitting in a folder for years that’s titled “yes it is wrong”, making the argument that for many people who just need to work with text, code units are the least useful possible abstraction to use.
They come with at least all the downsides of code points (you can accidentally cut in the middle of a grapheme cluster)
Plus the downside of being able to cut in the middle of a code point and produce something that’s no longer legal Unicode
Plus if it’s UTF-16 code units, as in quite a few languages, it’s coupled to the weird historical quirks of that encoding, including things like surrogate pairs
And in general, I think high-level languages should not be exposing bytes or bytes-oriented abstractions for text to begin with. There are cases for doing it in low-level languages where converting to a more useful abstraction might have too much overhead for the desired performance characteristics, but for high-level languages it should be code points or graphemes, period.
And don’t get me started on the people who think operations like length and iteration should just be outright forbidden. Those operations are too necessary and useful for huge numbers of real-world tasks (like, say, basically all input validation in every web app ever, since all data arriving to a web application is initially in string form).
(and one final note – I think 5 is a perfectly cromulent length to return for that emoji, because it makes people learn how emoji work, just as a flag emoji should return a length > 1 to show people that it’s really composed from a country code, etc.)
Graphemes are the closest thing to how humans (or at least the majority of humans who are not Unicode experts) normally/“naturally”/informally think about text. So in an ideal world, the most humane user interface would state “max length n” and mean n grapheme clusters.
For some Western European scripts, code points also get kinda-sorta-close-ish to that, and are already the atomic units of Unicode, so also are an acceptable way to do things (and in fact some web standards are defined in terms of code point length and code point indexing, while others unfortunately have JavaScript baggage and are instead defined in terms of UTF-16 code units).
But why do you have a max length validation that is close to anything people would count themselves at all? In my experience, there’s two kinds of max length requirements: 1) dictated by outside factors, where you do what they do regardless of what you think about it or 2) to prevent people from like uploading movies as their username, in which case you can set it to something unreasonably high for the context (e.g. 1 kilobyte for a name) and then the exact details aren’t super important anyway.
But why do you have a max length validation that is close to anything people would count themselves at all?
Ask Twitter, which in the old days had a surprisingly complex way to account for the meaning of “140 characters”.
But more seriously: there are tons of data types out there in the wild with validation rules that involve length, indexing, etc., and it is more humane – when possible – to perform that validation using the same way of thinking about text as the user.
If you can’t see why “enter up to 10 characters” and then erroring and saying “haha actually we meant this weird tech-geek thing you’ve never heard of” is a bad experience for the user, I just don’t know what to say. And doing byte-oriented or code-unit-oriented validation leads to that experience much more often.
The real go-to example is the old meme of “deleting” the family members one by one when backspacing on an emoji. That shows the disconnect and also shows really bad UX.
I completely agree. To me, it’s about providing a string abstraction with a narrow waist. If you expose code units, then you are favouring a single encoding for no good reason. I want the string interface to make it easy for things above and below it.
Implementations of the interface should be able to optimise for storage space, for seek speed, for insertion speed, or any other constraint (or combination of constraints) that applies. Nothing using the interface should see anything other than performance changes when you change this representation. Exposing Unicode code points is a good idea here because every encoding has a simple transformation to and from code points. If you expose something like UTF-16 code units then a UTF-8 encoding has to expand and then recompress, which is more work.

Iteration interfaces should not assume contiguous storage (or even that consumers have raw access to the underlying storage), but they should permit fast paths where this is possible. ICU’s UText is a good example of doing this well (and would be better with a type system that allowed monomorphisation): it gives access to contiguous buffers that might be storage provided by the caller (and can be on stack) and callbacks to update the buffer. If your data is a single contiguous chunk, it’s very fast, if it isn’t then it gracefully decays.
Similarly, it should be possible to build abstractions like grapheme clusters, Unicode word breaking, and so on, on top without these implementations knowing anything about the underlying storage. If you need them, they should be available even for your custom string type.
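As a rough sketch of the kind of interface being described here (my own strawman in Go, not any existing library’s API):

    package strabs

    // Text exposes a string as a sequence of Unicode code points without
    // committing to any particular storage or encoding underneath.
    type Text interface {
    	// EachRune calls fn for every code point in order and stops early if
    	// fn returns false. No contiguous storage is assumed.
    	EachRune(fn func(r rune) bool)

    	// Chunk optionally exposes an already-decoded run of code points
    	// starting at off, so hot paths can skip the callback; ok is false
    	// when the implementation can't provide one cheaply.
    	Chunk(off int) (runes []rune, ok bool)
    }

Grapheme segmentation, word breaking, and so on can then be written once against EachRune, and a UTF-8, UTF-16, or rope-backed implementation only changes performance, not behaviour.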
Just to clarify since I quoted that article, I would say that for a shell, it IS wrong that len(facepalm) is 7, but it’s ALSO wrong that len(facepalm) is 5.
For a shell, it should be 17, and of course you can have other functions on top of UTF-8 to compute what you want. You can transform UTF-8 to an array of code points.
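To make those numbers concrete, here’s a small Go sketch (my own illustration, spelling out the facepalm-with-modifiers emoji as escapes):

    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	// U+1F926 facepalm + U+1F3FC skin tone + U+200D ZWJ + U+2642 male sign + U+FE0F variation selector
    	s := "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
    	fmt.Println(len(s))                    // 17 UTF-8 bytes
    	fmt.Println(utf8.RuneCountInString(s)) // 5 code points
    	fmt.Println(len([]rune(s)))            // 5 again, via an explicit code point array
    }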
One main reason is that the Unix file system and APIs traffic in bytes, and other encodings are not single encodings – they have a notion of byte order.
There is no room for a BOM in most of the APIs.
There is also no room for encoding errors. For example, in bash the string length ${#s} is a non-monotonic function of byte length because it gives nonsense values for invalid unicode. Basically you can have a string of 5 chars, and add 3 invalid bytes to it, and get 8. Then when you add the 4th valid byte, the length becomes 6.
If your language is meant to deal with HTTP which has a notion of default encoding and encoding metadata, a different encoding may be better. But Unix simply lacks that metadata, and thus all implementations that attempt to be encoding-aware are full of bugs.
The other huge reason is that UTF-8 is backward compatible with ASCII, and UTF-16, UTF-32 aren’t. It gracefully degrades.
A perfect example of this is that a Japanese developer of Oil just hit a bug in Oil’s Python 2 scripts, with LANG=*utf8, because Python 2’s default encoding is ASCII. The underlying cause was GNU $(date) producing a unicode character in Japanese.
Python 3 doesn’t fix these things – it took 6 years to choose a default file system encoding for Python 3 on all platforms, which is because it’s inherently an incoherent concept. You can have a single file system with 2 files with 2 different encodings, or 10 files with 10 different encodings.
So the default behavior should be to pass through bytes, not guess an encoding and then encode/decode EVERY string. That’s also slow for many applications, like a build system.
tl;dr A narrow waist provides interoperability, and UTF-8 is more interoperable than UTF-16 or UTF-32. Unix has no space for encoding metadata. Go-style Unicode works better for a shell than Python style (either Python 2 or 3).
UTF-8 is the only encoding that will give you a narrow waist :) That was one of the points of this blog post and hourglass diagrams: https://www.oilshell.org/blog/2022/02/diagrams.html#bytes-flat-files
I think you’re missing my point. If the encoding is exposed, you’ve built a leaky abstraction. For a shell, you are constrained by whatever other programs produce (and that will typically be controlled by the current locale, though typically UTF-8 these days, at least for folks using alphabetic languages), but that’s a special case.
There is no abstraction, so it’s not a leaky one. Shell is not really about building “high” abstractions, but seamless composition and interoperability.
The leaky abstraction is the array of code points! The leaks occur because you have the problems of “where do I declare the encoding?” and “what do I do with decode errors?”, when 99% of shell programs simply pass through data like this:
ls *.py
The shell reads directly from glob() and sends it to execve(). No notion of encoding is necessary. It’s also pointless to encode and decode there – it would simply introduce bugs where there were none.
Are you arguing for a Windows style or Python style where the file system returns unicode characters? I’d be very surprised by that, but I don’t have time to get into it here
Maybe in a blog post – the 2 above illuminated the issues very thoroughly
The article explains why it’s not practical to expose something that’s not bytes, utf-16 code units, or utf-32 code units / scalars:
Earlier, I said that the example used “Swift 4.2.3 on Ubuntu 18.04”. The “18.04” part is important! Swift.org ships binaries for Ubuntu 14.04, 16.04, and 18.04. Running the program
So Swift 4.2.3 on Ubuntu 18.04 as well as the unic_segment 0.9.0 Rust crate counted one extended grapheme cluster, the unicode-segmentation 1.3.0 Rust crate counted two extended grapheme clusters, and the same version of Swift, 4.2.3, but on a different operating system version counted three extended grapheme clusters!
Basically because it’s not really a property of the language, but of the operating system.
So I’ll flip it around and say: instead of publishing why “it is wrong”, publish what’s right and we’ll critique it :-)
Between Python 3.0 released in 2008 and Python 3.4 released in 2014, the Python filesystem encoding changed multiple times. It took 6 years to choose the best Python filesystem encoding on each platform.
(This is because the file system encoding is an incoherent concept, at least on Unix. libc LANG= LC_CTYPE= is also an incoherent design. )
This article tells the story of my PEP 540: Add a new UTF-8 Mode, which adds an opt-in option to “use UTF-8 everywhere”. Moreover, the UTF-8 Mode is enabled by the POSIX locale: Python 3.7 now uses UTF-8 for the POSIX locale.
Stinner is a hero for this …
The only reasonable argument for Python’s design is “Windows”, and historically that made sense, but I think even Windows is moving to UTF-8.
The number of bugs produced is simply staggering, not to mention the entire Python 2 to 3 transition.
A Go-style UTF-8 design is not just more efficient, but has fewer bugs. Based on my experience, the kinds of bugs you listed are theoretical, while there are dozens of real bugs caused by the leaky abstraction of an array of code points, which real applications essentially never use. LIBRARIES like case folding use code points; applications generally don’t. Display libraries that talk to the OS use graphemes; applications generally don’t.
It is difficult to reconcile your praise for “Go’s UTF-8 design” with your leaning so heavily on the specific criticism of handling filesystem encoding.
Assuming that everything will be, or can be treated as, UTF-8 is incredibly dangerous and effectively invites huge numbers of bugs and broken behaviors. This is why Rust – which otherwise is a “just use UTF-8” language – has a completely separate type for representing system-native “strings” – like filesystem paths – which might well turn out to be undecodable junk.
Also, I know of you primarily as someone who works on a shell, and so to you Python 3 probably did feel like a big step backwards, since I’m sure Python 2 felt a lot easier to you.
But to me the old Python 2 way was the bad old broken Unix way. Which is to say, completely and utterly unsafe and unfit for any purpose other than maybe some types of small toy scripts that will only ever exist in the closed ecosystem of their developer’s machine(s).
Basically everybody in the world outside of the niche of Unix-y scripting, as the very first thing they would do on picking up Python 2, had to go laboriously re-invent proper text handling and encoding/decoding (the “Unicode sandwich” model, as we used to call it), and live in fear that they might have made a mistake somewhere and would have a pager go off at 2AM because a byte “string” had accidentally gotten through all the defenses.
This was, if my tone hasn’t made it abundantly clear already, not a pleasant way to have to do things, and Python 2 only got to be pleasant (or somewhat pleasant) for many users because smart and patient people did all the tedious work of dealing with its busted broken approach to strings.
Python 3 in comparison is a massive step forward and a huge breath of fresh air. Does it probably make life seem more complex for you, personally, and for other people like you? Sure, though I’d argue that it only ever felt simpler in the past because you were operating in a niche where the complexity didn’t rise up and bite you as often or as obviously as the way it did me and people working in other domains, in large part due to historical Unix traditions mostly avoiding doing things with text beyond ASCII or, in a few favored places, perhaps latin-1.
But even if it did just make your life outright more difficult, I think it would be a worthwhile tradeoff; you’re dealing with a domain that really is complex and difficult, and it should be on you to find solutions for that complexity and difficulty, not on everyone else to deal with a language and an approach to strings that makes our lives vastly more difficult in order to pretend to make yours a bit easier. The old way, at best, kinda sorta swept the complexity (for you, not for me) under the rug.
No, the idea is not assuming everything is UTF-8. If you get an HTTP request that declares UTF-16, then you convert it to UTF-8, which is trivial with a library.
When you have a channel that doesn’t declare an encoding, you can treat it as bytes, and UTF-8 text will work transparently with operations on bytes. You can search for a / or a : within bytes or UTF-8; it doesn’t matter. You don’t need to know the encoding.
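A small Go sketch of that point (my own example): because UTF-8 never reuses ASCII byte values inside a multi-byte sequence, a byte-level search for / is safe without decoding anything.

    package main

    import (
    	"fmt"
    	"strings"
    )

    func main() {
    	p := "/home/ユーザー/プロジェクト/notes.txt" // non-ASCII path components, plain UTF-8 bytes
    	i := strings.LastIndexByte(p, '/')          // byte offset of the last '/'
    	fmt.Println(p[i+1:])                        // "notes.txt"
    }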
So the idea is that UTF-8 is the native representation, and you can either convert or not convert at the edges. The conversion logic belongs in the app / framework / libraries, not in the language and standard I/O itself.
That is how UTF-8 is designed; I don’t think you’re aware of that.
We’ll just have to leave this alone, but I don’t believe you’re speaking from experience. Everything you’ve brought up is theoretical (“this is how humans think, what humans want”), while I’ve brought up real bugs.
You’re also mistaking the user of the program with the programmer. UTF-8 is easy to understand for programmers, and easy to write libraries for – it’s not dangerous. The idea that UTF-8 is “incredibly dangerous” is ignorant, and you didn’t show any examples of “huge numbers of bugs and broken behaviors”.
When you have a channel that doesn’t declare an encoding, you can treat it as bytes, and UTF-8 text will work transparently with operations on bytes. You can search for a / or a : within bytes or UTF-8; it doesn’t matter. You don’t need to know the encoding.
Except UTF-8 does not “work transparently” like this, because at best there’s an extremely limited subset of operations you can do that won’t blow up. And that sort of sinks the entire idea that this thing is actually a “string” or “text” – if you’re going to call it that, people will expect to be able to do more operations than just “did this byte value occur somewhere in there”.
So the idea is that UTF-8 is the native representation, and you can either convert or not convert at the edges. The conversion logic belongs in the app / framework / libraries, not in the language and standard I/O itself.
Inside a program in a high-level language, the number of cases in which you should be working directly with a byte array or code-unit array and calling it “text” or “string” rounds to zero. At most, you could argue that Python should have gone the Rust route with a separate and explicitly not-text type for filesystem paths, but that still wouldn’t be an argument for having the string type be something that’s not a string (and byte arrays are not strings and code-unit arrays are not strings, no matter how much someone might want them to be).
That is how UTF-8 is designed; I don’t think you’re aware of that.
We’ll just have to leave this alone, but I don’t believe you’re speaking from experience. Everything you’ve brought up is theoretical (“this is how humans think, what humans want”), while I’ve brought up real bugs.
This is uncharitable to such an extreme degree that it is effectively just an insult.
But: Python 2’s “strings” were awful. They were unfit for purpose. They were a constant source of actual real genuine bugs. Claiming that they weren’t, or that Python 3 didn’t solve huge swathes of that all in one go by introducing a distinction between actual real strings and mere byte sequences, makes no sense to me because I and many others lived through that awfulness.
And recall that Python 3 – like Python 2 before it – initially did adopt code units as the atomic unit of its Unicode strings. And that still was a source of messiness and bugs, since it depended on compile-time flags to the interpreter and meant some code simply wasn’t portable between installations of Python, even of the exact same major.minor.patch version. It wasn’t until Python 3.3 that this changed and the Python string abstraction actually became a sequence of code points.
So if anyone needs to go re-read some Python history it’s you, since you seem to be thinking that something which wasn’t the case until Python 3.3 (string as sequence of code points) was responsible for trouble that dated back to prior versions (like I said, your argument really is not about the atomic unit of the string abstraction, it’s about whether filesystem paths ought to be handled as strings).
5 (codepoints) is sensible in exactly one case: communicating offsets for string replacement/annotation/formatting in a language-, Unicode-version-, and encoding-agnostic way.
Pretty great writeup, thanks!
Can one run other statically compiled tools, not written in go? As the text mentions busybox, I would assume so, but better to check.
you can run containers, so even if you can’t run the static binaries directly, you could shove them in a FROM scratch container and be done with it. (i don’t know the answer to your actual question, sorry)
Thanks for clarifying this. “Can you run containers in this?” is the specific thing I wanted to ask.
it’s worth checking out the docs, stapelberg has done an amazing job with documentation.
https://golangci-lint.run/usage/linters/
golangci-lint has quite a lot of linters. What would be recommended settings for small and mid-size projects?
Here is the CI workflow I’ve landed on after an enormous amount of trial and error and iteration. In order:
Notably, and deliberately, no golangci-lint.
from the article:
Yes, it can output a debug line for every place where the code it emits would be different.
Plotnine is a python implementation of ggplot2. It uses matplotlib as the backend. However, it is way more expressive and intuitive to use.
Does anyone know if the paper is publicly available?
The only link to the paper that I was able to find does not seem to work.
There is a similar paper from the same author: https://junchengyang.com/publication/hotos23-qdlp.pdf
https://s3fifo.com/ says “To Appear at SOSP” which seems to be scheduled for October. https://sosp2023.mpi-sws.org/
So I guess the paper will be published then?
Is there a new one out or something?
No, this is just a product page link.
Hilarious.
I thought I’d repost since the Moonlander’s popularity has increased a ton since the last post, which didn’t get much traction.
I didn’t know what a tarpit was and, hooo boy, neither does the author.
It’s a reference to the Mythical Man-Month’s first chapter “The Tar Pit”
That’s how I interpreted it, as the first paragraph ends with talking about increasingly wasted effort.
I don’t think Urban Dictionary is a reasonable way of trying to define things in “technical” articles.
I’m inclined to either guess it was meant quite literally as “you’ll be stuck in there if you go in” or maybe a reference to https://en.wikipedia.org/wiki/Turing_tarpit
I’m not a native English speaker, but I’ve heard of it specifically in the networking sense.
Protip, don’t search Urban Dictionary for your definitions.
https://en.wikipedia.org/wiki/La_Brea_Tar_Pits
https://en.wikipedia.org/wiki/Turing_tarpit
https://en.wikipedia.org/wiki/Tarpit_(networking)
I think it’s usually a reference to a famous place in Los Angeles
I, too, make small commits from a larger uncommitted change, but you don’t need a staging area for that.
Instead of sculpting a staging area and then committing, I sculpt the commit directly.
darcs record & pijul record continue amending the current commit in this fashion. unrecord allows “uncommiting”.
I’m aware, but I usually need to q the git add -p to edit a file, so my workflow isn’t amenable to doing git commit -p up front.
If you’re after that, you could try jujutsu
For one, Kafka is not a message queue. It has different semantics.
I really hate linking to Google Docs, but it’s a good text
Linking with the /mobilebasic suffix as you did is better than the default view. Although I’m also partial to /preview (i.e. this).
Seems that the results from the paper are wrong, there was an issue with the methodology: https://kenschutte.com/gzip-knn-paper/
Isn’t this one the same as https://lobste.rs/s/htsspg/low_resource_text_classification
Yeah, this is awfully niche and obtuse.
I initially thought it was the other Zed and was wondering what there was to open source.
This is what I started doing recently and it’s so much less of a headache.
nix-shell is the outdated stuff (I think we should be looking at nix develop at the moment) and having Nix manage your python is a bit weird.
Flakes still seem like a mess, so I’ll stick with the stable nix-shell for the time being.
Flakes are experimental, so nix-shell is not outdated.
Let’s be real, flakes are going to be the thing after this period of denial is over.
Their previous article did include that:
https://lobste.rs/s/lcckro/heisting_20_million_dollars_worth_magic
I made this site IPv6-only in hopes of thwarting repost bots that send toxic comments my way. For an IPv4 version, see https://legacy.cyrnel.net/solarwinds-hack-lessons-learned/
I have the same issue on mobile, from my ipv6 network.
Could it be possible we’re experiencing the effects of peering disputes? https://adminhacks.com/broken-IPv6.html
Although probably more likely that I just didn’t configure something right…
Ah I only opened port 80 in the firewall for the IPv6 version. Should be fixed for next time, thanks for the debugging help!
Works for me, thanks!
yeah same, I can ping it, but I can’t curl or browse it (phone and PC) - other ipv6 stuff works fine
Thanks! My ISP (local branch of My Republic) has zero plans/intent to implement IPv6, so it’s appreciated that there’s a legacy method. :)
Hm I didn’t realize “Scalar” means “Code Point but not any of the surrogate values”
Nice link to the Unicode glossary!
I wonder what the cleanest/shortest JSON string (with surrogates) -> UTF-8 encoded text decoder looks like?
I’m about to write one of those :)
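For the surrogate-pair part specifically, the core is tiny in Go; a minimal sketch of just that step (not a full JSON string decoder):

    package main

    import (
    	"fmt"
    	"unicode/utf16"
    )

    func main() {
    	// The two UTF-16 code units you'd get from a JSON escape like "\ud83e\udd26".
    	units := []uint16{0xD83E, 0xDD26}
    	runes := utf16.Decode(units) // combines the pair into U+1F926
    	fmt.Printf("%U %q\n", runes, string(runes)) // string() re-encodes as UTF-8
    }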
This linked article is also great:
Summary:
So for https://www.oilshell.org, ${#s} evaluates to 5 following bash, and I agree that’s problematic / not useful. But len(s) is just 17, following Python 2, Rust, Go. And then you can have libraries to do the other calculations.
Why does input validation care about grapheme length? It seems like the most irrelevant type of length.
Just to clarify since I quoted that article, I would say that for a shell, it IS wrong that len(facepalm) is 7, but it’s ALSO wrong that len(facepalm) is 5.
For a shell, it should be 17, and of course you can have other functions on top of UTF-8 to compute what you want. You can transform UTF-8 to an array of code points.
One main reason is that the Unix file system and APIs traffic in bytes, and other encodings are not single encodings – they have a notion of byte order.
There is no room for a BOM in most of the APIs.
There is also no room for encoding errors. For example, in bash the string length ${#s} is a non-monotonic function of byte length because it gives nonsense values for invalid Unicode. Basically you can have a string of 5 characters, add 3 invalid bytes to it, and get a reported length of 8. Then when you add a 4th byte that completes a valid character, the length drops to 6.
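The same kind of jump is easy to reproduce in Go, where utf8.RuneCountInString counts each invalid byte as one rune; a small illustration of the analogous effect (not bash itself):

    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	s := "héllo"                  // 5 code points
    	partial := s + "\xf0\x9f\xa4" // first 3 bytes of a 4-byte UTF-8 sequence
    	complete := partial + "\xa6"  // the 4th byte completes the character

    	fmt.Println(utf8.RuneCountInString(partial))  // 8: each invalid byte counts as one rune
    	fmt.Println(utf8.RuneCountInString(complete)) // 6: the 4 bytes now decode as one rune
    }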
If your language is meant to deal with HTTP which has a notion of default encoding and encoding metadata, a different encoding may be better. But Unix simply lacks that metadata, and thus all implementations that attempt to be encoding-aware are full of bugs.
The other huge reason is that UTF-8 is backward compatible with ASCII, while UTF-16 and UTF-32 aren’t. It gracefully degrades.
A perfect example of this is that a Japanese developer of Oil just hit a bug in Oil’s Python 2 scripts, with LANG=*utf8, because Python 2’s default encoding is ASCII. The underlying cause was GNU $(date) producing a Unicode character in Japanese.
Python 3 doesn’t fix these things – it took 6 years to choose a default file system encoding for Python 3 on all platforms, because it’s inherently an incoherent concept. You can have a single file system with 2 files in 2 different encodings, or 10 files in 10 different encodings.
So the default behavior should be to pass through bytes, not guess an encoding and then encode/decode EVERY string. That’s also slow for many applications, like a build system.
tl;dr A narrow waist provides interoperability, and UTF-8 is more interoperable than UTF-16 or UTF-32. Unix has no space for encoding metadata. Go-style Unicode works better for a shell than Python style (either Python 2 or 3).
I think you’re missing my point. If the encoding is exposed, you’ve built a leaky abstraction. For a shell, you are constrained by whatever other programs produce (and that will typically be controlled by the current locale, though typically UTF-8 these days, at least for folks using alphabetic languages), but that’s a special case.
There is no abstraction, so it’s not a leaky one. Shell is not really about building “high” abstractions, but seamless composition and interoperability.
The leaky abstraction is the array of code points! The leaks occur because you have the problems of “where do I declare the encoding?” and “what do I do with decode errors?”, when 99% of shell programs simply pass through data like this:
The shell reads directly from glob() and sends it to execve(). No notion of encoding is necessary. It’s also pointless to encode and decode there – it would simply introduce bugs where there were none.
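Go makes this pass-through style easy to see, since file names and argv entries are plain byte strings handed through unmodified; a rough sketch (cat as the child command is just an example):

    package main

    import (
    	"log"
    	"os"
    	"os/exec"
    	"path/filepath"
    )

    func main() {
    	// Glob results go to the child process as-is: no decode/encode step,
    	// so names that aren't valid UTF-8 still round-trip correctly.
    	paths, err := filepath.Glob("*.txt")
    	if err != nil || len(paths) == 0 {
    		log.Fatal("bad pattern or no matches")
    	}
    	cmd := exec.Command("cat", paths...)
    	cmd.Stdout = os.Stdout
    	cmd.Stderr = os.Stderr
    	if err := cmd.Run(); err != nil {
    		log.Fatal(err)
    	}
    }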
Are you arguing for a Windows style or Python style where the file system returns Unicode characters? I’d be very surprised by that, but I don’t have time to get into it here.
Maybe in a blog post – the two above illuminated the issues very thoroughly.
The article explains why it’s not practical to expose something that’s not bytes, UTF-16 code units, or UTF-32 code units / scalars:
Basically because it’s not really a property of the language, but of the operating system.
So I’ll flip it around and say: instead of publishing why “it is wrong”, publish what’s right and we’ll critique it :-)
I already said:
(and of the two I prefer code points as the base abstraction, but would not object to exposing a grapheme-oriented API on top of that)
I strongly disagree, read this series of 7 blog posts by Victor Stinner to see what problems this choice caused Python:
https://vstinner.github.io/painful-history-python-filesystem-encoding.html
(This is because the file system encoding is an incoherent concept, at least on Unix. The libc LANG= / LC_CTYPE= mechanism is also an incoherent design.)
https://vstinner.github.io/python37-new-utf8-mode.html – Stinner is a hero for this …
The only reasonable argument for Python’s design is “Windows”, and historically that made sense, but I think even Windows is moving to UTF-8.
The number of bugs produced is simply staggering, not to mention the entire Python 2 to 3 transition.
A Go-style UTF-8 design is not just more efficient, but has fewer bugs. Based on my experience, the kinds of bugs you listed are theoretical, while there are dozens of real bugs caused by the leaky abstraction of an array of code points, which real applications essentially never use. LIBRARIES like case folding use code points; applications generally don’t. Display libraries that talk to the OS use graphemes; applications generally don’t.
It is difficult to reconcile your praise for “Go’s UTF-8 design” with your leaning so heavily on the specific criticism of handling filesystem encoding.
Assuming that everything will be, or can be treated as, UTF-8 is incredibly dangerous and effectively invites huge numbers of bugs and broken behaviors. This is why Rust – which otherwise is a “just use UTF-8” language – has a completely separate type for representing system-native “strings” – like filesystem paths – which might well turn out to be undecodable junk.
Also, I know of you primarily as someone who works on a shell, and so to you Python 3 probably did feel like a big step backwards, since I’m sure Python 2 felt a lot easier to you.
But to me the old Python 2 way was the bad old broken Unix way. Which is to say, completely and utterly unsafe and unfit for any purpose other than maybe some types of small toy scripts that will only ever exist in the closed ecosystem of their developer’s machine(s).
Basically everybody in the world outside of the niche of Unix-y scripting, as the very first thing they would do on picking up Python 2, had to go laboriously re-invent proper text handling and encoding/decoding (the “Unicode sandwich” model, as we used to call it), and live in fear that they might have made a mistake somewhere and would have a pager go off at 2AM because a byte “string” had accidentally gotten through all the defenses.
This was, if my tone hasn’t made it abundantly clear already, not a pleasant way to have to do things, and Python 2 only got to be pleasant (or somewhat pleasant) for many users because smart and patient people did all the tedious work of dealing with its busted broken approach to strings.
Python 3 in comparison is a massive step forward and a huge breath of fresh air. Does it probably make life seem more complex for you, personally, and for other people like you? Sure, though I’d argue that it only ever felt simpler in the past because you were operating in a niche where the complexity didn’t rise up and bite you as often or as obviously as it did me and people working in other domains, in large part because historical Unix traditions mostly avoided doing things with text beyond ASCII or, in a few favored places, perhaps latin-1.
But even if it did just make your life outright more difficult, I think it would be a worthwhile tradeoff; you’re dealing with a domain that really is complex and difficult, and it should be on you to find solutions for that complexity and difficulty, not on everyone else to deal with a language and an approach to strings that makes our lives vastly more difficult in order to pretend to make yours a bit easier. The old way, at best, kinda sorta swept the complexity (for you, not for me) under the rug.
No, the idea is not assuming everything is UTF-8. If you get an HTTP request that declares UTF-16, then you convert it to UTF-8, which is trivial with a library.
When you have a channel that doesn’t declare an encoding, you can treat it as bytes, and UTF-8 text will work transparently with operations on bytes. You can search for a “/” or a “:” within bytes or UTF-8; it doesn’t matter. You don’t need to know the encoding.
So the idea is that UTF-8 is the native representation, and you can either convert or not convert at the edges. The conversion logic belongs in the app / framework / libraries, not in the language and standard I/O itself.
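This works because of how UTF-8 is laid out: every byte of a multi-byte sequence has its high bit set, so an ASCII delimiter byte can never appear inside one. A quick Go demonstration:

    package main

    import (
    	"fmt"
    	"strings"
    )

    func main() {
    	// Path components that happen to contain non-ASCII text.
    	p := "家/портфель/файл.txt"

    	// Byte-oriented search and split are safe: '/' (0x2F) never occurs
    	// inside a multi-byte UTF-8 sequence.
    	fmt.Println(strings.IndexByte(p, '/')) // byte offset of the first '/'
    	fmt.Println(strings.Split(p, "/"))     // [家 портфель файл.txt]
    }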
That is how UTF-8 is designed; I don’t think you’re aware of that.
We’ll just have to leave this alone, but I don’t believe you’re speaking from experience. Everything you’ve brought up is theoretical (“this is how humans think, what humans want”), while I’ve brought up real bugs.
You’re also mistaking the user of the program with the programmer. UTF-8 is easy to understand for programmers, and easy to write libraries for – it’s not dangerous. The idea that UTF-8 is “incredibly dangerous” is ignorant, and you didn’t show any examples of “huge numbers of bugs and broken behaviors”.
The reality is the exact opposite – see Stinner’s series of 7 posts: https://vstinner.github.io/python37-new-utf8-mode.html
Except UTF-8 does not “work transparently” like this, because at best there’s an extremely limited subset of operations you can do that won’t blow up. And that sort of sinks the entire idea that this thing is actually a “string” or “text” – if you’re going to call it that, people will expect to be able to do more operations than just “did this byte value occur somewhere in there”.
Inside a program in a high-level language, the number of cases in which you should be working directly with a byte array or code-unit array and calling it “text” or “string” rounds to zero. At most, you could argue that Python should have gone the Rust route with a separate and explicitly not-text type for filesystem paths, but that still wouldn’t be an argument for having the string type be something that’s not a string (and byte arrays are not strings, and code-unit arrays are not strings, no matter how much someone might want them to be).
This is uncharitable to such an extreme degree that it is effectively just an insult.
But: Python 2’s “strings” were awful. They were unfit for purpose. They were a constant source of actual real genuine bugs. Claiming that they weren’t, or that Python 3 didn’t solve huge swathes of that all in one go by introducing a distinction between actual real strings and mere byte sequences, makes no sense to me because I and many others lived through that awfulness.
And recall that Python 3 – like Python 2 before it – initially did adopt code units as the atomic unit of its Unicode strings. And that still was a source of messiness and bugs, since it depended on compile-time flags to the interpreter and meant some code simply wasn’t portable between installations of Python, even of the exact same major.minor.patch version. It wasn’t until Python 3.3 that this changed and the Python string abstraction actually became a sequence of code points.
So if anyone needs to go re-read some Python history it’s you, since you seem to be thinking that something which wasn’t the case until Python 3.3 (string as sequence of code points) was responsible for trouble that dated back to prior versions (like I said, your argument really is not about the atomic unit of the string abstraction, it’s about whether filesystem paths ought to be handled as strings).
5 (code points) is sensible in exactly one case: communicating offsets for string replacement/annotation/formatting in a language-, Unicode-version-, and encoding-agnostic way.
Although if you convert “🤦🏼♂️” to []rune in Go, the length is 1 (which makes sense to me.)
Did you try it? It should be 5 runes, because a rune is a code point.
I did try it, but now that I’ve tried it again, it’s 5, as you say. Which makes me wonder what I did wrong in the first attempt!
Maybe you had a plain farmer and not a black, male farmer.
I’m away from a computer, but I wonder what Perl’s take on this is. Perl has always had great support for Unicode.