Worth pondering … Fish is now over 18 years old, so why is there an article being written in July 2023 about these “Unix shell” pitfalls, and why does it draw discussion?
For some context, fish was first released in February 2005, and git was first released in April 2005 (according to Wikipedia). The computing world was a pretty different place before git existed
(I have my own answers, but I mean this literally – it is worth pondering :) Partly about fish, but also rc shell, etc. which is at least 10 years older than fish)
I used rc and es for like ten years, and I feel like I learned exactly why we’re all still writing sh. It’s not only legacy reasons, although it’s definitely sometimes legacy reasons. The more important thing, I think, is that it’s pretty clear what sh does wrong, but a lot less obvious what it does right. Most “sh done right” attempts fix the wrong things, and break everything else. rc is an excellent example of this. Looking at it having just done some shell programming, you’d expect it to be a breath of fresh air, but in practice it’s just much more annoying for almost anything imaginable.
Oh very interesting – this is actually a response I haven’t heard before! Some people say “use ${ancient_shell}” and they haven’t used it (although some obviously have).
What does RC break? And what specifically caused you to move off it after 10 years?
(From what I can see, in RC, everything is still a string, which I don’t think is good. In fish shell, everything is an array of strings, but I don’t think that’s good either. You really want multiple data structures)
(Also it seems weird that there are certainly some RC users, but there is virtually no active RC development as far as I know? I could be wrong)
it’s pretty clear what sh does wrong, but a lot less obvious what it does right.
Oh yes, very much agreed! There is a core of goodness to shell, that is not easy to explain, but IMO glaringly exists. The evidence is people getting so much work done with it, despite all the flaws
I spilled many words on this! https://www.oilshell.org/blog/tags.html?tag=shell-the-good-parts#shell-the-good-parts
If I had to sum it up, I’d say it’s “infinite composition and extensibility” and “universal glue”. As well as unifying a programming language and a minimal/fast UI.
From memory, and keep in mind I did switch to es at some point so the two blur together in my mind:
The really big thing I missed was sh’s string handling (appropriately enough) - interpolation, understanding more than one kind of quotation mark (I know it sounds silly), and %, %% and friends (a quick sketch of those follows below). Other things that regularly annoyed me were:
Getting the output of a command into a variable without word splitting it is fiddly
Reading input is clumsy
Checking if a variable is empty is annoying
Confusing syntax - control structures like for seem like they should take a compound command, but actually they take any command, including a pipeline. So {for (...) {...}} | ... is common. Just… why?
Syntax is always syntax in rc, while sh-alikes have some syntax that is only interpreted where it would make sense (e.g. { is only special when it’s in a place a command could be)
As time wore on, the hipster appeal of rc wasn’t strong enough to make me spend time figuring out how to do things in it, so increasingly I’d end up just starting bash whenever I wanted to do anything (it should probably say something that after many years of daily use I still had to stop and figure out how to do things). And at that point why not just live in bash?
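(The sketch mentioned above — the POSIX sh % / %% suffix-stripping operators, with a made-up path:)

path=/tmp/archive.tar.gz
echo "${path%.gz}"    # shortest suffix match removed: /tmp/archive.tar
echo "${path%%.*}"    # longest suffix match removed: /tmp/archive
echo "${path##*/}"    # longest prefix match removed: archive.tar.gz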
At $WORK, we only have access to sh, bash, and for MacOS users, zsh.
Oh are these managed dev machines?
I think production environments have long been locked down, but it seems like there is a trend for dev environments too
This is something that Fish gets right, among other things.
https://fishshell.com/docs/current/fish_for_bash_users.html#variables
I’m strawmanning a little here but I wouldn’t say it gets it right. Just yesterday I got bitten by this:
IMO, it makes more sense to provide an opt-in mechanism for expanding variable values as options rather than make the unsafe way the default.
ZSH was onto the right track with its ${=foo} syntax, except that it was meant for whitespaces.
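(For reference, a minimal sketch of the splitting behaviour at issue — plain bash/POSIX sh, made-up variable names:)

files='a b'
ls $files     # unquoted: split into two arguments, 'a' and 'b'
ls "$files"   # quoted: one argument, 'a b'
# in zsh the unquoted form does NOT split by default; ${=files} opts back in to splitting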
It’s not just the shell and it’s not just whitespace; Unix/POSIX file paths are full of dangers.
Years ago I thought the lack of restrictions was a sign of simple and clean design to be held up as a badge of honor compared to more limited operating systems. Now that I am responsible for production shell scripts I am a firm supporter of your view that filenames should be UTF-8 with no control characters.
Thanks for that! Near the top there’s this quote:
All of UTF-8 seems way too much to me. Call me an authoritarian, but my stance is that files should have been required to start with an alphanum ASCII character, and include only those along with dot, hyphen, and underscore.
And I’m not sure about the hyphen.
I mean… there are entire countries of people who don’t use the latin script. The proposal seems strict but perhaps workable for English speakers, but it’s neither fair nor workable in a global context.
Just like DNS, I don’t think anyone is stopping people from using the IDN rules and punycoding generally for a different display of filenames. Honestly, one could just use gibberish and set an xattr of display_name= for all of your non-ASCII or Latin-1 needs.
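(A sketch of that xattr idea using the Linux attr tools — user.display_name is a made-up attribute, not an existing convention:)

touch file_0001                                       # "gibberish" on-disk name
setfattr -n user.display_name -v 'café.txt' file_0001
getfattr -n user.display_name file_0001               # tools would read this for display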
But in all honesty, I’d just rather keep the status quo of near full 8-bit clean (NUL and / being the only restrictions) file names. Working restrictions just don’t seem worth the effort.
What’s the motivation for allowing control characters as legal in Unix filenames? I struggle to see the inclusion of newline in a filename as anything other than a nuisance or the result of user error.
“Worse is better”. It’s less that they’re allowed, and more that only the NUL byte and the ASCII solidus (0x2F) are forbidden.
And frankly I would not be surprised if a dir-level API let you create an entry with a / in it.
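(To make the newline nuisance concrete:)

touch 'one
file'                 # a single file whose name contains a newline
ls | wc -l            # any line-oriented consumer now sees two entries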
That would seem to fall down for any use case such as “I’ve got three files beginning with this native character I want to match with a glob”
We can deal with it just fine in exchange for the technical simplicity.
I mean, Microsoft’s old standard of 8.3 filenames was technically simple. I don’t understand why it’s not the most prevalent file naming convention today!
Eight characters isn’t long enough to create a meaningful title. 255 alphanumeric characters is.
I have softened a bit on this since yesterday. French speakers can pretty much always manage using the Latin alphabet sans diacritics, but more distant languages might actually need their local character set.
Still tho, the complexity difference between ASCII-only and UTF-8 is énorme, as /u/david_chisnall eloquently expressed in another comment. For a “Westerner” it would be a high price to pay for features that I don’t need.
Because “Westerners” are the only ones that matter in computing?
More like: I’m the only one who matters when I’m not being paid to write code.
more distant languages might actually need their local character set
How extraordinarily generous of you.
For a “Westerner” it would be a high price to pay for features that I don’t need.
Your own blog uses UTF-8, so this sounds a bit hypocritical to me.
UTF-8/Unicode is complex, true, but the purpose of software development is taming complexity. Denying billions of people the ability to create filenames in the script of their choosing in order to make a few software engineers’ lives a bit easier is abdicating that responsibility.
Of course it does. Text needs to work in every language.
But here we’re talking about filesystems and filenames; the contents of the files are arbitrary sequences of bytes and also not the point?
ASCII-only is a non-starter. If you tell people “the system forbids you using your own preferred language”, they will go use some other system.
File names belong to the user. If the system can’t handle them, that’s a defect, full stop.
The problem is the cases where the correct handling is unclear. 7-bit ASCII is unambiguous but also unhelpful. 8-bit character encodings are simple, but the meaning of the top half of the character set depends on some out-of-band metadata. If you use them and switch between, say, a French and Greek locale, they will be displayed (and sorted) differently.
Once we get to Unicode, things get far worse. First, you need to define some serialisation. UTF-32 avoids some of the problems but has two big drawbacks: It requires 4 bytes per character (other encodings average 1-3) and it is not backwards compatible with anything. On *NIX, the APIs were all defined in terms of ASCII, on Windows they were defined for UCS-2, so these platforms want an encoding that allows the same width code unit to be used. This is where the problems start. Not every sequence of 8-bit values is valid UTF-8 and not every sequence of 2-byte values is a valid UTF-16 string. This means, at least, the kernel must do some validation. This isn’t too painful but it does cause some extra effort in the kernel. It’s worse when you consider removable media: what should you do if you encounter a filesystem written on a system that treats file names as 8-bit code page names and contains files with invalid UTF-8?
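(A quick illustration of the invalid-bytes problem on Linux, where a filename is just bytes — assuming bash’s printf:)

touch "$(printf 'caf\xe9')"   # Latin-1 'é' (0xE9): legal as a filename, invalid as UTF-8
ls                            # a UTF-8 terminal can only render the name mojibake-style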
Beyond that, you discover that Unicode has multiple ways of representing the same human-readable text. For example, accented characters can often be represented as a single code point or as a base character plus a combining diacritic. If I create a file in one form and try to open it with the other form, what should happen? For extra fun, input methods in different apps may give different forms, and if you type the text in one form into an app that does canonicalisation and then copy it, you will get the other form, so copy and paste of file names might be broken if you treat filenames as a bag of bytes.
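(And a sketch of the two-forms problem — two byte sequences for the same human-visible name:)

touch "$(printf 'caf\xc3\xa9')"    # NFC: U+00E9 as a single code point
touch "$(printf 'cafe\xcc\x81')"   # NFD: 'e' plus U+0301 combining acute
ls | wc -l                         # 2 on ext4; HFS+ normalises and would treat these as one name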
If you want to handle these non-canonical cases then you need to handle canonicalisation of any path string coming into the kernel. The Unicode canonicalisation algorithm is non-trivial, so that’s a bunch more ring 0 code. Worse, newer versions of the standard add new rules, so you need to keep it up to date and you may encounter problems where a filesystem is created with a newer kernel and opened on an older one.
All I will say is that this seems mostly to point out that this is filesystem-dependent, and that filesystems are too complex to run in the kernel at ring 0.
Something we already knew but it adds one more reason…
It’s not totally clear to me how much of this complexity belongs in the filesystem and how much in the VFS layer. Things like canonicalisation of paths feel like they should be filesystem-independent. In particular, in something with a UNIX filesystem model, a single path can cross many different filesystems (junctions on NTFS make this possible on Windows too), and so there needs to be a canonical concept of a path that sits above the individual filesystem drivers.
Oh, sure, it’s going to be more complex. But that to me is pretty much the paradigmatic example of “Worse is Better”, and I tend to live on the “better” side of that divide.
Yeah it’s funny to discover things like “git allows whatever the shell does, except even more so” (AFAIK the only limitation Git itself puts on tree entry names is that they don’t contain a NUL byte, they can even contain a / though I’ve not checked what that does yet).
Oils fixes this, with a single line at the top:
test script:
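(A minimal sketch of such a test script, assuming the ysh:upgrade option group — spelled oil:upgrade in older Oils releases; the file names are made up:)

#!/usr/bin/env osh
shopt --set ysh:upgrade       # the "single line at the top"

mkdir -p tmp
touch 'two words.jpg'
name=$(ls -t | head -n 1)     # without the upgrade line, $name would split at the space below
cp $name tmp/                 # simple word evaluation: no splitting, no globbing
ls tmp/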
Without the upgrade line, you get the same errors that the author complains about – it mangles the filenames.
With the upgrade, you get the files in the tmp dir as you expect.
This is called “Simple Word Evaluation” and actually any shell can implement it – I documented it more than 3 years ago!
Simple Word Evaluation in Unix Shell
Reminder: Oil Doesn’t Require Quoting Everywhere (2021)
It’s hard though because word evaluation is one of the nastiest parts of shell. But we implemented it faithfully and provide an upgrade to a better behavior.
(zsh and fish also do better, but they’re less compatible with POSIX/bash in other ways, especially fish.)
This is the #1 reason I enjoy using the plan 9 rc shell[1] for scripts. There’s exactly one place where word splitting happens: at the point where command output is evaluated. And there, it’s trivial to use any character you want:
x = `{echo hi there: $user} # evaluates to list ('hi' 'there:' 'ori')
y = `:{echo hi there: $user} # evaluates to list ('hi there' ' ori')
There’s no other word splitting, so:
args = ('a' 'b c' 'd e f')
echo $#args
echo $args(3)
will print:
3
d e f
The shell itself is pleasantly simple; there’s not much to learn[2]. And while it’s not fun for interactive use on unix because it offloads too much of the interactive pleasantness to the plan 9 window system (rio), it’s still great for scripting.
[1] http://shithub.us/cinap_lenrek/rc/HEAD/info.html
[2] http://man.9front.org/1/rc
Glad to see someone else has seen the light! Though I prefer Byron’s rc as it fixes many ergonomic issues with the plan9 rc.
Thankfully shellcheck is a thing nowadays.
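(For instance, a sketch of the kind of thing shellcheck flags — output abridged:)

$ cat cleanup.sh
#!/bin/sh
rm -f $1
$ shellcheck cleanup.sh
In cleanup.sh line 2:
rm -f $1
      ^-- SC2086: Double quote to prevent globbing and word splitting.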
The existence of find -print0 is an indictment of everything that is wrong about traditional Unix shell parsing. But it’s what we’re stuck with. Shells weren’t designed to work with filenames with whitespace.
This article doesn’t even get into the horrors that happen if you try to use rsync or ssh in combination with spaces in filenames.
We live in a fallen world.
I like how he zeroed in on the whitespace quote thing, like that’s the only thing wrong with shell script. lol
The workaround that I have for the lastdl thing is:
Don’t write lastdl in shell script (which he does, although he wrote it in Python, which…. also sucks, but not nearly as bad as shell script)
Have a custom piping mechanism within the lastdl script so you don’t have to use "$(pleasekillme)", i.e. something like lastdl -x rm (a sketch follows below)
Of course this falls apart in so many use cases but for the one you are using it for 99% of the time (append the filename to the end of a command) it works pretty well if you ask me. That is probably the same exact argument Bourne had. He was prolly like “none of the files on my hard drive have whitespace, it doesn’t matter”
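(A minimal sketch of that -x idea — the lastdl name, the -x flag, and the LASTDL_DIR variable are all hypothetical:)

#!/bin/sh
# lastdl: print the newest download, or exec a command on it with -x
dir=${LASTDL_DIR:-$HOME/Downloads}
last=$(ls -t -- "$dir" | head -n 1)   # still has the newline-in-name caveat
if [ "$1" = -x ]; then
    shift
    exec "$@" "$dir/$last"            # e.g. lastdl -x rm
else
    printf '%s\n' "$dir/$last"
fi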
IF you know you have GNU tools, a simpler workaround than the article’s proffer would seem to be:
ls -t --zero | head -zn1 | xargs -0 echo rm -f --
which feels in line with your suggestion. This doesn’t argue against the syntax complaints, of course, since it just moves everything out of shell language except “pipeline syntax”. { echo just to be safe before actually running it.. :-) }
Just use rc(1)
Which rc? The original Plan9 implementation or Drew DeVault’s version?
I use Rakitzis’ port: https://github.com/rakitzis/rc
What about:
It may allow completely dropping the set_defaults_args function :)
While I’m here, most_recent_file_in_directory could probably be rewritten as a one-liner (a guessed sketch follows below).
Beware, I’m on my phone, I tested nothing :p
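(A purely hypothetical reconstruction of such a one-liner in sh — most_recent_file_in_directory comes from the article under discussion, the body here is a guess:)

most_recent_file_in_directory() {
    ls -t -- "${1:-.}" | head -n 1   # breaks on newline-in-name, as discussed above
}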
If you scroll up to my top-level comment, I linked a more general article about risky things that are legal to do in Unix/POSIX filenames, and it’s full of examples of “oh, why not just do this” that don’t actually work (including some “well we’ll just do it in Python to avoid shell weirdness” and still having things break).
What about ls | xargs et al?
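(Concretely — why plain ls | xargs is risky, and the NUL-delimited form that isn’t:)

touch 'two words.txt'
ls | xargs ls -l                                            # xargs splits on whitespace: 'two' and 'words.txt'
find . -maxdepth 1 -name '*.txt' -print0 | xargs -0 ls -l   # NUL-delimited: safe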