Eh, there are some problems with xargs, but this isn’t a good critique. First off, it proposes a “solution” that doesn’t even handle spaces in filenames (much less newlines):
rm $(ls | grep foo)
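You can see the word splitting if you stand printf in for rm (a scratch directory with one matching file):
$ touch 'my foo.txt'
$ printf '[%s]\n' $(ls | grep foo)
[my]
[foo.txt]
The single file became two unrelated arguments, so rm would try to delete 'my' and 'foo.txt' instead.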
I prefer this as a practical solution (that handles every char except newlines in filenames):
ls | grep foo | xargs -d $'\n' -- rm
You can also pipe find . -print0 to xargs -0 if you want to handle newlines (untrusted data).
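For example (GNU or BSD find; the filtering moves into find itself, since there's no grep step in a NUL-delimited pipeline):
find . -maxdepth 1 -name '*foo*' -print0 | xargs -0 -- rm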
(Although then you have the problem that there’s no grep -0, which is why Oil has QSN. grep still works on QSN, and QSN can represent every string, even those with NULs!)
One nice thing about xargs is that you can preview the commands by adding ‘echo’ on the front:
ls | grep foo | xargs -d $'\n' -- echo rm
That will help get the tokenization right, so you don’t feed the wrong thing into the commands!
I never use xargs -L, and I sometimes use xargs -I {} for simple invocations. But even better than that is using xargs with the $0 Dispatch pattern, which I still need to write up properly.
Basically, instead of the mini-language of -I {}, just use shell by recursively invoking shell functions. I use this all the time, e.g. all over Oil and elsewhere.
#!/bin/bash

do_one() {
  # It's more flexible to use a function with "$1" instead of -I {}
  echo "Do something with $1"
  echo mv "$1" /tmp
}

do_all() {
  # Call the do_one function for each item. Also add -P to make it parallel.
  cat tasks.txt | grep foo | xargs -n 1 -d $'\n' -- "$0" do_one
}

"$@" # dispatch on $0; or use 'runproc' in Oil
Now run with
myscript.sh do_all, or
myscript.sh do_one to test out the “work” function (very handy! you need to make this work first)
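Concretely, with the script saved as myscript.sh ('some file.txt' below is just a made-up item):
$ chmod +x myscript.sh
$ ./myscript.sh do_one 'some file.txt'   # test the "work" step on one item first
Do something with some file.txt
mv some file.txt /tmp
$ ./myscript.sh do_all                   # then run it over everything in tasks.txt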
This separates the problem nicely – make it work on one thing, and then figure out which things to run it on. When you combine them, they WILL work, unlike the “sed into bash” solution.
Reading up on what xargs -L does, I’ve avoided it because it’s a custom mini-language. It says that trailing blanks cause line continuations. Those sorts of rules seem silly to me.
I also avoid -I {} because it’s a custom mini-language.
IMO it’s better to just use the shell, and one of these three invocations:
xargs – when you know your input is “words” like myhost otherhost
xargs -d $'\n' – when you want lines
xargs -0 – when you want to handle untrusted data (e.g. someone putting a newline in a filename)
Those 3 can be combined with -n 1 or -n 42, and they will do the desired grouping. I’ve never needed anything more than that.
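For example, batches of 3:
$ seq 10 | xargs -n 3 -- echo
1 2 3
4 5 6
7 8 9
10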
So yes xargs is weird, but I don’t agree with the author’s suggestions. sed piped into bash means that you’re manipulating bash code with sed, which is almost impossible to do correctly.
Instead I suggest combining xargs and shell, because xargs works with arguments and not strings. You can make that correct and reason about what it doesn’t handle (newlines, etc.).
(OK I guess this is the start of a blog post; I also gave a 5-minute presentation 3 years ago about this: http://www.oilshell.org/share/05-24-pres.html)
I use find . -exec very often for running a command on lots of files. Why would you choose to pipe into xargs instead?
It can be much faster (depending on the use case). If you’re trying to rm 100,000 files, you can start one process instead of 100,000 processes! (the max number of args to a process on Linux is something like 131K as far as I remember).
It’s basically
rm one two three
vs.
rm one
rm two
rm three
Here’s a comparison showing that find -exec is slower: https://www.reddit.com/r/ProgrammingLanguages/comments/frhplj/some_syntax_ideas_for_a_shell_please_provide/fm07izj/
Another reference: https://old.reddit.com/r/commandline/comments/45xxv1/why_find_stat_is_much_slower_than_ls/
Good question, I will add this to the hypothetical blog post! :)
@andyc Wouldn’t the find + (rather than ;) option solve this problem too?
Oh yes, it does! I don’t tend to use it, since I use xargs for a bunch of other stuff too, but that will also work. Looks like busybox supports it too, in addition to GNU (I would guess it’s in POSIX).
the max number of args to a process on Linux is something like 131K as far as I remember
Time for the other really, really useful feature of xargs. ;)
$ echo | xargs --show-limits
Your environment variables take up 2222 bytes
POSIX upper limit on argument length (this system): 2092882
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2090660
Size of command buffer we are actually using: 131072
Maximum parallelism (--max-procs must be no greater): 2147483647
It’s not a limit on the number of arguments, it’s a limit on the total size of environment variables + command-line arguments (+ some other data, see getauxval(3) on a Linux machine for details). Apparently Linux defaults to a quarter of the available stack allocated for new processes, but it also has a hard limit of 128KiB on the size of each individual argument (MAX_ARG_STRLEN). There’s also MAX_ARG_STRINGS which limits the number of arguments, but it’s set to 2³¹-1, so you’ll hit the ~2MiB limit first.
Needless to say, a lot of these numbers are much smaller on other POSIX systems, like BSDs or macOS.
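You can also query the raw limit directly; on a typical Linux box it lines up with the ~2 MiB figure above (the exact value depends on the stack rlimit):
$ getconf ARG_MAX
2097152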
find . -exec blah will fork a process for each file, while find . | xargs blah will fork a process per X files (where X is determined by the system-wide argument length limit). The latter could run quite a bit faster. I will typically do find . -name '*.h' | xargs grep SOME_OBSCURE_DEFINE and depending upon the repo, that might only expand to one grep.
As @jonahx mentions, there is an option for that in find too:
-exec utility [argument ...] {} +
        Same as -exec, except that "{}" is replaced with as many pathnames as possible for each invocation of utility. This behaviour is similar to that of xargs(1).
I didn’t know about the ‘+’ option to find, but I also use xargs with a custom script that scans for source files in a directory (not in sh or bash as I personally find shell scripting abhorrent).
That is the real beauty of xargs. I didn’t know about using + with find, and while that’s quite useful, remembering it means I need to remember something that only works with find. In contrast, xargs works with anything that can supply a newline-delimited list of filenames as input.
Yes, this. Even though the original post complains about too many features in xargs, find is truly the worst with a million options.
This comment was a great article in itself.
Conceptually, I think of xargs primarily as a wrapper that enables tools that don’t support stdin to support stdin. Is this a good way to think about it?
Yes I’d think of it as an “adapter” between text streams (stdin) and argv arrays. Both of those are essential parts of shell and you need ways to move back and forth. To move the other way you can simply use echo (or write -- @ARGV in Oil).
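For example, inside a shell function:
printf '%s\n' "$@"   # one arg per line; echo "$@" joins them with spaces instead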
Another way I think of it is to replace xargs with the word “each” mentally, as in Ruby, Rust, and some common JS idioms.
You’re basically separating iteration from the logic of what to do on each thing. It’s a special case of a loop.
In a loop, the current iteration can depend on the previous iteration, and sometimes you need that. But in xargs, every iteration is independent, which is good because you can add xargs -P to automatically parallelize it! You can’t do that with a regular loop.
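For example, compressing log files with up to four processes at a time (just an illustration; any per-file command works):
find . -name '*.log' -print0 | xargs -0 -n 1 -P 4 -- gzip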
I would like Oil to grow an each builtin that is a cleaned up xargs, following the guidelines I enumerated.
I’ve been wondering if it should be named each and every?
each – like xargs -n 1, and find -exec foo \; – call a process on each argument
every – like xargs, and find -exec foo + – call the minimal number of processes, but exhaust all arguments
So something like
proc myproc { echo $1 } # passed one arg
find . | each -- myproc # call a proc/shell function on each file, newlines are the default
proc otherproc { echo @ARGV } # passed many args
find . | every -- otherproc # call the minimal number of processes
If anyone has feedback I’m interested. Or wants to implement it :)
Probably should add this to the blog post: Why use xargs instead of a loop?
It’s easier to preview what you’re doing by sticking echo on the beginning of the command. You’re decomposing the logic of which things to iterate on, and what work to do.
When the work is independent, you can parallelize with xargs -P
You can filter the work with grep. Instead of find | xargs, do find | grep | xargs. This composes very nicely
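For example, counting lines in non-test Python files (newline-delimited, so the usual caveat about newlines in names applies):
find . -name '*.py' | grep -v '_test' | xargs -d $'\n' -- wc -l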
If someone creates a file named -rf ~ # foo, you’re about to have a very bad time. You’ll also wind up spawning a process for every file, which may or may not affect things (probably not much for rm, but definitely for something that has a lot of setup/teardown).
Overall this post feels like someone who doesn’t understand why xargs exists saying it’s useless. Like, he doesn’t even mention argument escaping, the command line length limit…
Tangentially, the author has another post where he says that xterm has 1.6ms of latency, which makes it feel instant. I’d be very interested to know how xterm can display a character faster than the refresh rate of the display!
Tangentially, the author has another post where he says that xterm has 1.6ms of latency, which makes it feel instant
Presumably that is the latency that xterm alone adds to the pipeline, not the end-to-end latency. Keyboard debouncing alone generally adds 5-20ms of latency, so even ignoring the display 1.6ms is not really sensical.
Yeah, I’m sure the 1.6ms number is real (it cites a fairly in-depth-looking LWN article, which does show a 1.6ms number for uxterm), and decreased latency will still result in it being faster for you to see the character on screen (since the refresh rate means that adding even a single millisecond of terminal latency can add 16ms of display latency if you push over a refresh interval). It’s just that the way that the author phrased the paragraph makes me think he thinks that if you have xterm open, you really will see that character onscreen 1.6ms later. But it’s possible I’m just experiencing an inverse halo effect: he’s wrong about xargs, so I’m assuming he’s wrong about unrelated things.
Judging by the comments here I’m not interested in reading the article. But why use ls | grep foo at all instead of *foo* as the argument for rm?
I was also distracted by using the output of ls in scripting, which is a golden rule no-no.
Is this not what ls -D is for?
Despite “The UNIX Way” saying that we have all these little composable command line tools that we can interop using the universal interchange language of plaintext, it is also said that we should never parse the output of ls. The reasons for this are unclear to me; patches that would have supported this have been rejected.
Definitely the glob is the right way to do this, and if things get more complex, the find command.
“Never parse the output of ls” is a bit strong, but I can see the rationale for such a rule.
Basically the shell already knows how to list files with *.
for name in *; do  # no external processes started here, just glob()
  echo "$name"
done
That covers 90% of the use cases where you might want to parse the output of ls.
One case where you would is suggested by this article:
# Use a regex to filter Python or C++ tests, which is harder in the shell (at least a POSIX shell)
ls | egrep '.*_test.(py|cc)' | xargs -d $'\n' echo
BTW I’d say ls is a non-recursive special case of find, and ls lacks -print for formatting and -print0 for parseable output. It may be better to use find . -maxdepth 1 in some cases, but I’m comfortable with the above.
Almost always, I use the shell iteratively, working stepwise to my goal. Pipelines like that are the outcome of that process.
I gave an example below – if you want to filter by a regex and not a constant string.
# Use a regex to filter Python or C++ tests, which is harder in the shell (at least a POSIX shell)
ls | egrep '.*_test.(py|cc)' | xargs -d $'\n' echo
You can do this with extended globs too in bash, but that syntax is pretty obscure. You can also use regexes without egrep via [[. There are millions of ways to do everything in shell :)
I’d say that globs and find cover 99% of use cases, but I can see ls | egrep being useful on occasion.
If normal globs aren’t enough, I’d use extended glob or find. But yeah, find would require options to prevent hidden files and recursive search compared to default ls. If this is something that is needed often, I’d make a function and put it in .bashrc.
That said, I’d use *_test.{py,cc} for your given example and your regex should be .*_test\.(py|cc)$ or _test\.(py|cc)$
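i.e. the ls pipeline above would become something like:
ls | egrep '_test\.(py|cc)$' | xargs -d $'\n' -- echo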
I have parsed ls occasionally too - ex: -X to sort by extension, -q and pipe to wc for counting files, -t for sorting by time, etc.
And I missed the case of too many arguments for rm *foo* (for which I’d use find again) regarding the comment I made. I should’ve just read the article enough to know why ls | grep was being used.
That’s clearly just a placeholder pipeline. No one actually wants *foo* anyhow.
It seems paste has been forgotten.
I disagree with the article in general, but I’ll give it an up-vote because I learnt about the -L option, which I’ve never seen/used before.
I’ve never used it, but I claim it can always be replaced with -n? (see my long comment here)
I’m interested in any counterexamples. The difference appears to be that -n works on args that were already tokenized by -d or -0, while -L has its own tokenization rules? I think the former is better because it’s more orthogonal to the rest of the command.
Here’s a longer example. It correctly does the tokenization you want (split on newlines), and then produces batches of 2 args.
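For instance, with seq 10 as the input:
$ seq 10 | xargs -d $'\n' -n 2 -- echo
1 2
3 4
5 6
7 8
9 10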
I don’t think -L can always be replaced with -n. They appear the same because seq 10 gives only one token on each line, and -L aggregates lines. Look what happens if you have 3 tokens on each line: -L counts lines, so each invocation gets all the tokens from those lines, while -n counts tokens and regroups them regardless of line boundaries. I agree that -n seems more generally useful.
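A quick way to see the difference (three lines of three tokens each):
$ printf '%s\n' 'a b c' 'd e f' 'g h i' | xargs -L 2 -- echo
a b c d e f
g h i
$ printf '%s\n' 'a b c' 'd e f' 'g h i' | xargs -n 2 -- echo
a b
c d
e f
g h
i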
Yeah, I shouldn’t have said “always replace”. I think it’s more like “-L is never what you want; you want -n” :) It does something different, and that something isn’t good. Again, I’d be interested in any realistic counterexamples.