1. 24
  1.  

  2. 42

    Eh, there are some problems with xargs, but this isn’t a good critique. First off, it proposes a “solution” that doesn’t even handle spaces in filenames (much less, say, newlines):

    rm $(ls | grep foo)
    

    I prefer this as a practical solution (that handles every char except newlines in filenames):

    ls | grep foo | xargs -d $'\n' -- rm
    

    You can also pipe find . -print0 to xargs -0 if you want to handle newlines (untrusted data).

    (Although then you have the problem that there’s no grep -0, which is why Oil has QSN. grep still works on QSN, and QSN can represent every string, even those with NULs!)
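
    A minimal sketch of the difference between the newline-delimited and NUL-delimited pipelines, assuming GNU find and xargs (-print0, -0 and -d are extensions, not POSIX). The helper names count_nul_args and count_line_args and the temporary directory are made up for illustration:

    ```shell
    # Hypothetical helpers: print how many arguments one command receives
    # when the file names under directory $1 are fed through xargs.
    count_nul_args()  { find "$1" -type f -print0 | xargs -0 sh -c 'echo "$#"' _; }
    count_line_args() { find "$1" -type f | xargs -d '\n' sh -c 'echo "$#"' _; }

    demo_dir=$(mktemp -d)
    # Two files: one name contains a space, the other an embedded newline.
    touch "$demo_dir/with space" "$demo_dir/$(printf 'with\nnewline')"

    count_nul_args "$demo_dir"    # 2: each name stays one argument
    count_line_args "$demo_dir"   # 3: the newline splits one name in two
    rm -rf "$demo_dir"
    ```

    Here -d '\n' relies on GNU xargs interpreting the escape itself, which should behave the same as -d $'\n' in bash.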


    One nice thing about xargs is that you can preview the commands by adding ‘echo’ on the front:

    ls | grep foo | xargs -d $'\n' -- echo rm
    

    That will help get the tokenization right, so you don’t feed the wrong thing into the commands!

    I never use xargs -L, and I sometimes use xargs -I {} for simple invocations. But even better than that is using xargs with the $0 Dispatch pattern, which I still need to properly write about.

    Basically instead of the mini language of -I {}, just use shell by recursively invoking shell functions. I use this all the time, e.g. all over Oil and elsewhere.

    do_one() {
       # It's more flexible to use a function with $1 instead of -I {}
       echo "Do something with $1"  
       echo mv "$1" /tmp
    }
    
    do_all() {
      # call the do_one function for each item.  Also add -P to make it parallel
      cat tasks.txt | grep foo | xargs -n 1 -d $'\n' -- "$0" do_one
    }
    
    "$@"  # dispatch on $0; or use 'runproc' in Oil
    

    Now run with

    • myscript.sh do_all, or
    • myscript.sh do_one to test out the “work” function (very handy! you need to make this work first)

    This separates the problem nicely – make it work on one thing, and then figure out which things to run it on. When you combine them, they WILL work, unlike the “sed into bash” solution.


    Reading up on what xargs -L does, I have avoided it because it’s a custom mini-language. It says that trailing blanks cause line continuations. Those sort of rules are silly to me.
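
    To see the trailing-blank rule in action, here is a small sketch assuming GNU xargs. The helper name three_lines is made up, and its first line deliberately ends in a space:

    ```shell
    # Hypothetical input: three lines, the first one ending in a blank.
    three_lines() { printf 'a \nb\nc\n'; }

    # -L 1: the trailing blank after "a" continues the logical line onto "b",
    # so echo runs twice instead of three times.
    three_lines | xargs -L 1 echo

    # -d '\n' -n 1: every physical line is exactly one argument, no continuation.
    three_lines | xargs -d '\n' -n 1 echo
    ```

    The second form has no hidden rule to remember: the delimiter decides what an argument is, and -n decides how many go to each invocation.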

    I also avoid -I {} because it’s a custom mini-language.

    IMO it’s better to just use the shell, and one of these three invocations:

    • xargs – when you know your input is “words” like myhost otherhost
    • xargs -d $'\n' – when you want lines
    • xargs -0 – when you want to handle untrusted data (e.g. someone putting a newline in a filename)

    Those 3 can be combined with -n 1 or -n 42, and they will do the desired grouping. I’ve never needed anything more than that.
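
    A quick sketch of those three tokenizations combined with -n, assuming GNU xargs for -d. The variable name show and the sample inputs are invented for the demo; the script just prints each argument it receives in brackets, one output line per invocation:

    ```shell
    # Script run once per batch: prints every argument it received in brackets.
    show='for a in "$@"; do printf "[%s]" "$a"; done; echo'

    # 1. Words: whitespace-separated tokens, two per invocation.
    echo 'myhost otherhost thirdhost' | xargs -n 2 sh -c "$show" _

    # 2. Lines: inner spaces are kept, one line per invocation.
    printf 'foo bar\nspam eggs\n' | xargs -d '\n' -n 1 sh -c "$show" _

    # 3. NUL-delimited: an argument may even contain a newline.
    printf 'a\nb\0c d\0' | xargs -0 -n 1 sh -c "$show" _
    ```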

    So yes xargs is weird, but I don’t agree with the author’s suggestions. sed piped into bash means that you’re manipulating bash code with sed, which is almost impossible to do correctly.

    Instead I suggest combining xargs and shell, because xargs works with arguments and not strings. You can make that correct and reason about what it doesn’t handle (newlines, etc.)

    (OK I guess this is a start of a blog post, I also gave a 5 minute presentation 3 years ago about this: http://www.oilshell.org/share/05-24-pres.html)

    1. 10

      pipe find . -print0 to xargs -0

      I use find . -exec very often for running a command on lots of files. Why would you choose to pipe into xargs instead?

      1. 12

        It can be much faster (depending on the use case). If you’re trying to rm 100,000 files, you can start one process instead of 100,000 processes! (the max number of args to a process on Linux is something like 131K as far as I remember).

        It’s basically

        rm one two three
        

        vs.

        rm one
        rm two
        rm three
        

        Here’s a comparison showing that find -exec is slower:

        https://www.reddit.com/r/ProgrammingLanguages/comments/frhplj/some_syntax_ideas_for_a_shell_please_provide/fm07izj/

        Another reference: https://old.reddit.com/r/commandline/comments/45xxv1/why_find_stat_is_much_slower_than_ls/
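
        The process counts are easy to check directly. A rough sketch, where count_runs is a made-up helper that has each invocation print one line, so the line count equals the number of processes started:

        ```shell
        # Hypothetical helper: count how many times xargs starts the command.
        # Extra xargs options (such as -n 1) are passed through as arguments.
        count_runs() { xargs "$@" sh -c 'echo run' _ | wc -l; }

        seq 1 1000 | count_runs -n 1   # one process per argument: 1000
        seq 1 1000 | count_runs        # everything fits in one command: 1
        ```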

        Good question, I will add this to the hypothetical blog post! :)

        1. 15

          @andyc Wouldn’t the find + (rather than ;) option solve this problem too?

          1. 5

            Oh yes, it does! I don’t tend to use it, since I use xargs for a bunch of other stuff too, but that will also work. Looks like busybox supports it too, in addition to GNU (I would guess it’s in POSIX).

          2. 11

            the max number of args to a process on Linux is something like 131K as far as I remember

            Time for the other really, really useful feature of xargs. ;)

            $ echo | xargs --show-limits
            Your environment variables take up 2222 bytes
            POSIX upper limit on argument length (this system): 2092882
            POSIX smallest allowable upper limit on argument length (all systems): 4096
            Maximum length of command we could actually use: 2090660
            Size of command buffer we are actually using: 131072
            Maximum parallelism (--max-procs must be no greater): 2147483647
            

            It’s not a limit on the number of arguments, it’s a limit on the total size of environment variables + command-line arguments (+ some other data, see getauxval(3) on a Linux machine for details). Apparently Linux defaults to a quarter of the available stack allocated for new processes, but it also has a hard limit of 128KiB on the size of each individual argument (MAX_ARG_STRLEN). There’s also MAX_ARG_STRINGS which limits the number of arguments, but it’s set to 2³¹-1, so you’ll hit the ~2MiB limit first.

            Needless to say, a lot of these numbers are much smaller on other POSIX systems, like BSDs or macOS.
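
            A small sketch for checking this on a given machine without --show-limits, which is GNU-only. The exact numbers depend on the system, so none are shown here:

            ```shell
            # Ceiling in bytes for argv plus environment, as reported by the system.
            getconf ARG_MAX

            # xargs splits the input on its own: 100000 numbers do not fit in one
            # command buffer, so echo runs several times (one output line per run).
            seq 1 100000 | xargs echo | wc -l
            ```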

          3. 1

            find . -exec blah will fork a process for each file, while find . | xargs blah will fork a process per X files (where X is the system-wide argument length limit). The latter could run quite a bit faster. I will typically do find . -name '*.h' | xargs grep SOME_OBSCURE_DEFINE and depending upon the repo, that might only expand to one grep.

            1. 5

              As @jonahx mentions, there is an option for that in find too:

                   -exec utility [argument ...] {} +
                           Same as -exec, except that ``{}'' is replaced with as many pathnames as possible for each invocation of utility.  This
                           behaviour is similar to that of xargs(1).
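
              A sketch comparing the two terminators on a throwaway directory. The directory and file names are invented; each invocation prints one line, so wc -l reports how many processes find started:

              ```shell
              # Hypothetical demo directory holding 50 empty files.
              demo_dir=$(mktemp -d)
              for i in $(seq 1 50); do : > "$demo_dir/file$i"; done

              # \; starts the command once per file: 50 lines.
              find "$demo_dir" -type f -exec sh -c 'echo run' _ {} \; | wc -l

              # + batches as many paths as fit into one invocation: 1 line here.
              find "$demo_dir" -type f -exec sh -c 'echo run' _ {} + | wc -l

              rm -rf "$demo_dir"
              ```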
              
                1. 4

                  That is the real beauty of xargs. I didn’t know about using + with find, and while that’s quite useful, remembering it means I need to remember something that only works with find. In contrast, xargs works with anything that can supply a newline-delimited list of filenames as input.

                  1. 3

                    Yes, this. Even though the original post complains about too many features in xargs, find is truly the worst with a million options.

          4. 7

            This comment was a great article in itself.

            Conceptually, I think of xargs primarily as a wrapper that enables tools that don’t support stdin to support stdin. Is this a good way to think about it?

            1. 9

              Yes I’d think of it as an “adapter” between text streams (stdin) and argv arrays. Both of those are essential parts of shell and you need ways to move back and forth. To move the other way you can simply use echo (or write -- @ARGV in Oil).

              Another way I think of it is to replace xargs with the word “each” mentally, as in Ruby, Rust, and some common JS idioms.

              You’re basically separating iteration from the logic of what to do on each thing. It’s a special case of a loop.

              In a loop, the current iteration can depend on the previous iteration, and sometimes you need that. But in xargs, every iteration is independent, which is good because you can add xargs -P to automatically parallelize it! You can’t do that with a regular loop.
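
              A minimal sketch of that, assuming an xargs that supports -P (GNU and BSD do; it is not in POSIX). The work variable is a stand-in for a real task. Completion order is not guaranteed, so the output is sorted:

              ```shell
              # Hypothetical stand-in for real work: wait a second, then report the item.
              work='sleep 1; echo "done $1"'

              # Four independent items, up to four processes at once.
              # With -P 4 the whole batch should take about one second, not four.
              printf '%s\n' a b c d | xargs -n 1 -P 4 sh -c "$work" _ | sort
              ```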


              I would like Oil to grow an each builtin that is a cleaned up xargs, following the guidelines I enumerated.

              I’ve been wondering if it should be named each and every?

              • each – like xargs -n 1, and find -exec foo \; – call a process on each argument
              • every – like xargs, and find -exec foo + – call the minimal number of processes, but exhaust all arguments

              So something like

              proc myproc { echo $1 }   # passed one arg
              find . | each -- myproc  # call a proc/shell function on each file, newlines are the default
              
              proc otherproc { echo @ARGV }  # passed many args
              find . | every -- otherproc  # call the minimal number of processes
              

              If anyone has feedback I’m interested. Or wants to implement it :)


              Probably should add this to the blog post: Why use xargs instead of a loop?

              1. It’s easier to preview what you’re doing by sticking echo on the beginning of the command. You’re decomposing the logic of which things to iterate on, and what work to do.
              2. When the work is independent, you can parallelize with xargs -P
              3. You can filter the work with grep. Instead of find | xargs, do find | grep | xargs. This composes very nicely
          5. 14

            ls | grep foo | sed 's/.*/rm &/' | bash

            If someone creates a file named -rf ~ # foo, you’re about to have a very bad time. You’ll also wind up spawning a process for every file, which may or may not affect things (probably not much for rm, but definitely for something that has a lot of setup/teardown).
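
            A harmless way to see the difference, with the file name held in a variable and nothing deleted. The helper names count_split and count_xargs are made up, and -d assumes GNU xargs:

            ```shell
            # The hostile name from above; it is only counted, never executed.
            hostile='-rf ~ # foo'

            # Unquoted expansion, as in rm $(...), splits the name into separate words.
            # Globbing is switched off so only word splitting is shown.
            count_split() { set -f; set -- $1; set +f; echo "$#"; }

            # xargs -d '\n' hands the whole line over as one argument.
            count_xargs() { printf '%s\n' "$1" | xargs -d '\n' sh -c 'echo "$#"' _; }

            count_split "$hostile"   # 4
            count_xargs "$hostile"   # 1
            ```

            Even as a single argument, a name starting with a dash would still be read by rm as options, so rm -- (or a ./ prefix on the name) is needed as well. That is a separate problem from the splitting.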

            Overall this post feels like someone who doesn’t understand why xargs exists saying it’s useless. Like, he doesn’t even mention argument escaping, the command line length limit…

            Tangentially, the author has another post where he says that xterm has 1.6ms of latency, which makes it feel instant. I’d be very interested to know how xterm can display a character faster than the refresh rate of the display!

            1. 4

              Tangentially, the author has another post where he says that xterm has 1.6ms of latency, which makes it feel instant

              Presumably that is the latency that xterm alone adds to the pipeline, not the end-to-end latency. Keyboard debouncing alone generally adds 5-20ms of latency, so even ignoring the display 1.6ms is not really sensical.

              1. 3

                Yeah, I’m sure the 1.6ms number is real (it cites a fairly in-depth-looking LWN article, which does show a 1.6ms number for uxterm), and decreased latency will still result in it being faster for you to see the character on screen (since the refresh rate means that adding even a single millisecond of terminal latency can add 16ms of display latency if you push over a refresh interval). It’s just that the way that the author phrased the paragraph makes me think he thinks that if you have xterm open, you really will see that character onscreen 1.6ms later. But it’s possible I’m just experiencing an inverse halo effect: he’s wrong about xargs, so I’m assuming he’s wrong about unrelated things.

            2. 5

              Judging by the comments here I’m not interested in reading the article.

              But, why use ls | grep foo at all instead of *foo* as the argument for rm?

              1. 6

                I was also distracted by using the output of ls in scripting, which is a golden rule no-no.

                1. 1

                  Is this not what ls -D is for?

                2. 5

                  Despite “The UNIX Way” saying that we have all these little composable command line tools that we can interop using the universal interchange language of plaintext, it is also said that we should never parse the output of ls. The reasons for this are unclear to me; patches that would have supported this have been rejected.

                  Definitely the glob is the right way to do this, and if things get more complex the find command.

                  1. 5

                    “Never parse the output of ls” is a bit strong, but I can see the rationale for such a rule.

                    Basically the shell already knows how to list files with *.

                    for name in *; do  # no external processes started here, just glob()
                       echo $name
                    done
                    

                    That covers 90% of the use cases where you might want to parse the output of ls.

                    One case where you would is suggested by this article:

                    # Use a regex to filter Python or C++ tests, which is harder in the shell (at least a POSIX shell)
                    ls | egrep '.*_test.(py|cc)' | xargs -d $'\n' echo
                    

                    BTW I’d say ls is a non-recursive special case of find, and ls lacks -print for formatting and -print0 for parseable output. It may be better to use find . -maxdepth 1 in some cases, but I’m comfortable with the above.

                  2. 3

                    why use ls | grep foo at all instead of *foo* as the argument for rm

                    Almost always, I use the shell iteratively, working stepwise to my goal. Pipelines like that are the outcome of that process.

                    1. 2

                      I gave an example below – if you want to filter by a regex and not a constant string.

                      # Use a regex to filter Python or C++ tests, which is harder in the shell (at least a POSIX shell)
                      ls | egrep '.*_test.(py|cc)' | xargs -d $'\n' echo
                      

                      You can do this with extended globs too in bash, but that syntax is pretty obscure. You can also use regexes without egrep via [[. There are millions of ways to do everything in shell :)

                      I’d say that globs and find cover 99% of use cases, I can see ls | egrep being useful on occasion.

                      1. 1

                        If normal globs aren’t enough, I’d use extended glob or find. But yeah, find would require options to prevent hidden files and recursive search compared to default ls. If this is something that is needed often, I’d make a function and put it in .bashrc.

                        That said, I’d use *_test.{py,cc} for your given example and your regex should be .*_test\.(py|cc)$ or _test\.(py|cc)$
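
                        A quick check of the two patterns, using invented file names; the last one is not a test file but still slips through the looser regex:

                        ```shell
                        # Hypothetical file names; only the first two are real test files.
                        names() { printf '%s\n' foo_test.py bar_test.cc notes_testxpy.txt; }

                        # Unanchored, and "." matches any character: all 3 names match.
                        names | grep -E '.*_test.(py|cc)'

                        # Escaped dot plus end anchor: only the 2 test files match.
                        names | grep -E '_test\.(py|cc)$'
                        ```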

                        I have parsed ls occasionally too - ex: -X to sort by extension, -q and pipe to wc for counting files, -t for sorting by time, etc.

                        And I missed the case of too many arguments for rm *foo* (for which I’d use find again) regarding the comment I made. I should’ve just read the article enough to know why ls | grep was being used.

                      2. 1

                        That’s clearly just a placeholder pipeline. No one actually wants *foo* anyhow.

                      3. 5
                        $ touch '--no-preserve-root / foo' 'foo'
                        $ rm $(ls | grep foo)
                        
                        1. 4

                          it seems paste has been forgotten.

                          $ seq 10 | paste -d' ' - - -
                          1 2 3
                          4 5 6
                          7 8 9
                          10
                          
                          1. 4

                            I disagree with the article in general, but I’ll give it an up-vote because I learnt about the -L option which I’ve never seen/used before.

                            1. 5

                              I’ve never used it, but I claim it can always be replaced with -n? (see my long comment here)

                              $ seq 10 | xargs -n 3 -- echo
                              1 2 3
                              4 5 6
                              7 8 9
                              10
                              

                              I’m interested in any counterexamples. The difference appears to be that -n works on args that were already tokenized by -d or -0, while -L has its own tokenization rules? I think the former is better because it’s more orthogonal to the rest of the command.

                              Here’s a longer example:

                              $ { echo 'foo bar'; echo 'spam    eggs'; echo 'ale bean'; } | xargs -d $'\n' -n 2 -- ~/bin/argv
                              ['foo bar', 'spam    eggs']
                              ['ale bean']
                              

                              It correctly does the tokenization you want (split on newlines), and then produces batches of 2 args.

                              1. 3

                                I don’t think -L can always be replaced with -n. They appear the same because seq 10 gives only one token on each line, and -L aggregates lines. Look what happens if you have 3 tokens on each line, for example:

                                $ seq 10 | xargs -L 3 | xargs -L 2 
                                1 2 3 4 5 6
                                7 8 9 10
                                

                                While -n is tokens:

                                $ seq 10 | xargs -n 3 | xargs -n 2 
                                1 2
                                3 4
                                5 6
                                7 8
                                9 10
                                

                                I agree that -n seems more generally useful.

                                1. 3

                                  Yeah I shouldn’t have said “always replace”. I think it’s more like “-L is never what you want; you want -n” :) It does something different that’s not good. Again I’d be interested in any realistic counterexamples