1. 26
  1.  

  2. 5

    Also, it depends on which AWK implementation. mawk is generally much faster than gawk.

    Setting up the rough equivalent to what is in the post for the first “benchmark”, I get 0.164s for cut, 0.225s for mawk, and 0.413s for gawk. Similar ratios with the other test.
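
    Roughly, the shape of the comparison (the file name and column here are stand-ins for what the post uses; output goes to /dev/null, as in the post, so the terminal isn’t part of the measurement):

    # rough sketch -- bigfile.txt and field 2 stand in for the post's input
    time cut -d ' ' -f 2 < bigfile.txt > /dev/null
    time mawk '{ print $2 }' < bigfile.txt > /dev/null
    time gawk '{ print $2 }' < bigfile.txt > /dev/null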

    I find the conclusion of the post to be pretty flimsy.

    1. 2

      I agree the conclusion is flimsy. Much also depends upon the CPU. I just measured gawk-5.1.0 as between 1.20x and 1.55x faster than mawk-1.3.4_p20200106 (across 4 different CPUs) for “extracting column 2”.

    2. 5

      TIL about the -s arg to tr.
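
      In case it helps anyone else: -s (“squeeze repeats”) collapses each run of a character from the given set into a single occurrence, e.g.:

      $ echo 'foo   bar' | tr -s ' '
      foo bar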

      1. 6

        This makes the mistaken assumption that the reader doesn’t care what the output will look like.
        While cut and awk mostly do the same thing, they can behave vastly differently; see:

        mattrose@rome ~ % cat cutvawk
        foo  bar
        foo bar
        foo	bar
        
        mattrose@rome ~ % cat cutvawk | awk '{print $2}'
        bar
        bar
        bar
        
        mattrose@rome ~ % cat cutvawk | cut -d ' ' -f 2
        
        bar
        foo	bar
        
        

        Speed tests are fine, but they won’t tell you the right tool to use for any given job.

        1. 9

          The author specifically addresses this problem directly, clearly, and explicitly in their post. As cut doesn’t handle arbitrary spacing, they use tr to clean up the spacing first. Whether that’s squeezing spaces or converting tabs, tr can do it.

          This makes the mistaken assumption that the reader doesn’t care what the output will look like.

          This makes the mistaken assumption that the author is a complete idiot. Obviously they care about correct output.

          1. 7

            FreeBSD cut has -w to “Use whitespace (spaces and tabs) as the delimiter. Consecutive spaces and tabs count as one single field separator.”

            It’s had this for years, and it’s something I miss in GNU cut; it’s pretty useful. Maybe I should look at sending a patch.
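
            For anyone who hasn’t used it, it’s just (FreeBSD only; GNU and OpenBSD cut don’t accept -w):

            $ printf 'foo  \tbar\n' | cut -w -f 2
            bar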

            1. 4

              Presumably the GNU project thinks sequences of whitespace should be handled by awk; it’s referenced in the info page for cut:

              Note awk supports more sophisticated field processing, like reordering fields, and handling fields aligned with blank characters. By default awk uses (and discards) runs of blank characters to separate fields, and ignores leading and trailing blanks.

              [awk invocations snipped]

              This shows that the perennial discussion about “one thing well” and composability is granular and not really separable into “GNU just extends everything, *BSD keeps stuff small”, as the FreeBSD version of cut is “extended” to not need awk for successive runs of whitespace.

              (OpenBSD cut does not have the -w option: https://man.openbsd.org/cut)

              1. 4

                Yeah, it’s just a common use case; awk is an entire programming language and resorting to it for these kind of things is a bit overkill. Practicality beats “do one thing well” kind of purity IMO. And I also don’t think it really goes against that in the first place (it “feels” natural to me, very wishy-washy I know, but these kind of discussions tend to be).

            2. 1

              If the author cared about output, why would he cat the results to /dev/null in the OP?

              My point is that there are considerations other than raw speed when deciding between cut and awk: awk is far more forgiving of invisible whitespace differences in the input than cut is, and this is not really mentioned in the post, even though I’ve seen it bite people with cut so many times.

              1. 3

                If the author cared about output, why would he cat the results to /dev/null in the OP?

                To better present the timing information in a blog post, and to better measure the speed of these programs without accidentally measuring the speed of their terminal emulator at consuming the output.

                Seriously, if the author didn’t care about output, why bother using tr in their second example at all?

                awk is far more forgiving of invisible whitespace differences in the input than cut is, and this is not really mentioned in the post

                It’s explicitly mentioned in the post. See example 2, where the author explains using tr -s for exactly this reason.
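
                For reference, that second example is roughly of this shape (the file name here is a placeholder, not the one from the post):

                tr -s ' ' < input.txt | cut -d ' ' -f 2 > /dev/null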

            3. 3

              You always need to know what your input looks like, right?

              % cat <<. | tr -s ' \t' ' ' | cut -d ' ' -f 2
              foo  bar
              foo bar
              foo     bar
              .
              bar
              bar
              bar
              
              1. 1

                I had this exact case in mind reading this. It’s happened a lot and is why I’ve defaulted to always using Awk.

                1. 1

                  Yeah, they’re different tools that do different things, and conflating them like this has bitten people in the backside before.

                2. 1

                  The second example uses tr -s ' ' to collapse runs of spaces. If your input contains tabs as well, you could compress them too with tr -s ' \t' ' '. As I understand it,

                  tr -s ' \t' ' ' < inputfile | cut -d ' ' -f $N
                  

                  will give the same output as

                  awk "{print \$$N}" < inputfile
                  

                  for all positive integer values of N (up to overflow).

                  1. 3

                    That is still not the same as the default awk behavior, because awk will also remove leading/trailing spaces, tabs, and newlines:

                    $ printf '    a  \t  b      3   '  | awk '{print $2}'
                    b
                    $ printf '    a  \t  b      3   '  | tr -s ' \t' ' ' | cut -d ' ' -f2
                    a
                    
                    1. 2

                      Ah, nice. Thanks for the correction.

                      The awk I have (gawk 5.1) will not remove leading or trailing newlines from the file and otherwise processes it line-wise, but it will strip leading spaces and tabs before counting fields, whereas cut does not.

                      1. 1

                        Newlines come into the picture when the input record separator doesn’t remove them.

                        Here’s an example I answered a few days back: https://stackoverflow.com/questions/64870968/sed-read-a-file-get-a-block-before-specific-line/64875721#64875721
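
                        A minimal sketch of what I mean (the separator and input are made up for illustration): with RS set to ':' the first record is '\na\nb\n', and default field splitting throws away the surrounding newlines along with the blanks:

                        $ printf '\na\nb\n:c' | awk -v RS=':' '{print $2; exit}'
                        b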

                        1. 1

                          Sure. I was talking about the defaults, tho.

                3. 1

                  I have a big soft spot for Python-based scripting (used here as the “slow” benchmark), and like… when you can proficiently do your one-time filtering the fact that it takes 5 seconds instead of 0.5 is not really a big deal. But thinking about this a bit more, when you’re iterating and trying to “tweak” stuff, your single-line awk command starts being nice (much quicker to iterate).

                  Though I do wish that it was easier in a shell to, like, jump around arguments and edit multiline arguments without being deathly afraid of hitting enter or whatnot, without having to just resort to working inside of a shell script.

                  Kinda surprises me that there’s no shell that leans into how args work to provide a nicer interface for them (something like “enter puts you onto the next arg, shift-enter runs the command” or w/e, so you don’t need to worry about escaping).

                  1. 3

                    Though I do wish that it was easier in a shell to, like, jump around arguments and edit multiline arguments without being deathly afraid of hitting enter or whatnot, without having to just resort to working inside of a shell script.

                    I will assume you are not aware that you can edit the command line in your $EDITOR.

                    edit-and-execute-command (C-xC-e)

                    Invoke an editor on the current command line, and execute the result as shell commands. Bash attempts to invoke $VISUAL, $EDITOR, and emacs as the editor, in that order.

                    References:

                    1. https://linux.die.net/man/1/bash
                    2. http://shellhaters.org/deck/#39
                    3. https://www.youtube.com/watch?v=olH-9b3VJfs&t=1264
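
                    (Related, though not in the list above: in a bare POSIX shell the fc builtin covers part of this for the previous command; it opens it in $EDITOR and executes whatever you save.)

                    $ echo hellp
                    hellp
                    $ fc    # fix the typo in $EDITOR; the corrected command runs when you save and quit
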
                    1. 2

                      I was thinking of some slightly different mechanisms but this is very interesting and seems useful! Thank you for the tip

                    2. 1

                      Something like PowerShell’s ISE (Integrated Scripting Environment) would be really cool.

                    3. 1

                      In UNIX, there is (or used to be) a philosophy which, among other things, stipulates that a piece of software should do “one thing well”.

                      Still very much the case in the UNIX-like operating systems I use (OpenBSD).

                      1. 1

                        Does OpenBSD still have both cat -v and vis?

                        1. 1

                          Yes