Also, it depends on which AWK implementation. mawk is generally much faster than gawk.
Setting up the rough equivalent to what is in the post for the first “benchmark”, I get 0.164s for cut, 0.225s for mawk, and 0.413s for gawk. Similar ratios with the other test.
I find the conclusion of the post to be pretty flimsy.
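For reference, a minimal sketch of the kind of harness I mean; the input file and exact commands are my assumptions, not the post’s setup:

# generate a million lines of two space-separated columns (made-up input)
seq 1000000 | awk '{print "foo", $1}' > input.txt
# time each tool extracting column 2, discarding the output
time cut -d ' ' -f 2 < input.txt > /dev/null
time mawk '{print $2}' < input.txt > /dev/null
time gawk '{print $2}' < input.txt > /dev/null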
I agree the conclusion is flimsy. Much also depends upon the CPU. I just tested gawk-5.1.0 as between 1.20x and 1.55x faster (4 different CPUs) than mawk-1.3.4_p20200106 for “extracting column 2”.
TIL about the -s flag to tr.
This makes the mistaken assumption that the reader doesn’t care what the output will look like.
While cut and awk do mostly the same thing, they can behave vastly differently; see:
mattrose@rome ~ % cat cutvawk
foo bar
foo bar
foo bar
mattrose@rome ~ % cat cutvawk | awk '{print $2}'
bar
bar
bar
mattrose@rome ~ % cat cutvawk | cut -d ' ' -f 2
bar
foo bar
Speed tests are fine, but they won’t tell you the right tool to use for any given job.
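(The gotcha above is invisible on screen: presumably some of the input lines use a tab or doubled space where others use a single space — an assumption on my part, since the transcript can’t show it. A portable way to reveal it:

# print each line unambiguously; tabs show as \t, line ends as $
sed -n 'l' cutvawk

Any line whose separator isn’t a lone space will produce different cut output.)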
The author specifically addresses this problem directly, clearly, and explicitly in their post. As cut doesn’t handle arbitrary spacing, they use tr to clean up the spacing first. Whether that’s squeezing spaces or converting tabs, tr can do it.
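For instance, a couple of hedged one-liners (the file name is a placeholder):

# squeeze runs of spaces down to one, then cut
tr -s ' ' < input.txt | cut -d ' ' -f 2
# convert tabs to spaces and squeeze the result as well
tr -s ' \t' ' ' < input.txt | cut -d ' ' -f 2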
> This makes the mistaken assumption that the reader doesn’t care what the output will look like.
This makes the mistaken assumption that the author is a complete idiot. Obviously they care about correct output.
FreeBSD cut has -w to “Use whitespace (spaces and tabs) as the delimiter. Consecutive spaces and tabs count as one single field separator.” It’s had this for years, and it’s something I miss in GNU cut; it’s pretty useful. Maybe I should look at sending a patch.
Presumably the GNU project thinks sequences of whitespace should be handled by awk; it’s referenced in the info page for cut:
> Note awk supports more sophisticated field processing, like reordering fields, and handling fields aligned with blank characters. By default awk uses (and discards) runs of blank characters to separate fields, and ignores leading and trailing blanks.
[awk invocations snipped]
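As a hedged sketch of what that info note is getting at — not the snipped text itself — the field handling it describes looks like:

# reorder fields; runs of blanks separate fields, leading blanks are ignored
awk '{print $2, $1}' < input.txt
# extract a single field the same way
awk '{print $2}' < input.txt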
This shows that the perennial discussion about “one thing well” and composability is granular, and not really separable into “GNU just extends everything, *BSD keeps stuff small”: the FreeBSD version of cut is “extended” so that awk isn’t needed for successive runs of whitespace.
(OpenBSD cut does not have the -w option: https://man.openbsd.org/cut)
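A quick hedged illustration of what -w buys you (GNU cut lacks it, so tr stands in):

# FreeBSD: consecutive spaces and tabs count as one separator
printf 'foo  \tbar\n' | cut -w -f 2
# GNU: emulate -w by normalizing the whitespace first
printf 'foo  \tbar\n' | tr -s ' \t' ' ' | cut -d ' ' -f 2
# both print: bar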
Yeah, it’s just a common use case; awk is an entire programming language, and resorting to it for this kind of thing is a bit of overkill. Practicality beats “do one thing well” purity, IMO. And I don’t think it really goes against that in the first place (it “feels” natural to me; very wishy-washy, I know, but these kinds of discussions tend to be).
If the author cared about output, why would he cat the results to /dev/null in the OP?
My point is that there are considerations other than raw speed when deciding between cut and awk: awk is far more forgiving of invisible whitespace differences in the input than cut is, and this is not really mentioned in the post, even though I’ve seen it happen with cut so many times.
> If the author cared about output, why would he cat the results to /dev/null in the OP?
To better present the timing information in a blog post, and to better measure the speed of these programs without accidentally measuring the speed of their terminal emulator at consuming the output.
Seriously, if the author didn’t care about output, why bother using tr in their second example at all?
> awk is far more forgiving of invisible whitespace differences in the input than cut is, and this is not really mentioned in the post
It’s explicitly mentioned in the post. See example 2, where the author explains using tr -s for exactly this reason.
You always need to know what your input looks like, right?
I had this exact case in mind reading this. It’s happened a lot, and it’s why I’ve defaulted to always using Awk.
Yeah, they’re different tools that do different things, and conflating them like this has bitten people in the backside before.
The second example uses tr -s ' ' to collapse runs of spaces. If your input contains tabs as well, you could compress them too with tr -s ' \t' ' '. As I understand it,
tr -s ' \t' ' ' < inputfile | cut -d ' ' -f $N
will give the same output as
awk "{print \$$N}" < inputfile
for all positive integer values of N (up to overflow).
That is still not the same as default awk behavior, because awk will remove leading/trailing spaces, tabs, and newlines as well.
Ah, nice. Thanks for the correction.
The awk I have (gawk 5.1) will not remove leading or trailing newlines in the file, and otherwise processes the file line-wise, but it will strip leading spaces and tabs before counting fields, which cut does not.
Newlines come into the picture when the input record separator doesn’t remove them; here’s an example that I answered a few days back: https://stackoverflow.com/questions/64870968/sed-read-a-file-get-a-block-before-specific-line/64875721#64875721
Sure. I was talking about the defaults, though.
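To make the leading-blank difference concrete, a small made-up demo:

# awk strips leading blanks before counting fields
printf '  foo bar\n' | awk '{print $1}'
# the tr|cut pipeline keeps an empty leading field instead
printf '  foo bar\n' | tr -s ' \t' ' ' | cut -d ' ' -f 1
# the first prints “foo”; the second prints an empty line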
I have a big soft spot for Python-based scripting (used here as the “slow” benchmark), and, like… when you can proficiently do your one-time filtering, the fact that it takes 5 seconds instead of 0.5 is not really a big deal. But thinking about this a bit more: when you’re iterating and trying to “tweak” stuff, your single-line awk command starts being nice (much quicker to iterate).
Though I do wish that it was easier in a shell to, like, jump around arguments and edit multiline arguments without being deathly afraid of hitting enter or whatnot, without having to just resort to working inside of a shell script.
Kinda surprises me that there’s no shell that leans into how args work to provide a nicer interface for them (something like “enter puts you onto the next arg, shift-enter runs the command” or whatever, so you don’t need to worry about escaping).
> Though I do wish that it was easier in a shell to, like, jump around arguments and edit multiline arguments without being deathly afraid of hitting enter or whatnot, without having to just resort to working inside of a shell script.
I will assume you are not aware that one can edit a command-line in your $EDITOR.
References:
> edit-and-execute-command (C-x C-e)
> Invoke an editor on the current command line, and execute the result as shell commands. Bash attempts to invoke $VISUAL, $EDITOR, and emacs as the editor, in that order.
I was thinking of some slightly different mechanisms, but this is very interesting and seems useful! Thank you for the tip.
Something like PowerShell’s ISE (Integrated Scripting Environment) would be really cool.
Still very much the case in UNIX-like operating systems I use (OpenBSD).
Does OpenBSD still have both cat -v and vis?
Yes.