Once we move beyond one-liners, a natural question is why. As in ‘Why not use Python? Isn’t it good at this type of thing?’
The reasons provided are fine, but for me the main reason is speed. AWK is much, much faster than Python for “line at a time” processing. When you have large files, the difference becomes clear. (perl -p can be a reasonable substitute.)

Once you are writing long AWK programs, though, it’s time to consider Python or something else. AWK isn’t very fun once data manipulation gets complicated.
+1. In my eyes, it’s Awk and then Perl. Perl turns out to be much better for these purposes than other scripting languages. The difference in startup time between Perl and Python is very significant. If you don’t use (m)any modules, Perl scripts usually start just as quickly as Awk scripts.
I’m sure that’s true for some kinds of scripts, but that doesn’t match my experience/benchmarks here (Python is somewhat faster than AWK for this case of counting unique words). For what programs did you find AWK “much, much faster”? I can imagine very small datasets being faster in AWK because its startup time is 3ms compared to Python’s 20ms.
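For context, the “counting unique words” task being benchmarked is essentially a streaming frequency count. A minimal Python sketch of that pattern (the actual benchmark code may differ):

```python
from collections import Counter

def count_words(lines):
    """Count word frequencies one line at a time (streaming, not slurping)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Works on any iterable of lines, e.g. open("big.txt") for true streaming.
counts = count_words(["Foo bar foo\n", "bar baz\n"])
print(counts.most_common())  # → [('foo', 2), ('bar', 2), ('baz', 1)]
```

The equivalent awk idiom would accumulate into an associative array and print it in an END block.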
Any time the input file is big. As in hundreds of MB big.
I used to have to process 2GB+ of CSV on a regular basis and the AWK version was easily 5x faster than the Python version.
Was the Python version streaming, or did it read the whole file in at once?
Streaming.
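For what it’s worth, “streaming” CSV processing in Python usually means iterating the reader rather than loading the whole file; a sketch under that assumption (the column layout here is invented for illustration):

```python
import csv
import io

def total_second_column(f):
    """Stream a CSV row by row; csv.reader pulls lines lazily from the
    file object, so memory use stays flat regardless of file size."""
    total = 0.0
    for row in csv.reader(f):
        if row:                     # skip blank lines
            total += float(row[1])  # assume the second column is numeric
    return total

# Works identically on a real file opened with open(path, newline="").
sample = io.StringIO("a,1.5\nb,2.5\n")
print(total_second_column(sample))  # → 4.0
```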
Regarding your results, is the 3.55 under awk with or without -b?

I get 1.774s (simple) and 1.136s (optimized) for Python. For simple awk, I get 2.552s (without -b) and 1.537s (with -b). For optimized, I get 2.091s and 1.435s respectively. I’m using gawk here; mawk is of course faster.

Also, I’ve noticed that awk does poorly when there is a large number of dictionary keys. If you are doing field-based decisions, awk is likely to be much faster. I tried printing the first field of each line (I removed empty lines from your test file, since line.split()[0] gives an error for empty lines). I got 0.583s for Python compared to 0.176s (without -b) and 0.158s (with -b) for awk.

Same here. If you are making extensive use of arrays, then AWK may not be the best tool.
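The empty-line problem above comes from line.split() returning an empty list, so line.split()[0] raises IndexError. A guarded Python sketch of that first-field test (roughly what awk does with 'NF { print $1 }'):

```python
def first_fields(lines):
    """Yield the first whitespace-separated field of each line, skipping
    lines with no fields (blank or whitespace-only), where line.split()[0]
    would raise IndexError."""
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0]

# Example with an in-memory list; a real run would pass an open file.
print(list(first_fields(["alpha beta\n", "\n", "  gamma\n"])))  # → ['alpha', 'gamma']
```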
I dunno, I think it’s pretty fun.
I am consistently surprised that there aren’t more tools that support AWK-style “record oriented programming” (since a record need not be a line, if you change the record separator). I found this for Go, but that’s about it. This style of data interpretation comes up pretty often in my experience. I feel like, as great as AWK is, we could do better - for example, something like AWK that can read directly from CSV (with proper support for quoting), assigning each row to a record, and perhaps with more natural support for headers.
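Python’s csv.DictReader already gets partway toward that wish list; a sketch of AWK-style per-record processing with header-named fields instead of $1, $2, ... (the field names and threshold are invented for illustration):

```python
import csv
import io

def high_scores(f, threshold):
    """Treat each CSV row as a record; fields are addressed by header name.
    csv handles quoted fields (including embedded commas) correctly."""
    for record in csv.DictReader(f):
        if int(record["score"]) > threshold:
            yield record["name"]

data = io.StringIO('name,score\n"Smith, Jane",90\nBob,60\n')
print(list(high_scores(data, 75)))  # → ['Smith, Jane']
```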
You are right. Recently I mixed AWK and Python in such a way that AWK produced key,value output that was easily read and processed later by a Python script. Nice, simple, and quick to develop.
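That pattern might look something like this on the Python side, consuming the key,value lines an upstream awk stage emits (a sketch; the actual scripts aren’t shown in the comment):

```python
def sum_by_key(lines):
    """Sum the value column per key from "key,value" lines, as produced by
    an upstream stage such as: awk -F, '{print $1 "," $2}'"""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, value = line.split(",", 1)
        totals[key] = totals.get(key, 0.0) + float(value)
    return totals

# In the pipeline this would read sys.stdin; shown here with sample lines.
print(sum_by_key(["hits,3\n", "misses,1\n", "hits,2\n"]))
```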
AWK is one of the most important languages in my toolbox. When I have to process a lot of log files, there is no better tool. Unfortunately, younger SDEs are not very familiar with this language.
I used to build computational biology pipelines and do a substantial amount of data processing with AWK, and it is one of the best languages I’ve used.
I stumbled into an ~all-awk project today and it made me think a link to a search full of mostly-awk projects might be a nice breadcrumb here: https://github.com/search?q=language%3Aawk
(I found github’s language search to be a little wonky a while back and I’m not sure if they’ve fixed it; it wouldn’t shock me if there are good all-awk projects missing from the search…)