1. 15
    1. 34

      when i got to google NYC i was sitting next to an older guy who seemed to spend all the workday talking to his broker. a few weeks in, i finally caught him doing some work, as he cursed at a lengthy awk invocation. seeking to impress him with how plugged-into the NYC scene i was despite just having come up from georgia, i opined “if you’re having trouble with awk, you know, kernighan is a few floors up, and aho is down the street. they’re the a and k in awk.” he stared balefully at me. “the w stands for ‘who-the-hell-knows’, heh.” thunderclouds descended. “i’m peter weinberger. i’m the w in awk.” we didn’t talk much after that, and one of the monthly reorgs moved the go team away soon thereafter.

      1. 7

        Ha, that’s hilarious. After I wrote GoAWK, I was chatting offline with Alan Donovan (co-author of The Go Programming Language) and I found out Weinberger worked with him on the Go team, so I asked to meet up with them. We talked about what he worked on then (gopls, I think), what he’d do differently in AWK if writing it now, and we talked about Elm, which I’d just learnt. It was a fun lunch!

      2. 4

        “I’ve forgotten more about AWK than you’ll ever know!”

    2. 9

      Here’s the feature I wish awk had: regex groups.

      Awk has deep support for regexes, allowing a script to grab lines matching subtle patterns and process them.

      Awk has deep support for tabular data, allowing a script to pick out individual columns, process them, and spit out new columns in response.

      For some reason, Awk doesn’t let you write a regex that matches interesting parts of a line, and then process them. The best you can do is write a regex that matches all the interesting parts of a line, then write a bunch of gsub() calls with regexes that match each individual part. Those regexes are similar to the original, but each is different in small, easy-to-get-wrong ways.

      I get it, it’s an old language, I shouldn’t judge it against my modern perspective. But still, writing a regex to pick out parts of a line, and then a block of code to process them, feels like the awkiest thing ever, and it baffles me that it doesn’t work like that.
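      To make that gsub()/sub() workaround concrete, here is a sketch in plain POSIX awk (the input line is invented): note how the match regex and the two stripping regexes all describe the same text in slightly different, easy-to-get-wrong ways.

      ```shell
      # extract the number after "baz=": one regex to find the line,
      # then two near-duplicate regexes to strip away everything else
      echo 'foo=42, baz=314' | awk '/baz=[0-9]+/ {
          s = $0
          sub(/.*baz=/, "", s)    # remove everything up to the interesting part
          sub(/[^0-9].*/, "", s)  # remove everything after it
          print s
      }'
      # prints 314
      ```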

      1. 9

        If you have gawk, you can use the match() function and access the matched portions via an array. You’d still need a loop if there are multiple matches:

        # using substr and RSTART/RLENGTH
        $ s='051 035 154 12 26 98234'
        $ echo "$s" | awk 'match($0, /[0-9]{4,}/){print substr($0, RSTART, RLENGTH)}'
        # using array 3rd argument in gawk
        $ echo "$s" | awk 'match($0, /[0-9]{4,}/, m){print m[0]}'
        # matched portion of first capture group
        $ echo 'foo=42, baz=314' | awk 'match($0, /baz=([0-9]+)/, m){print m[1]}'
        # extract a number only if it is followed by a comma
        $ s='42 foo-5, baz3; x-83, y-20: f12'
        $ echo "$s" | awk '{ while( match($0, /([0-9]+),/, m) ){print m[1];
                           $0=substr($0, RSTART+RLENGTH)} }'
      2. 2

        The reason awk (nawk) doesn’t is because its regex implementation is based on automata, and adding capture groups to automata is essentially an open research problem. It certainly wasn’t well understood back then, and it still doesn’t seem to be now, in 2021. I’ve been chipping away at it for a while. The interactions with anchoring get incredibly gnarly.

        If you look at implementations that do support captures, such as PCRE, they tend to use a completely different operational model. PCRE is interpreted, with a JIT compiler to close some of the performance gaps, but it’s still really easy to write regexes that would cause PCRE to block for weeks due to backtracking. It has resource limits in place to prevent DoS attacks, but then the matches just fail with an error code indicating which resource (stack depth, etc.) limit was exceeded.

        Something awk-like could be based on a more feature-ful regex implementation, but it would change the cost model considerably. (It sounds like gawk does; I’m only familiar with nawk.) At that point, I’d like to see it also have support for more sophisticated parser tooling.

      3. 2

        Yeah, this is exactly why I use gawk: you can capture groups with the match() function. It might be the only reason I use gawk!

        FWIW Bash actually lets you capture groups with [[ $x =~ $pat ]] and ${BASH_REMATCH[1]}, etc. If for some reason you’re using bash but not gawk.
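        A minimal sketch of that bash feature (the variable names here are made up):

        ```shell
        # [[ =~ ]] is bash-only, so invoke bash explicitly in case your shell is sh
        bash -c '
          x="foo=42, baz=314"
          pat="baz=([0-9]+)"
          [[ $x =~ $pat ]] && echo "${BASH_REMATCH[1]}"  # first capture group
        '
        # prints 314
        ```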

        Oil makes this a little nicer with (x ~ pat) and _match(1) or _match('name').

        I agree this feature is missing from the traditional tools in the POSIX spec.

        1. 1

          Quick random idea. What if we combined structural regular expressions + awk? That would be something! I haven’t seen structural regexes used in practice at all.

          1. 1

            If I recall correctly, the structural regex paper includes a pseudocode example for an awk-like language based on structural regexes. Just looking at it, it seemed like the Obvious Right Thing To Do, and it makes me sad that nobody (including myself) seems to have implemented such a thing.

    3. 4

      awk is great, but I find I can do pretty much all of its record-at-a-time parsing with perl’s -n or -p switches, and perl’s autosplit -a switch saves a lot of time. The other record- and field-parsing switches documented in perlrun also get a lot of play. Plus, if you need to bring in other features like database connections or output in formats like JSON, XML, or YAML, you can tap into modules from CPAN.
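      A hedged side-by-side of what this describes (the input is invented): with -a, perl autosplits each record into @F, so the two one-liners line up almost token for token.

      ```shell
      # print the second whitespace-separated field of each line, both ways
      printf 'alpha 1\nbeta 2\n' | awk '{ print $2 }'
      printf 'alpha 1\nbeta 2\n' | perl -lane 'print $F[1]'
      # each prints:
      # 1
      # 2
      ```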

      awk will always win at raw speed for what it does, but it’s another slightly-less-appropriate-to-general-programming tool to learn and master. As soon as your pipeline gets a bit too complicated to comfortably implement in shell and awk you’ll be reaching for Perl anyway – that was its original use case, after all.

      1. 5

        One big difference is that in 2021, searching for Awk documentation and tutorials will give you information about record-at-a-time text processing, while searching for Perl documentation and tutorials digs up information about writing CGI scripts and maintaining legacy codebases. As somebody who occasionally wants to write portable text-processing tools, I’m pretty sure Perl would be a better choice than awk most of the time, but since I didn’t learn Perl before 1995 I feel like that knowledge has been lost.

        1. 2

          Thank you for the excellent blog post idea!

          1. 1

            Follow-up blog post here

      2. 1

        Agreed. I use Ruby, though. Ruby has much better syntax and has inherited -n & -p goodness from Perl.

    4. 2

      I’d wager that the number of Awk programmers is about the same as the number of Scheme programmers.

    5. 1

      There is another way to skin this cat. Awk is mostly the “programming language” version of a generic row processor. You can also just generate the whole program (pretty easily in almost any language you already like). E.g., here is one for Nim: https://github.com/c-blake/cligen/blob/master/examples/rp.nim with example “code” in the main doc comment. (They need input like seq 1 1000 or pastes of such.)
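      The generate-the-whole-program idea can be sketched in shell, with awk standing in for the compiled target language (the rp name echoes the linked rp.nim, but this wrapper is made up): the user supplies only the per-row body, and the wrapper splices it into a complete program.

      ```shell
      # hypothetical mini "rp": wrap a per-row snippet into a full program, then run it
      rp() {
          body=$1; shift
          awk "{ $body }" "$@"   # a real rp would emit Nim or C here and compile it
      }
      printf '1 2\n3 4\n' | rp 'print $1 + $2'
      # prints:
      # 3
      # 7
      ```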

      With generated C rather than Nim and TinyC/tcc you can even get the start-up time down to similar to mawk/gawk. Row processing itself runs at full compiled speed, and optimize-compiled speed is only a gcc -O3 away if you have single machine data (a common case for me).

      Besides probably being faster on “enough” data to amortize compilation costs, a “native compiled awk-ish-XYZ” can also be more type safe (with slightly more type ceremony, like my s(0)/f(0)/i(0) instead of just $1). Depending upon your “target language”, keystrokes for “one liners” might well be fewer than in any awk (e.g. Nim has less need for (), {}s, ‘;’). Adding new “powers” is often as close at hand as the stdlib of your target language and a simple “include”, maybe stowed just once in some config file.

      Type systems & brevity & start-up cost amortization aside, this “rotated idea” is worthy of serious consideration, if for no other reasons than being “portable in concept” and making it about 100x simpler to implement something that can run faster.