1. 20
  1. 5

    Well, I guess I’m a command line wizard, though maybe yet fledgling, or bumbling.

    Given a text file and an integer k, print the k most common words in the file (and the number of their occurrences) in decreasing frequency.

    sed $'s/[^[:alpha:]]/\\\n/g' "${text_file}" | sort | uniq -c | sort -gr | head -n "${integer_k}"
    

    Does my sed support that :alpha:? Maybe not, but it probably will with a flag. Do I need to add a flag to make the first sort case insensitive? Maybe. McIlroy considered case sensitivity. Seems reasonable. Does my solution include the empty line in the count? Sure does, and it might make the top ${integer_k} list, depending on the input! That’d be a bug. Oh, another one: uniq may need a flag to become case insensitive…
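
    A minimal sketch of how the pipeline might look with those worries addressed, assuming a POSIX-ish `tr`, `sort`, `uniq`, and `head`. It swaps `sed` for `tr` to split words, folds case before counting, and drops empty lines. The function name `top_words` and its argument order are made up for illustration.

    ```shell
    # top_words FILE K: print the K most frequent words in FILE, with counts.
    # Hypothetical helper; case is folded so "The" and "the" count together.
    top_words() {
        # Turn every run of non-letters into a single newline: one word per line.
        tr -cs '[:alpha:]' '\n' < "$1" |
            # Fold to lowercase so neither sort nor uniq needs a case flag.
            tr '[:upper:]' '[:lower:]' |
            # Drop the empty line that appears if the file starts with a non-letter.
            sed '/^$/d' |
            sort |
            uniq -c |
            # Highest count first; ties broken alphabetically.
            sort -k1,1nr -k2 |
            head -n "$2"
    }
    ```

    Called as `top_words "${text_file}" "${integer_k}"`, it prints `uniq -c` style lines, count first and then the word.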

    But, I didn’t think about case sensitivity until I read McIlroy’s response. My own solution took longer to type than to imagine… I built it in my mind in real time as I read the challenge.

    Here’s a sort of reenactment.

    most common words

    …get-words to sort-uniq-c-head…

    and the number of their occurrences

    …already have that…

    decreasing frequency

    …rev-sort-general-numeric…

    They aren’t English thoughts… They aren’t visualizations… Those unix commands are internalized.

    Such is the power of *nix! It’s addictive. It’s valuable.

    It’s hard to teach, too. You can’t ever forget it, though. You may forget the flags (or they may change over time), but with *nix in your toolbelt, everything looks like text.

    I’m filing John D. Cook’s post under ‘convert users’. Experience tells me I don’t know how to describe this stuff to muggles. John has made the attempt; I’ll keep it in mind for next time.

    1. 1

      Whoops, this:

      …get-words to sort-uniq-c-head…

      was supposed to say something like this:

      …get-words to sort-uniq-c-sort-gr-head…

    2. 3

      For anyone who’d like to learn more about this kind of “magic”: this is pretty much exactly what we covered in the “data wrangling” lecture of our lecture series on programmer tools :)

      1. 2

        Honestly, I hate having to fight with many of these obtuse tools, and I generally find it detestable when my solution depends on more than 5-6 levels of piping. I also don’t think people should be proud of one giant word-wrapped command. Shells support newline continuations using backslashes; people should USE them. Long options also go a long way toward helping people understand what massive shell transformation pipelines do.
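
        To make that concrete, here is a sketch of a word-frequency pipeline written with backslash continuations and long options. It assumes GNU coreutils and GNU grep, since the long option names are GNU extensions, and the function name `top_words_long` is invented for the example.

        ```shell
        # top_words_long FILE K: the K most frequent words in FILE, with counts.
        # Hypothetical helper; the long options assume GNU tr, grep, sort, uniq, head.
        # One stage per line: split into words, fold case, drop empty lines,
        # count duplicates, order by count (descending) then word, keep K lines.
        top_words_long() {
            tr --complement --squeeze-repeats '[:alpha:]' '\n' < "$1" \
                | tr '[:upper:]' '[:lower:]' \
                | grep --invert-match '^$' \
                | sort \
                | uniq --count \
                | sort --key=1,1nr --key=2 \
                | head --lines="$2"
        }
        ```

        Each stage sits on its own line, so a stage can be removed or reordered without re-wrapping the whole command.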

        When I do need quick solutions done in a shell, I more often than not reach for awk first and try to solve my problem entirely in one invocation of awk. Only if the solution is becoming too verbose do I usually start piping it to more tools. For instance, to solve Bentley’s exercise:

        man gawk | awk -v count=7 '{
            for (i=1; i<=NF; ++i) {
                gsub(/^[[:punct:]]*/, "", $i)
                gsub(/[[:punct:]]*$/, "", $i)
                if ($i == "") break
                words[$i]++
            }
        }
        
        END {
            biggest = -1
            for (i=1; i<=count; ++i) {
                for (word in words) {
                    if (words[word] > words[biggest])
                        biggest = word
                }
                if (words[biggest] == "") exit 0
                print biggest, words[biggest]
                delete words[biggest]
            }
        }'
        
        the 752
        of 319
        is 275
        a 268
        to 248
        and 202
        in 194
        

        Obviously this isn’t the most efficient solution, and it is a bit more verbose than I’d like. In practice, I generally don’t see problems where I’m required to implement a bunch of strange requirements like those outlined in Bentley’s exercise in a shell environment.
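
        One possible way to trim the verbosity, sketched under the assumption that piping to `sort` and `head` is acceptable: let awk do only the counting and hand the ordering to `sort`, which avoids rescanning the array once per result. The name `count_words` is hypothetical. Note one deliberate difference from the script above: a field that is empty after stripping punctuation is skipped with `continue` rather than ending the line with `break`, so counts may differ slightly.

        ```shell
        # count_words FILE K: the K most frequent whitespace-separated words in FILE.
        # Hypothetical helper; case-sensitive, like the awk script above.
        count_words() {
            awk '{
                for (i = 1; i <= NF; ++i) {
                    word = $i
                    # Strip leading and trailing punctuation from the field.
                    gsub(/^[[:punct:]]+|[[:punct:]]+$/, "", word)
                    # Skip punctuation-only fields but keep reading the line.
                    if (word == "") continue
                    words[word]++
                }
            }
            END {
                # Emit "word count" pairs in arbitrary order; sort does the ranking.
                for (word in words) print word, words[word]
            }' "$1" \
                | sort -k2,2nr -k1,1 \
                | head -n "$2"
        }
        ```

        Something like `count_words some_file.txt 7` would print lines in the same `word count` shape as the output above.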

        Ultimately, shells are fragile and prone to misuse; I think people should carefully consider when to jump from hacking away in a shell to implementing a correct solution in a general-purpose language like awk, Python, JavaScript, etc.