1. 30
  2. 9

    I’m glad somebody wrote this up, because I feel like everybody who works with big data learns these lessons independently (sometimes multiple times within the same organization). If you teach CS, please make your students read this before they graduate.

    Understanding these lessons is basically why the office I work at is so much more productive than the rest of the company we’re a part of: there’s been an effort to get incoming devs to understand that, in most cases, it’s faster, cheaper, & easier to use Unix shell tools to process large data sets than to use fancy hypebeast toolchains like Hadoop.

    There are a couple of things this essay doesn’t mention that would probably speed up processing substantially. One is setting LC_ALL=C: if you force the locale to C, tools like sort, grep, and awk compare raw bytes instead of doing locale-aware collation and multibyte character handling, which often speeds things up a lot. Another is that if you are using GNU awk, there’s support for running commands and piping to them internally (including two-way pipes and network connections), which means that downloads can actually be done inside AWK and posts can be done there too. That allows you to open multiple input and output streams and switch between them in a single batch, avoiding some merge steps. Also, one might want to use xargs instead of GNU parallel, because xargs is a more mature tool & one that’s available on basically all Unix machines out of the box.
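A rough sketch of those three tips on toy data, not the essay's actual pipeline. The file names and values are made up for illustration, and only plain awk redirection is used, so nothing here needs GNU awk. Note that `xargs -P` is not in POSIX, though GNU, BSD, and BusyBox xargs all support it.

```shell
# Hypothetical sample data in a scratch directory; names and values are made up.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'b 2\na 10\na 4\nc 7\n' > events.txt

# Tip 1: with LC_ALL=C, sort compares raw bytes, so uppercase sorts
# before lowercase and no locale-aware collation is done.
byte_sorted=$(printf 'b\nB\na\n' | LC_ALL=C sort)

# Tip 2: one awk pass writing to several output streams at once:
# a file per key, plus a pipe into an external command for the totals.
awk '
  BEGIN { totals = "LC_ALL=C sort > totals.txt" }
  { print $2 > ($1 ".part"); sum[$1] += $2 }   # one output file per key
  END {
    for (k in sum) print k, sum[k] | totals    # stream totals into sort
    close(totals)                              # wait for sort to finish
  }
' events.txt

# Tip 3: xargs as a simple parallel runner, two jobs at a time.
# Each job sorts one partition file numerically.
printf '%s\n' a.part b.part c.part |
  xargs -P 2 -n 1 sh -c 'LC_ALL=C sort -n "$1" > "$1.sorted"' _

cat totals.txt
```

With only a handful of keys, leaving every `.part` file open is fine; with many keys you would want to `close()` files as you go to stay under the open-file limit.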

    1. 4

      One thing I found particularly useful about this post (not evident from the title, but constitutes the first half) is specifics about how the Big Data Science Toolchains can fail, in this case Apache Spark, even when the author tried a bunch of the obvious and less-obvious fixes.

      The biggest win here seems to be not necessarily the raw processing time due to low-level optimizations in awk, but more big-picture algorithmic wins from “manually” controlling data locality, where Spark didn’t do the right thing automatically, and couldn’t be persuaded to do the right thing less automatically.
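One possible reading of “manually” controlling data locality, sketched on made-up data and not taken from the post itself: sort by key first, so that all records for a key sit next to each other and the aggregation step only ever holds one running total in memory.

```shell
# Hypothetical input: "key value" pairs in arbitrary order.
records=$(printf 'b 2\na 10\nb 5\na 4\nc 7\n')

# Sorting by key makes same-key records adjacent, so awk can emit each
# total as soon as the key changes instead of keeping a table of all keys.
totals=$(printf '%s\n' "$records" | LC_ALL=C sort -k1,1 | awk '
  NR > 1 && $1 != key { print key, sum; sum = 0 }  # key changed: flush total
  { key = $1; sum += $2 }                          # accumulate current key
  END { if (NR > 0) print key, sum }               # flush the last key
')

printf '%s\n' "$totals"
```

The tradeoff is paying for a sort up front, but sort spills to disk on its own, so the pipeline keeps working when the set of keys no longer fits in memory.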

      1. 3

        Yeah I’ve personally run into exactly this kind of slowness with R (and Python to a lesser extent), and fixed it with shell. I love R but it can be very slow.

        That’s part of the reason I’m working on Oil. Shell is still useful and relevant but a lot of people are reluctant to learn it.

        I posted this in another thread, but it is good to eyeball your computations with “numbers every programmer should know”:

        https://gist.github.com/jboner/2841832

        https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

        In particular most “parsing” is linear time, so my rule is that you want to be within 2x-10x of the hardware’s theoretical speed. With certain tools you will be more in the 100-1000x range, and then it’s time to use a different tool, probably to cut down the data first. Hardware is cheap but waiting for data pipelines costs human time.
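The rule of thumb above can be turned into quick back-of-the-envelope arithmetic. This is only a sketch: the 10 GB input size is a hypothetical example, and the figure of roughly 1 ms per MB for sequential SSD reads is taken from the latency lists linked above and will differ on real hardware.

```shell
# Assumed numbers, not measurements.
data_mb=10000        # hypothetical input size: 10 GB
ssd_ms_per_mb=1      # assumed cost of reading 1 MB sequentially from SSD

# Hardware floor: time just to read the data once, in seconds.
floor_s=$(( data_mb * ssd_ms_per_mb / 1000 ))

# Target range for a linear-time job: within 2x-10x of the floor.
budget_lo_s=$(( floor_s * 2 ))
budget_hi_s=$(( floor_s * 10 ))

# At 100x the floor or worse, it is probably time for a different tool.
alarm_s=$(( floor_s * 100 ))

echo "floor: ${floor_s}s, target: ${budget_lo_s}-${budget_hi_s}s, rethink above: ${alarm_s}s"
```

Comparing a pipeline's wall-clock time (for example from `time`) against these numbers gives a quick sense of whether the tool or the hardware is the bottleneck.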

        1. 4

          When I was working in my first lab I did exactly the same – moved an existing computational biology pipeline off of R to AWK, lots of shell plumbing, GNU Parallel, and a Flask front-end server which submitted jobs in GridEngine. Brought runtime down from 40 minutes to about 30 seconds for one genome. R is nice but can be slow (also, it was just a prototype.)

          The pivotal lesson I learned was to embrace the battle-tested technologies in the shell stack and everything Unix instead of fancy-pants modern stacks and tools on top of Hadoop, Spark and others. “Someone probably solved your problem in the 80s” from the author rings absolutely true.

          Others call it Taco Bell programming.

          1. 2

            Taco Bell programming is amazing. I’m saddened by the fact that it has become quite esoteric, and I wish this knowledge were more widespread in the industry.

        2. 3

          Have them read “The Treacherous Optimization” which is all about how GNU grep is so fast: grep is important for its own sake, of course, but the point is that these tools have had decades of work poured into them, even the relatively new GNU tools which postdate the Classic Unix codebases.

          It’s also an interesting introduction to code optimization and engineering tradeoffs, that is, cases where multiple decisions are defensible because none of them is absolutely perfect.

          1. 1

            You must be very glad. 3 identical comments 😅

            1. 1

              Just a glitch. My mouse’s debounce doesn’t work properly, and lobste.rs doesn’t properly deduplicate requests, so when I click post it sometimes emits several duplicate requests which the server treats as duplicate comments (even though they come from the same form).

              There was a patch applied for this a year ago, but either it didn’t work or it never migrated from the git repo to the live version of the site.

          2. 1

            https://archive.is/j1mVb

            is the cached link, to help someone avoid a single click :P.
