1. 53

  2. 18

    I can’t say I support the author’s wacky idea of “Unix Zen” as a system development strategy, but I do think that some of the new “cool” ways of doing things are leading under-informed developers in the wrong direction. All too often I have fellow developers suggest a solution to a simple task that somehow requires multiple EC2 instances, a dozen new libraries with their own DSLs, and/or completely changing our deployment strategies and back-end tools. If the author had stuck with this nugget from his conclusion:

    You can minimize risk by using the well-proven tool set

    I’d have backed him up 100%.

    1. 27

      The problem is drastically underspecified:

      Here’s a concrete example: suppose you have millions of web pages that you want to download and save to disk for later processing.

      Are you doing this once (reading material for the plane)? A few times a year (archive.org)? As often as you can (google)?

      Do you care about getting them all, or is it okay if the ones that produce errors at the moment the scraper hits them are missed entirely? (That’s one of several ways the logging concern becomes relevant.)

      What if some pages produce errors that have 200 response codes? What if one of the pages is trying to respond with as many digits of pi as you are willing to listen to? What if one site takes coordinates in the Mandelbrot set as URL parameters, and returns a text rendering of the relevant area, so that it has infinitely many pages, although each one is small? (Things like all of these exist.)

      Where are you storing these files? Do you have a good estimate for how large they’ll be, or is there a chance you’re going to run out of space 1% through?

      What is the “later processing”? If you’re running a word concordance and then discarding the full-texts, your output is going to be orders of magnitude smaller than your input; why are you storing anything instead of processing it immediately?

      If you expect your job to take a month to finish, and your machine crashes 24 days in, do you have a way to resume where you left off?

      What’s the budget for this? Network, CPU, and disk space aren’t free.

      I can imagine an objection: None of this was in the original problem description; if you need to deal with those cases, you have a different problem and Hadoop might well be a good tool for that. Fine, but the author didn’t consider any of these questions - and nobody actually wants to download millions of pages without some answer to them. If it had said “thousands of pages”, I’d give it a pass.

      I sympathize with the desire to create software out of small composable tools, though one has to read into it a lot to get that from this article. But this is a task that’s too big for off-the-shelf tools, in any event.

      I would add that Hadoop and things like it are frameworks to allow developers to create their own composable tools. The relevant comparison isn’t between Hadoop itself and the one-liner; it’s between Hadoop and the bash shell. If we assume some other problem that can actually be solved by both, I’ll take the one with fewer undocumented, hard-to-debug parsing subtleties, thanks.

      1. 8

        I don’t think Hadoop or rabbitmq or whatever are that much help. From observation, some people definitely write message queue crawlers that look like this:

            # roughly: pop a URL, and on anything but a 200 just put it straight back
            while queue:
                url = queue.get()
                if fetch(url).status != 200:
                    queue.put(url)   # requeued forever; hammers a dead page in a tight loop
        

        I’d rather have the xargs shotgun crawler hit my site than the “smart” crawler that eventually pounds some page that doesn’t exist in a tight loop.

        1. 1

          That’s a pretty fair point. :)

          Luckily, this isn’t a problem that anyone actually tries at this scale without having the expertise to do it properly!

        2. 8

          The trick is that, with those small tools, it’s okay that the problem is underspecified.

          A good bash jockey can produce five or ten prototypes before the first cut of the request for specifications document is out the door.

          I appreciate your desire for more detailed specs, but they are not needed in the initial phases of figuring out if something is worth doing.

          1. 5

            Running the shell version for a day or two and seeing how far it’s gotten is definitely a good start at producing time and space estimates for the real thing, since this is always going to be network-bound, even after being completely reimplemented.

            Other than that, I don’t see that the shell version is even a valid prototype here. There is no way to evolve it to address any of these concerns.

            1. 9

              There is no way to evolve it to address any of these concerns.

              This is plainly incorrect.

              Are you doing this once (reading material for the plane)? A few times a year (archive.org)? As often as you can (google)?

              Cron job and manual invocation cover this range.
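
              For the “few times a year” case, the whole scheduling story is one crontab line (paths made up); the one-off case is just running the script by hand.

                  # re-run the crawl at 03:00 on the first of every month
                  0 3 1 * * /home/me/crawl.sh >> /home/me/crawl.log 2>&1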

              What if some pages produce errors that have 200 response codes?

              That’s garbage data in, garbage data out. Not the script’s problem, and if it is, we can grep and test and skip.

              What if one of the pages is trying to respond with as many digits of pi as you are willing to listen to? <snip similar concerns>

              You can set timeouts and retry counts on wget (for example), so this need not kill a script. The Mandelbrot example is obtuse, and why the hell would you scrape it knowingly?
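
              For instance, a sketch (standard GNU wget and coreutils flags; the numbers are arbitrary):

                  # bound each fetch: 2 tries, 30s of network stall, 60s of wall time total
                  < urls.txt xargs -n 1 -P 16 timeout 60 wget --timeout=30 --tries=2 -x -q

              The wall-clock cap is what saves you from the never-ending pi page.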

              Where are you storing these files? Do you have a good estimate for how large they’ll be, or is there a chance you’re going to run out of space 1% through?

              This is typically handled through environment variables or command-line arguments. It doesn’t matter whether the estimate is good or even existent; storage is cheap, and I’d rather waste time rerunning the script with a larger HDD than deadlining a developer for a more “rigorous” solution.
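
              In other words, nothing fancier than this (names and the default path are invented):

                  # caller picks the destination disk/dir at run time
                  DEST="${1:-${CRAWL_DIR:-/data/pages}}"
                  mkdir -p "$DEST" && cd "$DEST" || exit 1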

              What is the “later processing”? If you’re running a word concordance and then discarding the full-texts, your output is going to be orders of magnitude smaller than your input; why are you storing anything instead of processing it immediately?

              Same answer as before: it doesn’t matter whether we throw away 99% of the results if it means that we don’t waste developer time.

              If you expect your job to take a month to finish, and your machine crashes 24 days in, do you have a way to resume where you left off?

              Totally addressable with (hacky) directory structures, timestamps, and any number of other workarounds. Also, what are you doing that actually takes a full month of machine time (if you aren’t Google)?
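
              E.g., two of the hacks in question (done.txt and $url stand in for whatever bookkeeping the script already has):

                  # 1) let wget skip anything already on disk from the previous run
                  wget --no-clobber -x -q "$url"
                  # 2) or trim the worklist against what's already been fetched
                  comm -23 <(sort urls.txt) <(sort done.txt) > remaining.txt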

              What’s the budget for this? Network, CPU, and disk space aren’t free.

              Agreed that they’re not free but, for the class of problems that can be solved with a shell script, they’re a damned sight cheaper than an FTE.

              ~

              Worse is better.

              http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/

              1. 8

                The case I’m making is “the script doesn’t solve the problem, because this is a much harder problem than the author realizes, even if one very carefully constructs the easiest possible case”. Your points appear to be assuming that the author really meant a smaller, simpler problem. I think a lot of them are irrelevant because they were coming from a different premise, but the ones that remain…

                You can set timeouts and retry counts on wget (for example), so this need not kill a script. The mandelbrot example is obtuse, and why the hell would you scrape it knowingly?

                Of course you wouldn’t scrape it knowingly. You’ve vetted “millions of pages” by hand to make sure they contain nothing like that? That’s at least tens of thousands of sites, probably hundreds of thousands (the average site is very few pages).

                Totally addressable with (hacky) directory structures, timestamps, and any number of other workarounds. Also, what are you doing that actually takes a full month of machine time (if you aren’t Google)?

                Downloading 0.1% of the web, according to the problem description.

                http://www.worldwidewebsize.com/ estimates the full web is 4.7B pages. A tenth of a percent of it is still an enormous amount of data. Most of the wall time is going to be waiting on network responses, of course, so doing this as cheaply as possible would probably involve using many machines with excellent downlinks but not much CPU.
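
                Back of the envelope, with guessed numbers: 0.1% of 4.7B is about 4.7 million pages; at, say, 50 KB a page that’s roughly 235 GB on disk, and at a sustained 2 pages per second it’s about 27 days of wall time on a single machine.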

                Agreed that they’re not free but, for the class of problems that can be solved with a shell script, they’re a damned sight cheaper than an FTE.

                This isn’t a problem that can be solved with a shell script. A much, much smaller version of it could be.

                1. 4

                  There is a simply astounding number of problems that are of tangible value to solve and that are well within shell-script distance.

                  There are far fewer that are important and that require genuine programming and (especially in the manner of the times) complicated systems engineering.

                  Remember, the choice here isn’t between a command-line pipeline and a quick Python or Haskell script…it’s between simple use of standard ‘nix utilities and large, over-designed, heavyweight software projects to solve the same problems.

                  1. 5

                    I very much agree that it’s better to reuse existing tools when they are relevant.

                    I was never arguing with that.

                  2. 2

                    The case I’m making is “the script doesn’t solve the problem, because this is a much harder problem than the author realizes, even if one very carefully constructs the easiest possible case”.

                    This is possible, but the converse is also very often true: There are a huge number of cases where a script trivially solves the problem, but people seem to insist on spending days developing very complex code to solve them.

                    1. 1

                      There certainly are some. I suspect how common they are depends on the types of problems you work on. In areas where they aren’t common, we should be focusing on creating shell-like composition systems that can pay such dividends in future.

            2. 2

              I would add that Hadoop and things like it are frameworks to allow developers to create their own composable tools.

              Yes, and no. An example: one problem we’re running into in my current job is that we want to process the data in our database for analytics. We have the database, we have the database source code, and we have Hadoop. Unfortunately, the database source code is almost entirely useless here because of HDFS. Hadoop has made it almost impossible for us to use the existing source code to work with the database.

              I’m not just saying that Hadoop sucks, but also that it falls into this, very common IMO, failure of frameworks (and why so many people hate frameworks): do it inside my house by my rules or GTFO. I cannot go from my little prototype to a Hadoop solution, I have to reimplement my solution in Hadoop. Wanting a Hadoop job as a step in my existing job requires a whole new level of abstraction that is different than the one I’m currently using. Contrast this with a more “Unix zen” solution: Manta from Joyent. It takes the abstraction of a shell script and lifts it up so you can use the same abstraction, now distributed over tons of machines and more data than can fit on a single machine.

              Shell scripts probably are not the answer in the long run, but, IMO, a lot of frameworks make it near impossible to have any hybrid solution. One has to fully buy in or stay completely away. The more Java I have to write, the more time I have to spend getting the whole thing to work, because one little portion of it is broken and I cannot just outsource that work to something else for now. Everything becomes so large and complicated because it has to do everything, because it won’t let one do work outside of it. I think that is the real takeaway from small composable units: make the OS your production environment, not just this thing that happens to run it.

            3. 10

              As always, this argument ignores maintainability. Not much in the way of logging in his string of piped commands.

              1. 17

                That’s … an interesting objection. I don’t see how 4k lines of code is inherently more maintainable than one line of code.

                If logging, for example, is really a concern, turn the one-liner into a five-liner (or whatnot), and involve utilities like tee. These are problems that have, for the most part, been thought through over the past 4-5 decades.
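
                E.g., a sketch of what that five-liner might look like (the log file names are invented):

                    # same pipeline, plus a record of what was attempted and how each fetch went
                    < urls.txt tee attempted.log \
                      | xargs -n 1 -P 16 wget -x -nv --append-output=fetch.log
                    # parallel fetches share fetch.log, so lines may interleave; fine as an audit trail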

                If your problem is beyond shell utilities (or evolves to that point), then by all means write code.

              2. 10

                I have far more faith in xargs than I do in Hadoop.

                I don’t. xargs does complex parsing and escaping in C because it has to use strings to communicate with other processes.

                “10 lines of shell script” can be completely unmaintainable; shell has mutability everywhere, operations that change behaviour based on random environment variables and also config files scattered across the system, no unit tests, and about one semi-usable data structure available (the array). The deployment model is probably screen, which is unmaintained and (in my experience maintaining a production system that used it) hangs every so often for no reason. As a programming language it’s roughly on the level of PHP.

                If your programming language can’t do the same thing as those 10 lines of shell in 5 readable lines then you have a problem, sure. Too many languages make it hard to invoke an external program, and too many utilities don’t offer their functionality as a machine-usable library. But that doesn’t mean you have to use shell; in the worst case TCL offers all the functionality of shell but also has the power of a real first-class programming language.

                1. 12

                  “I don’t. xargs does complex parsing and escaping in C because it has to use strings to communicate with other processes.”

                  Hadoop has 1.9 million lines of code (https://www.openhub.net/p/Hadoop), including many, many parsers.

                  While it’s not the case that ∀x∀y x simpler-than y → x better-than y, it is a very strong implication that doesn’t deserve a ‘both sides do it’ dismissal.

                  1. [Comment removed by author]

                    1. 3

                      I’ll fight you for that cookie. The invariants hard-guaranteed by (the coq/idris/agda equivalent of) (1.9 million lines of java) with correctness proofs could be super valuable for certain use cases.

                      1. [Comment removed by author]

                        1. 1

                          I’ll agree with that one, for sure, given that there have been formally proven pieces of software that have been found to have bugs which had correlated bugs in the proof. Let’s split the cookie.

                      2. 1

                        You’re almost certainly not the biggest fan of formal verification here. I just tend not to bring it up in discussions about the short-term future, because… yeah. :)

                      3. 1

                        I don’t dismiss xargs, but it very much has problems. In addition to lmm’s remark, I’d add that it has portability issues.

                        1. 12

                          I’m much more concerned about the portability of Hadoop than of xargs. There are three computers in just my living room that have at least a basic version of xargs but for which there’s no JVM.

                          1. 4

                            I’d be more concerned about having a “basic” version of xargs that might silently behave differently. The JVM is much more fail-fast - either you have one, in which case the program will reliably behave the same (yes there are platform-specific JVM bugs, but they’re exceedingly rare), or you don’t.

                            (And even my old SGI box has a JVM, so I’m curious what computers these are)

                            1. 4

                              Mostly “second-tier” OpenBSD machines, though that includes a 64-way, 32 GB SPARC T2 which, ironically, was almost specifically designed to run massively parallel programs like Hadoop. OpenBSD has a JVM for a few platforms, but not all, and it’s a nontrivial undertaking to bring up a new one. I don’t have plans to run Hadoop on such machines (another is a BeagleBone, which could maybe run the JVM, but piss-poorly), so this isn’t a real concern.

                              Portability of software is crazy sometimes. A tiny step away from “mainstream” results in a precipitous drop in available software, even though it’s just another flavor of the same underlying design.

                              1. 1

                                a 64-way 32GB T2 sparc which, ironically, was almost specifically designed to run massive parallel programs like hadoop

                                Well presumably it was designed to run Solaris, under which I’d imagine it will run Hadoop very well.

                                1. 1

                                  Very true, but I can’t afford that… ;)

                          2. 7

                            ∀x has-problems x. But who cares? The question is, is there a class of x for which it is generally true that x is better-than y, where ‘better-than’ covers the broad spectrum of understood value including developmental, operational, time and cost concerns, and what are the characteristics of that class? And the answer is, that the quality of being simpler seems to map very well to that class as a rough rule of thumb.

                            This is a great, underappreciated, and important finding that the current computing generation is completely missing. Let’s talk about that!

                            Who cares about xargs.

                            And even if you do, then include GNU xargs as a dependency, call it explicitly, and your portability issues are solved.
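
                            A sketch of what “call it explicitly” can look like (gxargs being the usual package name for GNU xargs on the BSDs and macOS):

                                # prefer GNU xargs if it's installed under a different name, else fall back
                                XARGS=$(command -v gxargs || command -v xargs)
                                < urls.txt "$XARGS" -n 1 -P 16 wget -x -q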

                            1. 8

                              This is a great, underappreciated, and important finding that the current computing generation is completely missing. Let’s talk about that!

                              Hear, hear!

                              There is little emphasis these days on simple systems.

                              My theory is that this is due to the current tech climate: on the one side, you’ve got enterprise software where large baroque platforms are the norm and consultants and architects drive things along…and on the other side, a lot of startups who are told by good marketing and VC money to use tools that promise speed of development while discouraging simplicity.

                              If you are being told to use Amazon for everything, it of course makes sense to use tools that automate that away, and if you are using those tools, it makes sense to use other tools, and so forth and so on until you are on a trembling tower of complexity.

                              1. 7

                                Yes, exactly. We’re now in a situation where some of my clients feel like they have to use Kafka for queuing because it’s the standard that all the big data companies use, but on AWS, which means they’re running Zookeeper on AWS, which means their cluster falls over and requires manual intervention several times a year, and they don’t have distributed systems experts because they thought AWS means you don’t need ops or specialist devs. It’s a “great” situation for me, but there’s this incredible disconnect between conference reality and practical reality that must be costing the tech industry billions a year right now.

                                The intriguing question is whether/how this hole can ever be dug out of.

                                1. 2

                                  disconnect between conference reality and practical reality

                                  Upvote for “disconnect between conference reality and practical reality,” thanks for that.

                              2. 2

                                I agree with that proposition, and I think anyone who’s willing to engage in discussion about it at all probably agrees. I don’t think it’s under-appreciated, I just think there are a variety of reasons why we keep getting complicated solutions anyway. I think you do younger programmers a disservice to say they’re missing it, and I’m not sure offhand where that idea comes from.

                                1. 8

                                  Please do not put words in my mouth. I didn’t say younger; I said computing generation, by which I meant all participants in the current generation-of-computing, some of whom are old, some of whom are young.

                                  And they’re missing it by simple inspection; see, for example, posts in this very forum about using docker to distribute command-line tools, or about node as an operating system. See https://github.com/erikras/react-redux-universal-hot-example. See every node framework. There are innumerable cases where someone has decided to try to replace the foundations without understanding them first (e.g., systemd, mongodb, node.js), or has added to a pile of complexity by introducing still more complexity. There will be fifty new posts today on HN continuing to add to this nonsense.

                                  1. 7

                                    Sorry about the misunderstanding. I couldn’t think what else that would mean - I guess everyone developing today, including yourself? I suppose exactly how you’re defining it isn’t important. (Edit: I realized that you appear to mean technologies developed since the bubble ended in 2000, since that was all of your examples. Fair enough.)

                                    But you’ve given a more interesting thing to talk about, so let’s. :)

                                    As a general remark, I reiterate that I don’t think contemporary programmers categorically under-value simplicity - though that boilerplate repo is not one I’d personally defend, given how it provides so many alternatives that do essentially the same thing - but I do think it’s getting increasingly difficult to actually deliver it.

                                    As a disconnected point before I jump into what I really want to say, Mongo is motivated entirely by a perception that SQL is too complicated. It’s right… for a fraction of cases. SQL solves some very real, essential problems, and the typical workarounds to deal with what’s lost by using a non-transactional key-value store are pushing enormous complexity to the downstream software. But at least the motivation is simplicity, even though the idea that this is simpler is a profound misunderstanding. At least I haven’t heard people say “NoSQL” as if it were a self-explanatory goal, recently.

                                    I think the point I actually feel strongly about is regarding the “replace the foundations” cases, in which I include “node as an OS”. I see the motivation of every one of these examples as being simplicity. The result does generally wind up being at least as complicated as the original, not least because, as you say, many of these are reactionary projects specifically seeking to get rid of old things because nobody today understands them. I vaguely recall some nice discussions on lobste.rs in the past about how important it is to understand an existing solution before replacing it, which is an ideal I completely agree with.

                                    Unfortunately, our old tools are also complicated, and badly documented! Systemd is a replacement for init.d or rc.d or whatever variant you want to name. It’s a contentious topic, so I don’t really want to start a fight over my belief that existing init systems have been impenetrable to outsiders for at least a decade, but I would at least like to say that there are legitimate differences of opinion over whether what we have is simple or not.

                                    And… sometimes people try to replace the foundations without understanding them, because the foundations are too opaque to understand. Sometimes they do exploratory work to understand the problem-space, and what they come up with turns out to address most of the scenarios they care about, so they run with it. Sometimes the new stripped-down thing even does really address all the scenarios that exist today. The thing about thirty-year-old code is that inevitably at least some of it is dealing with problems that haven’t existed for twenty years, and there isn’t necessarily anyone to ask who understands it well enough to figure out what those ever were.

                                    I’d like to add, also, that I don’t see node frameworks as excessively complicated. The motivations for them are perhaps not obvious, since indeed they’re often promoted to an audience who’s never used anything else, and I do find that sort of promotion to be a questionable practice. But I’ve certainly seen good explanations of what these frameworks are meant to help with, and how they do, and actually I think even the complicated ones like Ember and Angular are solving real problems. Any sort of library is trying to handle complexity so its users don’t have to. Not all potential users actually need that complexity, but for those who do, it has to live somewhere. Putting it in a library lets people focus on their immediate problems, and also makes it easier to work on someone else’s code.

                                    Or, at least, that’s the ideal. I know of many counterexamples, it’s just that you really seem to be making a very absolute statement, and I don’t think that’s warranted.

                                    Elsewhere in this thread, people have mentioned “composable” as well as “simple”. Those definitely don’t imply each other, and are both important. I feel like there are a few other architectural virtues that it would be easy to agree on, but that also seems like a tangent. Most of this discussion applies equally well to any virtue you pick.

                                    Old doesn’t automatically mean simpler. New doesn’t automatically mean simpler. And simpler is only good most of the time. And whether a solution is as simple as it could be isn’t obvious unless we also take the time to understand the problem.

                                    Please definitely feel free to disagree with any of the above. It’s worth talking through.

                                    1. 3

                                      Lobsters (and maybe the current web) makes it hard to have a long, literate, discursive conversation, so I can’t do proper justice to your remarks; not out of lack of respect, but out of technical and time limitation.

                                      That said, it sounds like we end up in agreement. I agree that old doesn’t automatically mean simpler (e.g., CORBA). I agree that new doesn’t automatically mean simpler. And I agree that simpler is “only” good most of the time, and that we have difficulty getting to simplicity without understanding the problem. The key for me isn’t old or new, but that simpler is good most of the time. And, that the current culture of programmers does not seem to recognize this fact, as memorialized by, e.g., HN, docker, systemd, etc.

                                      Thanks for the thoughtful discussion.

                                      1. 2

                                        That’s all quite fair. Thank you likewise.

                            2. 0

                              Hadoop has 1.9 million lines of code (https://www.openhub.net/p/Hadoop), including many, many parsers.

                              Sure, but how many of them are in a non-memory-safe language?

                              it is a very strong implication that doesn’t deserve a ‘both sides do it’ dismissal.

                              I’m not claiming “both sides do it”, I don’t know where you’re getting that from. I literally trust Hadoop more than I do xargs.

                              1. 8

                                Sure, but how many of them are in a non-memory-safe language?

                                You know, it is possible to write correct C. People often seem to forget this.

                                1. 3

                                  Possible and likely are two very different things. It’s possible to walk across a tight-rope hundreds of feet in the air but I sure as hell don’t want to try it.

                                  1. 4

                                    And yet some people manage to do it every single day.

                                    1. 1

                                      How do they know?

                                      Serious question. The first example of C I can think of that is fairly clearly mostly-correct is sqlite, which has a very rigorous test suite. And even that had some crashes that went undiscovered for a long time.

                                      1. 3

                                        For sufficient values of ‘a long time’, software can be considered correct enough. SQLite and Redis are both written in C, and receive, across the entire installed base, easily billions and possibly trillions of operations per second each. I don’t remember ever seeing a Redis crash that wasn’t caused by something else, and SQLite’s crashes were only revealed by a fuzzer generating input not found in the wild.

                                      2. 1

                                        Some people, yes. And for the rest?

                                  2. 6

                                    This has the makings of an interesting challenge. You try to find an input that breaks xargs. I try to find an input that breaks hadoop. First breaker wins. :)

                                    1. 1

                                      If we’re comparing equivalent systems (i.e. two implementations of the problem given in the article, one using xargs and one using Hadoop), and we both have to find a true security vulnerability (i.e. not just a DoS), then I’d be up for that. If you’re just trying to find any input that breaks some Hadoop system under some circumstance, then sure, you’ll find it quicker, because Hadoop does so much more; but that’s not the question that matters when choosing which one to use to solve a given problem.

                              2. 4

                                I’m not really sure that “..if I had only manned up and broken out sed, but I pussied out and wrote some Python.” is appropriate to say, period, let alone as a subtext of UNIX culture. In summation, this author is a meatball.