Eh, I’m not sure if this is nitpicking or not, but the example isn’t targeting the OS. It’s targeting tools that are common on the OS: ssh, xargs, sort, uniq etc. In that light, those tools aren’t really any different from any other tool, including custom code and whole platforms like hadoop.
That’s not to say there isn’t validity in using simple tools when they work. You don’t always need a hammer. But it’s not really “targeting the OS”, it’s targeting “simple tools commonly found on the OS”. But maybe I’m just splitting hairs :)
And the debate “what is the operating system” continues. For some, it’s just the kernel. For others, it’s the kernel + user land. If it’s the kernel + user land, then POSIX is a valid “OS”, as it contains most of the utilities you question (ssh being the exception).
The trouble is that the OS - when that OS is unix - is very much oriented towards passing around streams of bytes and reinterpreting them willy-nilly. Using the OS is exactly how Shellshock happened, and to continue to use it this way is to invite similar vulnerabilities in the future.
I could have solved it in Python, figuring out a library for doing SSH, hoping it’s async or using a bunch of threads and using in-memory data structures to store the data. But parallelizing the whole thing was a matter of adding -P10 to the xargs command, almost too easy. I had some edge cases in what the data looked like, so having the data from each phase on disk made debugging and doing the next phase easy. By embracing the OS, I actually got a smoother experience than I likely would have otherwise.
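For comparison, here is a rough sketch of what the Python route described above might look like: a thread pool standing in for xargs -P10, with each item's output written to disk so a later phase (or a debugging session) can read it. The names run_phase and demo_command are made up for illustration, the demo command is a stand-in for something like an ssh invocation, and items are assumed to be simple names that are safe to use as file names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_phase(items, command_for, out_dir, workers=10):
    """Run one command per item, at most `workers` at a time.

    Each command's stdout is saved to <out_dir>/<item>.out so the next
    phase can pick it up from disk. Returns {item: exit code}.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    def run_one(item):
        # Argument list, no shell: the item is never re-parsed as shell syntax.
        result = subprocess.run(command_for(item), capture_output=True, text=True)
        # Assumption: item is a plain name (e.g. a hostname) usable as a file name.
        (out_dir / f"{item}.out").write_text(result.stdout)
        return item, result.returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run_one, items))


def demo_command(host):
    # Hypothetical stand-in for a real remote command such as ["ssh", host, "uptime"].
    return [sys.executable, "-c", "import sys; print('hello from', sys.argv[1])", host]
```

Calling run_phase(["web1", "web2"], demo_command, "phase1") would leave phase1/web1.out and phase1/web2.out behind. It works, but it is noticeably more ceremony than adding -P10 to a pipeline.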
The composability of unix utilities is to be applauded. This experience is what every language should be aiming for. But there’s no reason you can’t accomplish this in a safer language. In Scala I’d get close - parallelizing would just be a matter of dropping in a .par, I could handle edge cases in the REPL. No doubt Haskell or others could get closer. It’s possible to do this without sacrificing safety.
What do you mean by that? In my view bash != OS, bash is just a crappy shell that uses a crappy scripting language and that was used in a crappy way to run CGI programs. Most of the shellshock impact was derived from poor practices, not from using simple programs that take advantage of the OS functionality.
You will run any program in the OS anyway, I think the merit of the proposed solution is that it composes from simple pieces rather than building a unique, but larger and more complicated one.
But how else does the composition work? The article talks about composing things like xargs, sort, uniq, with ssh for distributing the work. That puts you solidly in shell-scripting territory. At the very least, in territory where you have to be careful about the finer points of POSIX rules about quoting, record terminators, etc.
Composition works by:
connecting input and output streams of components
using files for persistency
That is often done with shell scripts, but is not mandatory. I used that approach a lot using erlang ports, then erlang acts as orchestrator. You could still launch the processes you compose using a shell (per process), which is typically done just as an easy way to control file descriptor redirections, but quoting would only be a problem for the shell arguments. You can even avoid that by execing instead of using a parent shell.
I don’t think that is specific to Erlang. You can use the same idea with popen, for example, though I recall some popen implementations being a bit quirky, and Erlang’s concurrency approach makes it more natural to compose several small tools into higher-level applications.
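As a minimal sketch of the popen-style approach, assuming Python as the orchestrator and that sort and uniq are on the PATH: the two processes are exec'd directly from argument lists, so there is no parent shell and no quoting to get wrong, and a file is used for persistence. The function name count_duplicates is only for illustration.

```python
import subprocess


def count_duplicates(in_path, out_path):
    """Rough equivalent of `sort < in_path | uniq -c > out_path`, with no shell."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        # Each stage is started from an argument list, never a command string.
        sort = subprocess.Popen(["sort"], stdin=src, stdout=subprocess.PIPE)
        uniq = subprocess.Popen(["uniq", "-c"], stdin=sort.stdout, stdout=dst)
        # Drop our copy of the pipe so sort sees SIGPIPE if uniq exits early.
        sort.stdout.close()
        uniq_code = uniq.wait()
        sort_code = sort.wait()
    return sort_code, uniq_code
```

The orchestrator only wires up file descriptors; the components stay independent processes that can still be run and debugged by hand.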
All in all, I discourage any kind of complicated shell script, but I believe composing small, focused programs is a safe and nice approach for many common problems.
If you take it really generally as a programming model, I can buy that. I read the original article as arguing something different, more about using the classic (and built in, unlike erlang) Unix approach to processing based on pipes, I/O redirection, and a suite of built-in utilities. Especially the example of xargs is really hard to make sense of outside of that context. If you were writing erlang, I’m having trouble imagining why you would ever call xargs; it’s really a tool for use in shell pipelines.
Yes, xargs is probably specific to shell scripting. One thing I like about this approach is that it is possible to debug the individual components (and even test them) in the shell. In that case tools like xargs, tee, etc. are handy.
The proposal is to do composition via pipes and the filesystem (presumably writing a script to orchestrate them). I don’t think this is fundamentally more “composing from simple pieces” than a program written using something like fs2 style would be.
In my view bash != OS, bash is just a crappy shell that uses a crappy scripting language and that was used in a crappy way to run CGI programs. Most of the shellshock impact was derived from poor practices, not from using simple programs that take advantage of the OS functionality.
If you’re doing composition via pipes and/or the filesystem then what you transfer between successive stages necessarily has to be streams of bytes. Any control instructions have to be mixed with data via “magic” data values - exactly what led to shellshock. I agree that it’s poor practice, but using the OS functionality necessitates that practice.
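To make the “magic data values” point concrete, here is a small illustrative sketch (not shellshock itself, just the general in-band signalling problem). With newline-terminated records, a newline inside the data is indistinguishable from a record boundary; a length-prefixed framing keeps the control information out of the data. All function names here are made up for the example.

```python
import struct


def encode_newline(records):
    """Newline-terminated records: the delimiter lives in the same byte stream as the data."""
    return "".join(r + "\n" for r in records).encode()


def decode_newline(stream):
    # Every "\n" is treated as a record boundary, whether or not the writer meant it as one.
    return stream.decode().split("\n")[:-1]


def encode_framed(records):
    """Length-prefixed records: a 4-byte big-endian length, then the payload."""
    out = b""
    for r in records:
        data = r.encode()
        out += struct.pack(">I", len(data)) + data
    return out


def decode_framed(stream):
    records, pos = [], 0
    while pos < len(stream):
        (length,) = struct.unpack_from(">I", stream, pos)
        pos += 4
        records.append(stream[pos:pos + length].decode())
        pos += length
    return records
```

Round-tripping ["a.txt", "b\nc.txt"] through the newline version yields three records instead of two, while the framed version returns the original list. The framed stream is still just bytes on a pipe, so this is a convention between the two ends rather than something the OS enforces.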
I don’t think shell scripting is necessarily a part of the proposal. My understanding is that the relevant bit is composing small programs that work over bytestreams and bypassing any complicated storage as long as the file system is good enough.
I am not familiar with fs2, but my understanding after a quick perusal is that it implements the same concept, but instead of using independent processes it provides functions native to the orchestrating language (Scala in this case). The difference there is that you don’t get process isolation, which I think is valuable if you can get away with it (sometimes it is not possible for performance reasons).
What led to shellshock, in my view, was an old dubious feature of a given, and not particularly safe, shell. That feature basically involves executing arbitrary streams as code. The main shellshock exploit, the one through Apache’s CGI, further involves exporting unsanitised user input (HTTP headers) as variables to be used by a shell with potentially high privileges. I cannot find a clear link from that to implying that independent command composition over data streams is an unsafe practice. All the steps that led to shellshock are bad practices, yes, but they are not practices you need in order to compose programs in “unix style”.
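A hedged sketch of the alternative, assuming a Python orchestrator: the untrusted text reaches the component on stdin as plain data, it is never exported as an environment variable, and no shell sits in between to interpret it. The helper run_with_untrusted_input and the demo child are hypothetical, only there to show the shape.

```python
import subprocess
import sys


def run_with_untrusted_input(argv, untrusted):
    """Run argv (an argument list, no shell) and feed the untrusted text on stdin."""
    result = subprocess.run(
        argv,
        input=untrusted,       # delivered as data on the child's stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical component: reports how many characters it received on stdin.
DEMO_CHILD = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
```

Feeding it a shellshock-style string such as "() { :; }; echo pwned" just yields the length of that string; nothing in the chain treats the input as code.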
OS-level process isolation only really provides memory isolation IME, which is a lot less important in a memory-safe language. (File handles and I/O bandwidth can still be an issue; the OS does share CPU up to a point but it’s very easy to get priority inversions and the like where suddenly you do need to be concerned about what the other processes are doing. You can use processes for resource management up to a point (i.e. rely on exiting to close file handles) but I don’t think that’s any easier than doing it in a programming language - you have to worry about the worst-case for how long your processes live and the like).
A zero-shared-memory model is a big advantage for consistent performance, but as you say it means you incur the overhead of serializing everything; even more than the runtime performance, until recently I found that serialization of sum types (tagged unions) was unpleasant at the code level in the main options (json/thrift/protobuf). Maybe it’s got better. If you allow your processes to share memory then that works but you’re not getting much advantage from the process isolation. I understand the Erlang runtime manages green processes that have an explicit distinction between shared and non-shared heaps; I don’t have the experience to know if that’s useful in practice, but it seems like it shouldn’t be in theory - compared to Scala et al you’re moving some things out of the shared heap into private heaps, but as long as you’re using a shared heap you will have the problems of shared heaps.
Arbitrary streams are the only thing that unix has, so everything is passed around as them, necessarily including both user input and things that will be executed as code. I took the original post to be arguing for using specifically unix pipes and unix utilities - because if it’s just advocating composing together small reusable components then you can do that equally well in Python. In any case I think you and I agree on what the best practice is.
I do think that memory isolation is very important, as is process isolation. That keeps failure modes much more contained. Type safety across components would be a nice addition, and you can get type safety within the components just by choosing the right language.
Type safety in the protocols the components use to communicate is more or less difficult depending on how much control you have over them, but you can always engineer the protocols so that they are checked against contracts, at the very least.
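As one possible reading of “checked against contracts”, here is a minimal sketch assuming the components exchange one JSON object per line, tagged by a "type" field. The contract itself (the "ok" and "error" shapes and their field names) is entirely hypothetical.

```python
import json

# Hypothetical contract: allowed record types and the fields each must carry.
CONTRACT = {
    "ok": {"host": str, "uptime_seconds": int},
    "error": {"host": str, "message": str},
}


def check_record(line):
    """Parse one JSON line and verify it against CONTRACT; raise ValueError if it does not fit."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    tag = record.get("type")
    if not isinstance(tag, str) or tag not in CONTRACT:
        raise ValueError(f"unknown record type: {tag!r}")
    expected = CONTRACT[tag]
    if set(record) != set(expected) | {"type"}:
        raise ValueError(f"unexpected fields for {tag!r}: {sorted(record)}")
    for name, wanted in expected.items():
        # Exact type match, so that JSON true/false is not accepted where an int is required.
        if type(record[name]) is not wanted:
            raise ValueError(f"field {name!r} should be {wanted.__name__}")
    return record
```

A consumer would call check_record on each line it reads from the pipe and reject anything that does not fit, which gives a runtime check at the boundary even when the producer is not under your control.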
What having these small, isolated components as building blocks buys you is simplicity. It is easier to find out which part is misbehaving if something goes wrong: you can just take that part and poke at it by catting to it and grepping from it, you can watch how many resources it consumes with top, poke at the descriptors it has open in /proc, kill it if needed, etc. Plus they tend to compose very well with just reasonable designs. Or at least that is my experience.
A minor clarification about Erlang. The Erlang programming model presents you with memory isolation: processes share nothing. The VM does share large binaries, but that is not visible from the code (except in some very specific situations where performance can be affected if certain programming patterns are not followed). Still, an Erlang application is a single OS process, so from the perspective of this post, if you programmed the logic in Erlang (instead of ugly shell scripting), you’d have the Erlang process, and all the other components running as separate processes. I’ve built a couple of systems that way and I think it worked great.