I think we’re very much moving the goalposts when talking about the fragility of the pipeline.
The pipeline does descriptive stuff to nondescriptive text. In other words, “I’d like the second word of every row” is a pipeline. That second word may be what you want or something nobody wants. It’s context-free.
But then we add a business requirement: Oh, I actually want the unique colors of dogs in this list.
That’s not the same question. In fact, it makes dozens, perhaps millions, of tiny assumptions: that the list is animals, that the list is in ASCII, and so forth. There may not be an end to the number of assumptions you can throw into one business request. “I’d like to build a rocket to go to Mars” is only ten words, but think of the assumptions and implied detail in there.
We try to fix that in non-pipeline languages by grouping ideas together into various “validator” patterns and implementing them. For this essay, if it’s got a hyphen it’s not a dog. But there’s really no end to that kind of work either, and what you end up doing is moving the tests so far off the cognitive radar that you forget about them. It still breaks, but by then you’ve forgotten what it is that’s broken. Instead of looking at one line of code and trying to piece out what went wrong, you’re asking the same question, only perhaps having to dive through thousands of lines of code. This might be great or horrible, but it’s certainly tough on the Mark One noggin.
A better question is probably to ask “Who cares?” So I’ve got this list, I’d like the second word in each row. I think that’s color. Who cares if I’m wrong? So far in our conversation, nobody. So write the dang pipe and move on to the next thing. If you’re building the next Space Shuttle, you’ll have a different answer. To assume that there’s only one answer here and a proper and improper way to code it? Not a good path forward.
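To make the gap between the two requests concrete, here is a minimal sketch. The sample rows and the column meanings (name, color, kind) are invented for illustration and are not from the essay.

```shell
# Hypothetical sample data: name, color, kind (invented for illustration).
rows='rex brown dog
fido black dog
tom grey cat
spot brown dog'

# Context-free: the second word of every row, whatever it happens to mean.
second_words=$(printf '%s\n' "$rows" | awk '{print $2}')

# The business request layers assumptions on top: the third word is the kind
# of animal, the second is its color, and words are whitespace-separated.
dog_colors=$(printf '%s\n' "$rows" | awk '$3 == "dog" {print $2}' | sort -u)

printf '%s\n' "$dog_colors"
```

The first pipe cannot be wrong about the data, only about what you wanted. The second one is wrong the moment any of its assumptions about the rows stops holding.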
The “pitfalls” are kind of a misnomer. A pipeline is a series of programs designed to deal with some kind of data. If you change the data, you gotta change the programs, and I don’t really see that as anything other than the nature of programming.
Shell pipelines are often used to deal with poorly specified data, and it’s easy for bugs to creep in once you stop checking the output. This is a problem shared with spreadsheets.
I agree, but that’s not a fault of the pipeline, it’s more a round-peg-square-hole scenario.
It’s not clear to me that there exist any square pegs in this context, that is, well-specified data sources that don’t use a recursive format like CSV or JSON. With CSV it’s tempting to reach for cut -d , until you realize the generator will insert quotes once in a very long while, when there are commas in the data.
I’m pretty proficient with pipelines. But I always double-check my inputs and outputs. If the data scales beyond my ability to eyeball, I stop using pipelines.
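The quoting problem is easy to reproduce. A small sketch, using a made-up two-row CSV (the rows and the idea that the third column is a price are assumptions for illustration):

```shell
# Hypothetical two-row CSV; the second row has a quoted field containing a comma.
csv='1,Widget,9.99
2,"Gadget, large",19.99'

# Naive parsing: treat every comma as a field separator and take field 3.
prices=$(printf '%s\n' "$csv" | cut -d , -f 3)

printf '%s\n' "$prices"
```

The first row yields 9.99. The second yields the tail of the quoted name instead of the price, and cut exits successfully either way, so nothing downstream is told that anything went wrong.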
But you can use pipelines without using cut -d ,! There are lots of CSV and TSV utils that do non-naive parsing.
https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil#projects (csvkit, xsv, etc.)
Naive parsing is bad but nothing is forcing you to do it. So this is not a problem with pipelines per se, but the way people use them.
Although I think some support in the shell will help guide people toward non-naive parsing, so Oil should have a small upgrade over TSV which is called QTT (Quoted, Typed Tables): https://github.com/oilshell/oil/wiki/TSV2-Proposal
For sure. Taking it back, I think OP’s point stands that using shell pipelines has pitfalls.
You can use a principled way to solve a problem. Or you can use pipelines. Choose.
By the words „(classic) unix pipeline“ we usually mean not only the pipeline itself (passing a stream of bytes from the STDOUT of one process to the STDIN of another) but also the classic tools (grep, sed, cut). The rest of this project’s website provides more context…
No I disagree, what I’m saying is that you can use pipelines in either a principled or unprincipled way / correct or naive way.
I don’t see any argument that it’s either-or.
You don’t seem to be disagreeing that using shell pipelines has pitfalls, so I assume we’re disagreeing about what “principled” means. For me it means, when you make a mistake you get an error. You don’t silently get bad data. I fail to see how pipelines permit that. As you said, you could use the correct parser, but it’s also easy/idiomatic to use the wrong one. So I’m curious what “principled” means to you.
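One way to picture "you get an error instead of bad data" inside a pipeline is a stage that refuses input it does not understand. The sketch below is only an illustration: check_fields is a hypothetical helper, not an existing tool, and it is still a naive check, since it rejects valid quoted CSV rather than parsing it.

```shell
# check_fields N: copy stdin to stdout, but stop with a non-zero status as
# soon as a line does not split into exactly N comma-separated fields.
# (Hypothetical helper for illustration; it does not understand CSV quoting.)
check_fields() {
  awk -F, -v n="$1" '
    NF != n {
      printf("line %d: expected %d fields, got %d\n", NR, n, NF) > "/dev/stderr"
      bad = 1
      exit
    }
    { print }
    END { exit bad + 0 }
  '
}

good='1,Widget,9.99'
bad='2,"Gadget, large",19.99'

# Capture output and status separately so the failure is visible.
good_status=0
good_out=$(printf '%s\n' "$good" | check_fields 3) || good_status=$?

bad_status=0
bad_out=$(printf '%s\n' "$bad" | check_fields 3) || bad_status=$?

printf 'good: status=%s out=%s\n' "$good_status" "$good_out"
printf 'bad:  status=%s out=%s\n' "$bad_status" "$bad_out"
```

Note that the status is read from a command substitution where check_fields is the last stage. In a longer pipeline the shell reports only the last command's status unless something like pipefail is enabled, which is its own way of losing an error.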
Hmm, I suppose you could argue that you can use shell pipelines as long as you’re principled and use the right parser for the data format at each pipe stage.
So I think the disagreement is what OP @franta pointed out in a sibling comment: “classic pipelines” using grep/sed/cut vs. “pipelines” as a general mechanism.
“Classic pipelines” defined that way have pitfalls. You can approximate some transformations on structured data, but they’re not reliable.
But other ways of using pipelines already exist and are not theoretical: csvkit, xsv, which I pointed to above.
It’s up to those tools – not the shell or the kernel – to validate the data. Although I just tested csvkit and it doesn’t seem to complain about extra commas.
I guess that validates me writing my own CSV-over-pipes utilities for Oil, which actually do check errors:
I’ll concede that there’s a culture of sloppiness with text in Unix, but that can be fixed, just like sloppiness with memory safety is a cultural change.
So the point is that pipelines are a great mechanism, but the user space tools need to be improved. I’ve been working on that for a while :)
Thanks, yeah I’m persuaded that the issue is one of culture rather than anything technical.
Here’s another one that I ran into: if you use cut -c1-80 to get the first 80 characters of a string, you will fail in subtle ways if your string starts including UTF-8 multibyte characters, since it will get the first 80 bytes. There’s no way to fix this short of converting to a fixed-width representation and then back, which is very silly.
How did I find this out? When the command I was piping it to started failing on invalid UTF-8 input.
On the other hand, GNU awk will just do the right thing. But you have to know this pitfall exists, and it’s embarrassing that GNU cut breaks this way after decades of Unicode being a thing.
GNU cut has -b for bytes and -c is meant for characters, but not yet implemented:
The same as -b for now, but internationalization will change that.
Also, wc has -m option for character count, but head, tail, tr etc work on bytes only.
This is a GNU cut issue, again not the fault of pipelines as we use them: you used a program unsuited to that kind of input, simple as that.
It’s certainly not the fault of pipelines, but if the ‘traditional’ pipeline tools are still stuck in a world where multibyte characters don’t exist, that’s pretty bad. It means that if you’re going to be processing text in German, or Japanese, or Russian, then cut -c has a chance to just silently break at some point. Even in English it can fail if you give it input with a word like “passé”!
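The “passé” case can be shown with cut -b, which is byte-oriented by definition. Per the manual text quoted above, GNU cut -c behaves the same way; other cut implementations may differ, so this sketch deliberately avoids asserting anything about -c itself.

```shell
# "passé" encoded as UTF-8: 5 characters, 6 bytes (é is the two bytes C3 A9).
# Octal escapes are used so the example does not depend on this file's encoding.
word=$(printf 'pass\303\251')

# Byte length of the word; the arithmetic expansion strips wc's padding.
word_bytes=$(( $(printf '%s' "$word" | wc -c) ))

# Keep the first 5 bytes, then hex-dump what survived (including the newline).
cut_hex=$(printf '%s\n' "$word" | cut -b 1-5 | od -An -tx1 | tr -d '[:space:]')

printf 'bytes in word: %s\n' "$word_bytes"
printf 'after cut -b 1-5: %s\n' "$cut_hex"
```

The dump ends in c3 0a: the lead byte of é followed directly by the newline. That dangling c3 is not valid UTF-8, which is exactly the kind of output that makes the next command in the pipe fail.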