Consume input from stdin, produce output to stdout.
This is certainly a good default, though it’s helpful to also offer the option of a -o flag to redirect output to a file opened by the program itself instead of by the shell via stdout redirection. While it’s a small degree of duplication of functionality (which is unfortunate), it makes your program much easier to integrate into makefiles properly.
Without a -o flag:
bar.txt: foo.txt
myprog < $< > $@
If myprog fails for whatever reason, this will still create bar.txt, resulting in subsequent make runs happily proceeding with things that depend on it.
In contrast, with a -o flag:
bar.txt: foo.txt
myprog -o $@ < $<
This allows myprog to (if written properly) only create and write to its output file once it’s determined that things are looking OK [1], preventing further make runs from spuriously continuing on after a failure somewhere upstream.
(You can work around the lack of -o with a little || { rm -f $@; false; } dance after the stdout-redirected version, but it’s kind of clunky and has the disadvantage of deleting an already-existing output file on failure. This in turn can also be worked around by something like myprog < $< > $@.tmp && mv $@.tmp $@ || { rm -f $@.tmp; false; } but now it’s three times as long as the original command…might be nice if make itself offered some nicer way of solving this problem, but I’m not aware of one.)
[1] Or preferably, write to a tempfile (unlinking it on failure) and rename it into the final output file only when completely finished so as to avoid clobbering or deleting an existing one if it fails partway through.
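GNU make has a .DELETE_ON_ERROR special target: https://www.gnu.org/software/make/manual/html_node/Special-Targets.html#index-_002eDELETE_005fON_005fERROR
It’s closer to your first example than the second though.
Roughly, assuming GNU make (the recipe line needs its usual leading tab):
.DELETE_ON_ERROR:
bar.txt: foo.txt
	myprog < $< > $@
With that special target mentioned anywhere in the makefile, make deletes bar.txt when the recipe exits non-zero, so later runs rebuild it instead of happily proceeding.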
Yes, every tool should have a custom format that needs a badly cobbled-together parser (in awk or whatever) that will break once the format is changed slightly or the output accidentally contains a space. No, jq doesn’t exist, can’t be fitted into Unix pipelines, and we will be stuck with sed and awk until the end of time, occasionally trying to solve the worst failures with find -print0 and xargs -0.
JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).
In a JSON shell tool world you will have to spend time parsing and re-arranging JSON data between tools; as well as constructing it manually as inputs. I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).
Sidestory: several months back I had a co-worker who wanted me to write some code that parsed his data stream and did something with it (it was plotting-related, IIRC).
Me: “Could I have these numbers in one-record-per-row plaintext format please?”
Co: “Can I send them to you in JSON instead?”
Me: “Sure. What will be the format inside the JSON?”
Co: “…. it’ll just be JSON.”
Me: “But in what form? Will there be a list? Names of the elements inside it?”
Co: “…”
Me: “Can you write me an example JSON message and send it to me, that might be easier.”
Co: “Why do you need that, it’ll be in JSON?”
Grrr :P
Anyway, JSON is a format, but you still need a format inside this format. Element names, overall structures. Using JSON does not make every tool use the same format, that’s strictly impossible. One tool’s stage1.input-file is different to another tool’s output-file.[5].filename; especially if those tools are for different tasks.
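To make the re-arranging concrete, a world like that is full of glue along these lines (the tool names and key names here are made up):
# hypothetical shapes: tool-a emits {"output-file": [{"filename": ...}, ...]},
# while tool-b expects {"stage1": {"input-file": ...}} on stdin
tool-a --json | jq '{stage1: {"input-file": .["output-file"][0].filename}}' | tool-b
Not obviously less hacky than the awk equivalent, just hacky in a different syntax.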
I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).
Except that standardized, popular formats like JSON have the side effect of growing tool ecosystems that solve most of the problems they can bring. Autogenerators, transformers, and so on come along with anything that is a common data format. We usually don’t get this when random people create formats for their own use: we have to fully custom-build the part that handles the format rather than adapt an existing one.
Still, even XML, which had the best tooling I have used so far for a general-purpose format (XSLT and XSD first and foremost), was unable to handle partial results.
The issue is probably due to their history as representations of a complete document or data structure.
Even s-expressions (the simplest format of the family) have the same issue.
Now we should also note that pipelines can be created on the fly, even from binary data manipulations. So a single dictated format would probably pose too many restrictions, if you want the system to actually enforce and validate it.
“Still, even XML”
XML and its ecosystem were extremely complex. I used s-expressions with partial results in the past; you just have to structure the data so it’s easy to get a piece at a time. I can’t recall the details right now. Another one I used, trying to balance efficiency, flexibility, and complexity, was XDR. Too bad it didn’t get more attention.
“So a single dictated format would probably pose too many restrictions, if you want the system to actually enforce and validate it.”
The L4 family usually handles that by standardizing on an interface description language, with all of it auto-generated. Works well enough for them. CAmkES is an example.
It is coherent, powerful and flexible. One might argue that it’s too flexible or too powerful, so that you can solve any of the problems it solves with simpler custom languages. And I would agree to a large extent.
But, for example, XHTML was a perfect use case. Indeed, to do what I did back then with XSLT, now people use JavaScript, which is less coherent and way more powerful, and in no way simpler.
The L4 family usually handles that by standardizing on an interface description language, with all of it auto-generated.
Yes but they generate OS modules that are composed at build time.
Pipelines are integrated on the fly.
I really like strongly typed and standard formats but the tradeoff here is about composability.
UNIX turned every communication into byte streams.
Bytes byte at times, but they are standard, after all! Their interpretation is not, but that’s what provides the flexibility.
Indeed, to do what I did back then with XSLT, now people use JavaScript, which is less coherent and way more powerful, and in no way simpler.
While I am definitely not a proponent of JavaScript, computations in XSLT are incredibly verbose and convoluted, mainly because XSLT for some reason needs to be XML and XML is just a poor syntax for actual programming.
That, plus the fact that my transformations worked fine with xsltproc but did just nothing in browsers, without any decent way to debug the problem, made me put XSLT away as an esolang: lots of fun for an afternoon, not what I would use to actually get things done.
That said, I’d take XML output from Unix tools and some kind of jq-like processor any day over manually parsing text out of byte streams.
I loved it back when I was doing HTML and wanted something more flexible that machines could handle. XHTML was my use case as well. Once I was a better programmer, I realized it was probably an overkill standard that could have been something simpler, with a series of tools each doing their own little job. Maybe even different formats for different kinds of things. W3C ended up creating a bunch of those anyway.
“Pipelines are integrated on the fly.”
Maybe put it in the OS like a JIT. As far as byte streams go, that’s mostly what XDR did: they were just minimally structured byte streams. Just tie the data types, layouts, and so on to whatever language the OS or platform uses the most.
JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).
This is true, but it does not mean that having some kind of common interchange format would not improve things. So yes, it does not tell you what the data will contain (but “custom text format, possibly tab-separated” is, again, not better). I know the problem, since I often work with JSON that contains or is missing things. But the answer is not to avoid JSON, it is to have specifications. JSON has a number of possible schema formats, which puts it at a big advantage over most custom formats.
The other alternative is of course something like ProtoBuf, because it forces the use of proto files, which is at least some kind of specification. That throws away the human readability, which I didn’t want to suggest to a Unix crowd.
Thinking about it, an established binary interchange format with schemas and a transport is in some ways reminiscent of COM & CORBA in the nineties.
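For illustration, a minimal JSON Schema of the kind such a specification could use (the field names are invented for the example):
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["file"],
  "properties": {
    "file": { "type": "string" },
    "size": { "type": "integer" }
  }
}
Extra keys in an object are still allowed by default; a missing "file" or a wrong type can be rejected up front instead of silently breaking a pipeline.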
Doesn’t this happen with JSON too?
A slight change in the key names, or turning a string into a list of strings, and the recipient won’t be able to handle the input anyway.
the output accidentally contains a space.
Or the output accidentally contains a comma: depending on the parser, the behaviour will change.
No, jq doesn’t exist…
jq is great, but I would not say JSON should be the default output when you want composable programs.
For example, a JSON document is always one whole value at the root, and this won’t work for streams that get produced slowly.
Using a whitespace-separated table such as the one suggested in the article is somewhat vulnerable to continuing to appear to work after the format has changed, while actually misinterpreting the data (e.g. if a new column is inserted at the beginning, your pipeline could happily continue, since all it needs is at least two columns with numbers in them). JSON is more likely to either continue working correctly and ignore the new column, or fail with an error. Arguably it is the key-value aspect that’s helpful here, not specifically JSON. As you point out, there are other issues with using JSON in a pipeline.
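A toy version of the whitespace-table failure mode (column and key names invented):
# two-column input, sum the counts
printf 'alice 3\nbob 5\n' | awk '{ sum += $2 } END { print sum }'    # 8
# a new leading id column appears; the same pipeline still "works", now summing names as 0
printf '101 alice 3\n102 bob 5\n' | awk '{ sum += $2 } END { print sum }'    # 0
# the key-value version keeps giving the right answer despite the extra field
printf '{"id":101,"name":"alice","count":3}\n{"id":102,"name":"bob","count":5}\n' | jq -s 'map(.count) | add'    # 8
Hands up everybody that has to write parsers for zpool status and its load-bearing whitespaces to do ZFS health monitoring.
On the other hand, most Unix tools use a tabular format or a key-value format. I do agree though that the lack of guidelines makes it annoying to compose.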
In my day-to-day work, there are times when I wish some tools would produce JSON and other times when I wish a JSON output was just textual (as recommended in the article). Ideally, tools should be able to produce different kinds of outputs, and I find libxo (mentioned by @apy) very interesting.
I spent very little time thinking about this after reading your comment, and wonder what, for example, the coreutils would look like if they accepted/returned JSON as well as plain text.
A priori we have this awful problem of making every tool understand every other tool’s input and output schemas, but that might not be necessary. For any tool that expects a file as input, we make it accept any JSON object that contains the key-value pair "file": "something". For tools that expect multiple files, have them take an array of such objects. Tools that return files, like ls for example, can then return whatever they want in their JSON objects, as long as those objects contain "file": "something". Then we should get to keep chaining pipes of stuff together without having to write ungodly amounts of jq between them.
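For instance (keys besides "file" are invented for the example), a JSON-speaking ls might emit one object per directory entry:
{"file": "notes.txt", "size": 1432, "owner": "alice"}
{"file": "build.log", "size": 98234, "owner": "alice"}
and any downstream tool that just wants files could consume these as-is, ignoring the keys it doesn’t know about, because the "file" pair is always there.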
I have no idea whether people have tried doing this or anything similar. Is there prior art?
In FreeBSD we have libxo, which a lot of the CLI programs are getting support for. It lets the program print its output once and have it translated to JSON, HTML, or other output forms automatically. So that would allow people to experiment with various formats (although it doesn’t handle reading the output back in).
But as @Shamar points out, one problem with JSON is that you need to parse the whole thing before you can do much with it. One can hack around it, but then you are kind of abusing JSON.
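That looks like a fantastic tool, thanks for writing about it. Is there a concerted effort in FreeBSD (or other communities) to use libxo more?
FreeBSD definitely has a concerted effort to use it, I’m not sure about elsewhere. For a simple example, you can check out wc and its --libxo option.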
powershell uses objects for its pipelines, i think it even runs on linux nowadays.
i like json, but for shell pipelining it’s not ideal:
the unstructured nature of the classic output is a core feature. you can easily mangle it in ways the program’s author never assumed, and that makes it powerful.
with line-based records you can parse incomplete (as in: the process is not finished) data more easily. you just have to split after a newline. with json, technically you can’t begin using the data until a (sub)object is completely parsed. using half-parsed objects seems not so wise.
if you output json, you probably have to keep the structure of the object tree you are generating in memory, like “currently i’m in a list in an object in a list”. that’s not ideal sometimes (one doesn’t have to use real serialization all the time, but it’s nicer than just printing the correct tokens at the right places).
json is “javascript object notation”. not everything is ideally represented as an object. that’s why relational databases are still in use.
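edit: be nicer ;)
For what it’s worth, the usual middle ground is one JSON value per line instead of one big document, which keeps the split-on-newline property; a sketch (the file name is made up):
# newline-delimited JSON: every line is a complete value, usable as soon as it arrives
tail -f events.jsonl | jq --unbuffered -c 'select(.level == "error")'
# with a single document (say, one big array) the consumer has to see the closing
# bracket before .[] can yield anything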