1. 56

  2. 10

    I applaud the author for their efforts, and for writing down their motivation and retrospectives like this. I use a bunch of programs from suckless.org (either their own, or their recommendations) which has a similar philosophy.

    I think a lot of the cruft in software comes from adding more and more code, usually while we’re still trying to figure out the nuances of the problems we’re dealing with, and then failing to prune and consolidate if/when we do gain enlightenment about something. Poor practices probably help to cement such legacy codebases: not necessarily the practices of a project’s own developers, but all of the unexpected things that refactoring may break.

    Some more specific thoughts:

    TSV:

    I agree that having tab-separated output would be nice. I make heavy use of the cut and paste commands, which default to TSV. I wouldn’t impose this on everything, but it should be the de facto standard for outputting tabular data, rather than the ad hoc ASCII art that’s become common.

    More generally, I try to only output machine-readable data (e.g. TSV, JSON, s-expressions, Python’s repr, etc.). Lightweight formats like these are easy to generate; they save me from having to care about presentation, and they can make subsequent processing easier (even if it’s just running grep on a log). When I want something to be human-readable, I’ll write a pretty-printer for the machine-readable format (either as a function in the language I’m using, or as a separate tool).
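
    For illustration, here’s a minimal Python sketch of that habit (the function names are mine, not from any particular tool): the program only ever emits TSV, and a separate pretty-printer renders it for humans.

    # The tool itself only writes machine-readable TSV.
    def emit_tsv(rows):
        for row in rows:
            print("\t".join(str(field) for field in row))

    # A separate pretty-printer aligns the columns for human eyes.
    def pretty_print(tsv_text):
        rows = [line.split("\t") for line in tsv_text.splitlines()]
        widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
        for row in rows:
            print("  ".join(field.ljust(w) for field, w in zip(row, widths)))

    emit_tsv([("foo.c", 100234), ("foo.h", 1234)])
    pretty_print("name\tnum_bytes\nfoo.c\t100234\nfoo.h\t1234")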

    TAP:

    I see the “test anything protocol” was mentioned. It’s a nice idea (write ok foo to stdout to indicate that foo passed, write not ok foo to indicate that foo failed; other tools can aggregate, process, track and display this data). The problem is that there isn’t much tooling to handle it; most of the tooling that exists is underwhelming (e.g. adding ANSI terminal colours to the output), and most of it requires numbered tests.

    This last point was important for me: the reason I like TAP is that my tests can focus on testing, throwing results to stdout for others to care about. Having to count the number of tests run so far ruins this, since it forces some global state to be tracked (trivial if we’re using a single language with global mutable state running in a single process; tedious if not). Even worse, most TAP tools won’t work without a “plan” specifying up-front how many tests will be run. This means we can’t just fire off arbitrary scripts and sub-scripts; we need some way to “declare” all of our tests before running them (e.g. functions, class methods, whatever), and that needs to be accessible from a single language (so we can count them and print out the plan). If we’re going to do that, we might as well use a “proper” test framework like xUnit or whatever, which is what I inevitably end up doing.

    Note that the TAP specification explicitly states that numbering test outputs is not required (although it’s recommended), so the tools which require this are broken. It seems ambiguous whether a plan is required or not (to me, it seems to be optional, appearing at most once); it also doesn’t need to appear first, so tools requiring that are also broken (emphasis mine):

    The plan tells how many tests will be run, or how many tests have run. It’s a check that the test file hasn’t stopped prematurely. It must appear once, whether at the beginning or end of the output.

    The plan is optional but if there is a plan before the test points it must be the first non-diagnostic line output by the test file. In certain instances a test file may not know how many test points it will ultimately be running. In this case the plan can be the last non-diagnostic line in the output. The plan cannot appear in the middle of the output, nor can it appear more than once.

    Incidentally, there seem to be lots of libraries for generating TAP output; but the whole point is that it’s trivial, so I’d rather use a print statement (or semantically-named wrapper functions) than a 3rd-party dependency.
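
    Something like this minimal Python sketch is all I mean (the wrapper names are mine; per the spec quoted above, test numbers are recommended but not required, so no counter or global state is needed):

    # Semantically-named wrappers around print; this is the whole "framework".
    def ok(description):
        print("ok - " + description)

    def not_ok(description):
        print("not ok - " + description)

    ok("parses empty input")
    not_ok("handles tabs in file names")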

    1. 7

      More generally, I try to only output machine-readable data (e.g. TSV, JSON, s-expressions, Python’s repr, etc.). Lightweight formats like this are easy enough to generate, it prevents me having to care about presentation, and it can make subsequent processing easier (even if it’s just running grep on a log). When I want something to be human-readable, I’ll write a pretty-printer for the machine-readable format (either as a function in the language I’m using, or as a separate tool).

      I really like JSON for this. It’s lightweight and 100% language-agnostic. There are JSON parsers for just about every programming language under the sun, even COBOL! :)

      1. 6

        One thing that’s very annoying about JSON is that there’s no easy way to encode arbitrary bytes in a human-friendly way. Consider, for example, the simple case of putting a Unix (or Windows) file path into a JSON document. How do you do it? You could take the easy way out and just base64-encode it into a string, but now humans can’t read it. JSON isn’t necessarily meant to be consumed by humans, but regardless, it’s still useful to be able to glance at it and get a general idea of what’s in it. (Encoding file paths as an array of integers is similarly opaque.) Windows has the same problem, since its file paths are arbitrary sequences of 16-bit integers.

        This also goes for a lot of other things, such as the contents of a file. Does your JSON need to show the contents of a file (or part of a file) in a non-lossy way? Oops. Files in most environments don’t have a defined encoding, and JSON does nothing to help you deal with it.
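
        To make this concrete, here’s a quick Python demonstration: a perfectly legal Unix file name that can’t pass through a JSON string because it isn’t valid UTF-8.

        import json

        path = b"foo\xffbar"            # legal on most Unix file systems
        try:
            json.dumps(path.decode("utf-8"))
        except UnicodeDecodeError as err:
            print(err)                  # 'utf-8' codec can't decode byte 0xff ...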

        1. 2

          I don’t understand what problem you’re having with paths in JSON:

          In [1]: whereizat = {'configs': '/foo/bar/baz/.config', 'whoopie_cusions':'/your/local/jokeshop'}
          
          In [2]: import json
          
          In [3]: json.dumps(whereizat)
          Out[3]: '{"configs": "/foo/bar/baz/.config", "whoopie_cusions": "/your/local/jokeshop"}'
          
          In [4]:
          
          1. 8

            The file paths you’ve used in your example are all ASCII. But this is not true of all file paths. They can contain arbitrary bytes (sans \x00). JSON strings cannot contain arbitrary bytes. Similarly for the contents of files on most file systems.

            1. 1
              In [5]: whereizat = {'configs':"""/🚀foo/bar/baz/.config""", 'whoopie_cusions':"""/your/local/jokeshop😃'"""}
              
              In [6]: import json
              
              In [7]: json.dumps(whereizat)
              Out[7]: '{"configs": "/\\ud83d\\ude80foo/bar/baz/.config", "whoopie_cusions": "/your/local/jokeshop\\ud83d\\ude03\'"}'
              
              

              Also, this Stack Overflow entry claims something different from what you assert.

              I’m no expert here. I don’t work with unicode generally.

              1. 3

                I’m no expert here. I don’t work with unicode generally.

                As the author of ripgrep, I do. And in particular, I work in precisely the space where Unicode meets the realities of what file systems support.

                The SO link you provided doesn’t seem relevant.

                The bottom line is so much simpler than what you’re making it out to be: file paths on most file systems in Unix environments are permitted to be arbitrary bytes. JSON, on the other hand, requires strings to be valid Unicode. More specifically, they must be UTF-8, UTF-16 or UTF-32 encoded. Therefore, there are both file paths that cannot be encoded as JSON strings, and JSON strings that cannot be used as file paths. For example, "foo\u0000bar" is a valid JSON string, but its UTF-8 encoding is not a valid file name since it contains an interior NUL byte. Conversely, the byte sequence foo\xFFbar is a valid file name, but cannot be non-lossily encoded directly as a JSON string since \xFF is invalid UTF-8. Thus, you must define some additional encoding on top of JSON strings in order to non-lossily roundtrip all possible file paths through JSON strings. (Alternatively, encode them as raw bytes in a JSON array of integers.)

                Both of your examples in your previous two comments are valid UTF-8 and do not contain any interior NUL bytes. Thus, they are in the set of valid file paths that are also valid JSON strings. (Countering my claim via counter-example is barking up the wrong tree anyway.)

                You can see how ripgrep deals with this here: https://docs.rs/grep-printer/0.1.2/grep_printer/struct.JSON.html#text-encoding
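
                For reference, a rough Python sketch of the convention those docs describe (a “text” key for valid UTF-8, a base64 “bytes” key otherwise), so nothing is lost:

                import base64, json

                def encode_path(raw):
                    try:
                        return {"text": raw.decode("utf-8")}
                    except UnicodeDecodeError:
                        return {"bytes": base64.b64encode(raw).decode("ascii")}

                print(json.dumps(encode_path(b"/home/user/notes.txt")))
                # {"text": "/home/user/notes.txt"}
                print(json.dumps(encode_path(b"foo\xffbar")))
                # {"bytes": "Zm9v/2Jhcg=="}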

                1. 2

                  Thanks for this. I was missing the distinction that JSON supports unicode but not arbitrary bytes. I know, you said that very thing and I was too dense to actually absorb it :) Appreciate the link and the response.

        2. 4

          I usually use JSON for any results with more structure than TSV.

          For error messages I usually use whatever’s quickest in the language I’m using; e.g. in Lisps it might be s-expressions, in Nix it might be toJSON and in Python it might be repr.

      2. 9

        You could do the data interchange thing without rewriting coreutils (and every other program). A structured shell could wrap each utility with a parser into its own object format. Adding support for new tools would be like writing completion scripts, and would not require any changes to the programs themselves.

        I’m kind of tempted by this idea, actually.
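
        As a rough sketch of what such a wrapper might look like (the field handling is deliberately simplified; the jc tool does this for real for many commands):

        # Wrap `ls -l` with a parser that yields structured records,
        # without touching coreutils itself.
        import json, subprocess

        def ls_structured(path="."):
            out = subprocess.run(["ls", "-l", path],
                                 capture_output=True, text=True).stdout
            records = []
            for line in out.splitlines()[1:]:        # skip the "total N" line
                parts = line.split(None, 8)
                if len(parts) == 9:
                    mode, _, owner, _, size, _, _, _, name = parts
                    records.append({"mode": mode, "owner": owner,
                                    "size": int(size), "name": name})
            return records

        print(json.dumps(ls_structured(), indent=2))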

        1. 9

          That’s what the plan is for Oil. I’ve written a lot about parsing – not just because parsing shell itself is a problem, but because I want the shell language to be good at parsing.

          Parsing is the complement to using unstructured byte streams everywhere. Right now shell’s notion of parsing is $IFS and a subpar regex implementation, which can easily be improved.


          Or rather, the plan is “both”, in the typical heterogeneous style of shell :) I do think some commands need structured variants, e.g. find and xargs, because they’re so central to the shell:

          https://github.com/oilshell/oil/issues/85 (help still wanted on ‘find’ if anyone’s interested)

          But you shouldn’t be required to use new commands – scraping coreutils is also valid, and should be easier.

          There are some more thoughts about structured data in Oil on Zulip:

          https://oilshell.zulipchat.com/#narrow/stream/121540-oil-discuss/topic/Structured.20Data (requires login, e.g. Github)

          Here’s the canonical post about not “boiling the ocean”, e.g. imagining that every tool like git will add structured data:

          Git Log in HTML: A Harder Problem and A Safe Solution

          1. 2

            I’m surprised that most shells after Bash support scripting. I’d love to see a shell optimized for interactive use instead.

            1. 5

              Well, fish is intended to be used interactively? I use it as my daily driver and it’s pretty good at smoothing some of the rough edges.

              1. 2

                What makes fish better for interactive use than other shells?

                1. 3

                  Honestly? Not much, as far as I can tell. It has a nicely orthogonal syntax, and just in general is deeply more sane than POSIX shells.

          2. 4

            I’ve tried to make something in that direction lately: https://github.com/sustrik/uxy

            1. 3

              That looks pretty nice, and close to what I was thinking about for Oil.

              I was thinking of calling it “TSV2”, which is basically TSV where if a field starts with " then it’s interpreted as a JSON string. Every language has a JSON library, so you can reuse that part.

              Although now that I think about it, what @burntsushi said in this thread is right – JSON strings are problematic since they can’t represent arbitrary byte strings, only unicode code points.

              I was also thinking of requiring tabs, which would have the nice property that you could validate the structure simply by counting tabs in a row, and you don’t have to parse the fields within a row. On the emitting side, tabs are easy to emit. The “align” and “trim” commands you implemented would still work for human readable output, of course.

              I was also thinking of optional JSON types in the header, with the default being string, like:

              name  num_bytes:number
              foo.c   100234
              foo.h   1234
              

              This solves a common problem with converting CSVs to data frames in R and pandas: the built-in read.csv()-style functions try to guess the type of each column, and of course they sometimes guess wrong!
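
              To illustrate, here’s a hypothetical Python sketch of a reader combining the TSV2 ideas above: tabs as the only delimiter, fields starting with " decoded as JSON strings, and an optional :number type suffix in the header (default string):

              import json

              def parse_tsv2(text):
                  lines = text.rstrip("\n").split("\n")
                  header = [col.split(":") + ["string"] for col in lines[0].split("\t")]
                  rows = []
                  for line in lines[1:]:
                      row = {}
                      for (name, typ, *_), field in zip(header, line.split("\t")):
                          value = json.loads(field) if field.startswith('"') else field
                          row[name] = float(value) if typ == "number" else value
                      rows.append(row)
                  return rows

              doc = "name\tnum_bytes:number\nfoo.c\t100234\nfoo.h\t1234\n"
              print(parse_tsv2(doc))
              # [{'name': 'foo.c', 'num_bytes': 100234.0}, {'name': 'foo.h', 'num_bytes': 1234.0}]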


              A few people e-mailed me saying they want Oil to have structured data, and we started a discussion on https://oilshell.zulipchat.com/ . Feel free to join if you’re interested! I laid out 3 concrete examples for structured data: the git log thing, my releases HTML page (which is really generated from a shell script), and Oil benchmarks, which use a combination of shell and R code, particularly xargs.

              I think there needs to be some support in two places – in the shell itself, and in external tools. I also think xargs needs support for “rows”, which I currently fake with xargs -n 7 for a row of 7 columns.
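
              For what it’s worth, the “rows” idea is just fixed-width grouping of a flat field stream; a tiny hypothetical sketch:

              def rows(fields, width):
                  it = iter(fields)
                  return list(zip(*[it] * width))   # note: drops a partial final row

              fields = ["foo.c", "100234", "rw-", "foo.h", "1234", "rw-"]
              for row in rows(fields, 3):
                  print("\t".join(row))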

            2. 3

              I suggest someone try following what separation kernels do: just describe the data format, with a secure parser and plumbing autogenerated by a standardized tool. CAmkES is a field-deployed example. Cap’n Proto might be ported to this use case for its balance of high speed and security.

            3. 9

              TSV? What happens if your file names have tabs in them?

              (in case you’re curious, the only characters that are banned in Linux file names are null bytes and forward slashes; if there was one thing I’d change in Linux, it would be to ban newlines in file names)

              1. 8

                The understanding I gained was that the basic UNIX tools we compose together using the shell should be more regular and machine-friendly. I think UNIX command line tools should have a completely different system of command line processing, and instead of using plain text as the universal interface they could use something like tab-separated values or ASCII-delimited values. Being able to do interchange between programs with a well-specified structured data format is something very valuable and helpful. Instead we build ad-hoc parsers with awk and sed and such on a daily basis.

                Absolutely. “Everything is a stream of bytes” is an incredibly powerful paradigm whose usefulness has been proven every day by thousands of people (if not millions) for 30+ years.

                However there are other paradigms we should be exploring with UNIX.

                IMO the Object Pipelines paradigm that PowerShell uses is one such example: it passes rich objects around, with properties you can inspect, filter on, etc.

                Another example I’d love to see carried forward is the ARexx/AppleScript/PowerShell example of being able to control applications at runtime from scripts. Many UNIX desktops have some handwave at this: KDE has Kross, and GNOME 3 is doing similar things with JavaScript, though I have yet to identify the analogous technology by name.

                1. 5

                    Makes me think of ninja? OK, ninja is more complex, since it provides more control over parallelism and supports things like compiler-generated dependency files.

                  1. 5

                    why ninja? why not samurai ;)

                  2. 5

                    If we spent the time to design a new structured data interchange system for command line tools instead of the fuzzy idea of “lines of text” I think the daily plumbing work of scripting could be much much smoother and easier.

                    I agree with this point. Actually, we don’t even need to design or define a new data interchange system. If command line tools could just have a user-defined output format, that would be great. It makes it easy to define custom, easy-to-parse output for a specific need. For instance, the great tool mediainfo allows this with the --Inform parameter. Calling it from another program to get specific info is a piece of cake.
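
                    Concretely, something like this Python sketch (the %Duration% field is from mediainfo’s documented template syntax; the file name is made up) gets exactly one value back, no scraping needed:

                    import subprocess

                    # --Inform takes a "Section;template" string and prints only
                    # the requested fields, one easy-to-parse line.
                    duration_ms = subprocess.run(
                        ["mediainfo", "--Inform=General;%Duration%", "video.mkv"],
                        capture_output=True, text=True,
                    ).stdout.strip()
                    print(duration_ms)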

                    1. 5

                      It took these people four years to fix that “Build failure when builddir contains spaces” issue.

                      Sadly it looks like they never actually fixed it:

                      Closing. It’s been discussed in other issues and it’s not going to change; gyp itself doesn’t support blanks in paths.

                      What an absolute failure of a response. If their project can’t handle spaces in paths, that’s an utter failure in my opinion. For them to just close it like this is galling.

                      https://github.com/nodejs/node-gyp/issues/65

                      1. 4

                        Perhaps you might want to contribute to https://shellcomp.github.io/

                        1. 3

                          Interesting, that’s basically the same idea I’ve been trying to convince people of:

                          https://lobste.rs/s/eqchm4/shell_completions_pure_rust#c_xibn8r

                          https://github.com/oilshell/oil/wiki/Shellac-Protocol-Proposal

                          However unfortunately it appears dormant and hasn’t gotten very far. I wish someone else would define and “prove” a protocol so I could implement it in Oil :-/

                          https://github.com/shellcomp/shellcomp.github.io

                        2. 3

                          [Make] tries to do everything: for example it includes an incompatible reimplementation of shell…

                          Where is this? I’m not aware of Make (in any form) reimplementing a shell.

                          The kind of ratio here: 80 lines [tests] vs 17195 lines [DejaGNU] for two programs that achieve the same task is something that I’ve discovered is completely normal.

                          DejaGNU is certainly a mess and you shouldn’t use it, but the 80 line script presented is not at all equivalent to what DejaGNU can do. (Maybe it achieves the same task you’re looking to accomplish, in which case, great. If your needs start to expand—and they inevitably do—that 80 line script is going to get pretty hairy.)

                          Aiming for simplicity is a good idea and a worthy goal. That said, software gets complex for a reason: once it’s out there, people build things on top of it and are reluctant (or outright refuse) to accept changes to its interfaces. Any design or implementation problems that don’t show up for a while become features, not bugs. So if you change it for any reason, your downstream consumers will scream bloody murder.

                          An example: something we ship at my job went out with a name that was wrong/misleading. After the fact there was an effort to correct it, but those who were using it said the change broke a bunch of their scripts and asked us not to make it. So it still has that name that basically makes no sense, and there’s a lot of cruft in the code to deal with it. And that’s a simple example.

                          Compilers have a plethora of options and settings for similar reasons.

                          1. 2
                            1. A lot of UNIX tools do outright use structured data, and specifically TSV. ‘Plain Text’ doesn’t mean that there is no structure to the output!

                            2. Make is fundamentally broken, but Hume already built a replacement (http://doc.cat-v.org/bell_labs/mk/). Plan 9’s mk(1) is much improved. It’s available for general use in the plan9port utils (also there’s a Go version somewhere). One of the really cool features (which might have been backported to make) is that you can write the recipes in any programming language you want, and tell it to use that interpreter. There are a number of other really nice features, and renaming $* to actual names like $target and $prereq makes a lot of difference for intelligibility.

                            3. I just don’t really know the best way to connect with others who share the same kind of aesthetics and focus in software. As far as I know there isn’t a group for this anywhere.

                            I know a number of people with these aesthetics? Trying to create a group is like trying to herd cats, ha :)

                            At some point it might be worth setting up an IRC channel or a mailing list; feel free to throw me a message on here, and maybe with enough people we can coördinate something?