nice, i wrote a suite of tools that interoperate with this exact file format description a few years ago. i think the csv to tsv conversions might replace tabs with spaces instead of erroring but otherwise this is exactly that
https://github.com/jtolds/tsv-tools
Very nice! I added a link to it from my doc
rain1, so, this solves the problem of ‘literal tabs’ and ‘literal newlines’ by simply forbidding them?
I will consider your proposal.
It can’t be my one and only tabular data format, obviously.
Or, could it? I could embed ‘\t’ and ‘\n’ (two-char sequences) locally, without tool support.
That’s correct:
The intention is that it would be used in cases where newline and tab do not occur.
It is of course possible to use \-escaping to include tabs and newlines, but in these cases one might prefer a different file format.
My main line of thought at the moment is whether UNIX tool output “should” emit data like this (and just break things when filenames have tabs, newlines) or escaped data.
Ah, yes. I agree that TSV-format data emitted on stdout should not contain tabs or newlines within the contents of fields.
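To make the "no tabs or newlines inside fields" rule concrete, here is a minimal Python sketch of a writer that refuses such fields instead of escaping them. The names `TsvFieldError` and `format_row` are made up for illustration and are not taken from any of the tools mentioned in this thread.

```python
class TsvFieldError(ValueError):
    """Raised when a field contains a tab or a newline (hypothetical name)."""


def format_row(fields):
    """Join string fields with tabs, erroring out on forbidden characters."""
    for index, field in enumerate(fields):
        if "\t" in field or "\n" in field:
            raise TsvFieldError(
                f"field {index} contains a tab or newline: {field!r}"
            )
    return "\t".join(fields)


if __name__ == "__main__":
    print(format_row(["report.txt", "1024", "rain1"]))
    try:
        format_row(["bad\tname.txt", "12"])
    except TsvFieldError as err:
        print("rejected:", err)
```

The point of raising rather than silently rewriting is that whatever reads the output can keep splitting on a literal tab without any unescaping step.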
My use-case here is different, kinda: SELECTing the contents of database tables, loading it somewhere, sharing it, doing some kind of editing or transformation, sending it back to the DB, etc.
Transformations frequently consist of what is sometimes called meta-programming, but I’m more likely to call text manipulation. I’ll get a bunch of data into a table (TSV) and then use sed, or emacs macros, or whatever fits for the task at hand, to alter it. For example: suffix “_old” onto all the values in the third column, wrap the non-null values in single quotes, replace the tabs with commas, and prepend “INSERT INTO …” on every line. Now my tabular data, which may have come from a spreadsheet or a previous SELECT, is a script, ready to run. And it doesn’t matter if there were ten lines or ten thousand, my expenditure of effort was the same.
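A rough sketch of that TSV-to-script transformation, in Python rather than sed or emacs macros. Several things here are assumptions and not from the post: the function name `tsv_to_inserts`, the table name in the usage lines, the convention that nulls appear as the literal text NULL, leaving a null third column unsuffixed, and doubling embedded single quotes so the generated SQL stays valid.

```python
def tsv_to_inserts(lines, table, null_marker="NULL"):
    """Turn TSV lines into INSERT statements (illustrative sketch only)."""
    statements = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Suffix "_old" onto the third column, unless it is null (assumption).
        if len(fields) >= 3 and fields[2] != null_marker:
            fields[2] += "_old"
        # Quote non-null values; doubling embedded quotes is an added safeguard.
        values = [
            field if field == null_marker
            else "'" + field.replace("'", "''") + "'"
            for field in fields
        ]
        # Tabs become commas, and the INSERT prefix is prepended.
        statements.append(f"INSERT INTO {table} VALUES ({','.join(values)});")
    return statements


if __name__ == "__main__":
    rows = ["1\talice\tactive\n", "2\tbob\tNULL\n"]
    # "users_backup" is a hypothetical table name.
    for statement in tsv_to_inserts(rows, "users_backup"):
        print(statement)
```

This only works as a plain `split("\t")` because the fields are assumed to contain no literal tabs or newlines.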
That’s all the same, in principle, as manipulating the output of unix utilities. The difference is, I don’t get to choose whether tabs and newlines are included in the contents of fields. They are, usually. :(
One of the issues is that we need escaping since our fields can have tab and newline. The answer would be to \-escape output. But then we need to search for \t instead of a tab character inside sed. So either our sed scripts get a bit more complex, or we add an option for sed to handle unescaping, running the regex, and then re-escaping the result.
Yep, and naive escaping will cause unexpected results if/when \t or \n exist. So then, you have to decide, are you going to escape all instances of \, or just those followed by t or n? Either is valid, so long as the escaping mechanism’s decision matches the un-escaping mechanism’s decision… :) I get it.
Simply forbidding (via throwing an error) the presence of tab or newlines in the field contents is novel to me. I like it.
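For comparison with forbidding, here is a minimal Python sketch of the escaping route discussed above, taking the "escape every backslash" option so that escaping and un-escaping agree. The names `escape_field` and `unescape_field` are hypothetical, and this is one possible convention rather than a standard.

```python
def escape_field(text):
    """Escape backslash first, then tab and newline, so output has neither."""
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
    )


def unescape_field(text):
    """Reverse escape_field; reject sequences escape_field never produces."""
    mapping = {"\\": "\\", "t": "\t", "n": "\n"}
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text) or text[i + 1] not in mapping:
            raise ValueError(f"invalid escape at position {i} in {text!r}")
        out.append(mapping[text[i + 1]])
        i += 2
    return "".join(out)


if __name__ == "__main__":
    original = "C:\\temp\tnotes\nline two"
    escaped = escape_field(original)
    print(escaped)
    print(unescape_field(escaped) == original)
```

Because every backslash is escaped, a field that already contains the two characters backslash and t survives the round trip, which is the case where naive escaping goes wrong.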