You might be interested in xsv (https://github.com/BurntSushi/xsv), a fast CSV command line toolkit written in Rust :)
xsv is so good that I’ve written mission-critical analysis software around it. It’s handled just about everything that I’ve thrown at it, sometimes with some options tweaking or the rare iconv preprocessing since xsv doesn’t handle invalid UTF-8 input.

Edit: BurntSushi reached out to me on Twitter to say that xsv does handle invalid UTF-8, so I’m wrong or I’m misremembering something about building a pipeline that eventually required an iconv filter before passing to the rest, most of which was xsv but there were some other programs in there, too.
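For what it’s worth, that kind of pipeline is a one-liner; this is only a sketch, assuming Latin-1 input and a hypothetical raw.csv with a name column:

$ iconv -f latin1 -t utf-8 -c raw.csv | xsv select name | head

iconv’s -c drops bytes it cannot convert, and xsv select reads from stdin when no file is given.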
As a warning I’d like to point out that using awk and jq splitting on just , will subtly break if you have strings in your CSV files that contain that character as part of the string. Good stuff though!

That spreadsheet programs decided to call what they do “CSV” has caused no end of confusion since the expansion of the acronym literally contradicts the format { unless you expand it “Commas (don’t actually) Separate (all possible) Values” ;) }. I would not be shocked to hear that internally they wished they could’ve used “EQCSV” but 1970s era DOS 8.3 filename limits motivated “CSV” instead.
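On that comma-splitting warning: GNU awk’s FPAT offers a middle ground by describing what a field looks like rather than what separates fields, so quoted fields containing commas survive (embedded newlines and escaped quotes still won’t). A sketch, assuming gawk and a file test.csv:

$ gawk -v FPAT='([^,]+)|("[^"]+")' '{print $1}' test.csv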
The best approach here is to have a “translator program” convert escape-quoted CSV into values “actually separated” by delimiter bytes: https://github.com/c-blake/nio/blob/main/utils/c2tsv.nim or its C port is one example. As part of a pipeline you likely get a parallel speed-up. Arbitrary data fields can be handled via the conventions for lossless conversion to TSV: https://en.wikipedia.org/wiki/Tab-separated_values#Conventions_for_lossless_conversion_to_TSV. Anyway, then you can use ideas from the article without as much worry.
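A sketch of how such a translator slots into a pipeline; the exact c2tsv invocation is an assumption here (a stdin-to-stdout filter), so check the tool’s own docs:

$ c2tsv < data.csv | awk -F'\t' '{print $1}'

Once tabs and newlines inside fields are escaped per the lossless-TSV convention, naive splitting on tab is safe again.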
Once you’re confident that an ASCII newline only ever terminates records, and if you have big files & enough space (temporarily, anyway) for a copy, you can also segment by “nearest ASCII newline to 1/N” and parallelize processing on it.
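With GNU coreutils that can look like the sketch below; split -n l/4 does the “nearest newline” segmentation (it never splits mid-line), and wc -l stands in for the real per-chunk work. Assumes big.csv has no newlines embedded inside fields:

$ split -n l/4 big.csv chunk_
$ for f in chunk_*; do wc -l "$f" & done; wait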
I have taken to using ASCII characters 28-31 to build any “CSV” files I create. It eliminates any delimiter collision and, being ASCII, is nicely handled by most tools.
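A sketch of writing and reading such a file from a shell, using bash’s $'...' quoting for the unit (0x1f) and record (0x1e) separators; the file and field names are made up:

$ printf 'name\x1fage\x1eJoe\x1f33\x1eMarry, PhD\x1f41\x1e' > data.usv
$ awk -v FS=$'\x1f' -v RS=$'\x1e' '{print $1}' data.usv
name
Joe
Marry, PhD

The comma inside "Marry, PhD" needs no quoting at all, which is the whole point.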
It is historically odd that people don’t use those characters more. It’s not like they’re in the 128-255 range and can’t be used or something. They’re right there, universally available, and designed for specifically this purpose. The only problem with them is they can’t encode nested records/arbitrary data, but no one wants that anyway.
Another downside is that if you’re producing a CSV for someone else, you’ll have to explain to them that such delimiters exist and how to make use of them if they’re opening the file in Excel.
I guess the answer to the immediate historical question is that Excel supports CSV, so CSV is popular. But the deeper question remains: Why did Excel use CSV instead of the actual delimiters that were designed for exactly this purpose? Then again Excel was clearly, with all due respect, designed by amateurs, who for example don’t know how leap years work, so it wouldn’t be out of character for them to just not be aware of the ASCII separator fields either.
Because those delimiters aren’t on keyboards. Spreadsheet programs before Excel used CSV too.
Excel’s leap year bug is intentional because it’s supposed to be 100% backwards compatible with Lotus, and that includes having the exact same logic bugs.
This is precisely the problem. If I output a CSV using these symbols in one of my projects, then I also have to provide handy copy/paste symbols in the README and hope someone comes back to find that instead of giving up as soon as it looks funky when they open it. These are great for my personal use, but I don’t consider them appropriate for general consumption if I want my code to be used by many.
csvkit is much better for handling CSV files. If you try to use awk -F, to parse CSVs, somebody will throw a field containing a comma at you and you’re in for a world of hurt.
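A small sketch of the csvkit way, assuming a data.csv with a name column:

$ csvcut -c name data.csv
$ csvgrep -c name -m Joe data.csv | csvlook

csvcut and csvgrep parse the quoting properly, so commas and newlines inside fields survive.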
Miller (mlr) is a great tool for working with CSV: https://miller.readthedocs.io/en/latest/
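For instance (a sketch, again assuming a data.csv with a name column):

$ mlr --icsv --opprint cat data.csv
$ mlr --csv cut -f name data.csv

The first pretty-prints the CSV as an aligned table, the second picks a column by name.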
SQLite is great for handling CSV too
I always forget that it has CSV import…
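A sketch of the import flow in the sqlite3 shell; the file and table names are placeholders:

$ sqlite3 demo.db
sqlite> .mode csv
sqlite> .import data.csv people
sqlite> SELECT name FROM people WHERE name LIKE 'J%';

.import creates the table from the header row if it doesn’t already exist, and the CSV parser handles quoted fields.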
CSV is quite a nice and simple data format, but it is not as simple as some people expect. For example, this is valid CSV (RFC 4180):
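A sample consistent with the awk output below would be something like the following (a reconstruction, not necessarily the author’s original test.csv):

name,title
Joe,student
"Weird
Freak",artist
Jenny,nurse
"Marry, PhD",doctor

Per RFC 4180 each quoted field, commas and embedded line break included, is a single value.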
Parsing it using the classic text tools fails because such tools do not understand the particular format (see also the Classic pipeline example). We get e.g. a single value split into two values (Weird + Freak) or only a part of a single value (missing “, PhD”):

$ awk -F, '{print $1}' test.csv
name
Joe
"Weird
Freak"
Jenny
"Marry

We also get quotes in values and scrambled content:
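A hedged illustration using the reconstructed test.csv above, printing the second column:

$ awk -F, '{print $2}' test.csv
title
student

artist
nurse
 PhD"

The stray quote and leading space on the last line are fragments of the single value "Marry, PhD", and the blank line comes from the physical line "Weird, which contains no comma at all.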
CSV also uses CRLF line ends, not only LF, which these tools also do not understand, so they pollute the output with \r (escaped) or 0x0d bytes.

This „simple“ approach gives quite random and unreliable results. The reliable solution is to use software that has a parser for the given format (even for formats that might look „simple“, „plain-text“ and „obvious“). Many programs have such a parser.
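A tiny demonstration of the \r pollution; printf stands in for a real CRLF-terminated file:

$ printf 'name,title\r\nJoe,student\r\n' | awk -F, '{print $2}' | od -c
0000000   t   i   t   l   e  \r  \n   s   t   u   d   e   n   t  \r  \n
0000020

The \r stays glued to the last field unless the tool strips it explicitly.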
Such issues are also one of the reasons why Relational pipes are being developed.
A few examples:
Read a CSV and display it as a table in console:
Filter only the first attribute (name):
Filter records by value:
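As a rough point of comparison with the tools mentioned upthread, the same three operations with xsv might look like this (a sketch, assuming a data.csv with a name column):

$ xsv table data.csv
$ xsv select name data.csv
$ xsv search -s name '^Joe$' data.csv

(table aligns columns for display, select picks columns by name, and search filters rows by a regex restricted to the selected column.)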