1. 20

  2. 6

    The best way to do this is to convert from awkward “Microsoft-ish quoted CSV” to “strict TSV” as a pipeline preprocessing stage (which can also run in parallel with the pipe consumer; a sketch is below). This can be a ~100 line program in a fast language like Nim, with run time similar to a less cautious "tr , \\t". You can always save the converted data if you need to repeat any processing a lot.

    Parenthetically, I think it is unfortunate that a purely informational RFC (explicitly not a standard) and/or UIs have allowed “CSV” to come to mean (in some circles, such as the text of the linked article) one very specific “quoted CSV” format. Classic “awk -F,”-style CSV, or maybe “simple split CSV”, is perfectly sensible already. If commas or newlines can occur in fields then the data is not really “separated”. So “CSV” is a contradiction, not a description.

    Anyway, in my opinion, given the unfortunate situation, people should now always qualify what kind of “CSV” they mean, e.g. “RFC4180 CSV” or at least “quoted CSV”. Awk handles “splittable CSV” out of the box. It is probably a pipe dream to hope clear terminology prevails though.
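
    To make that concrete, here is a minimal sketch of the pattern, with csv2tsv as a hypothetical name for the ~100-line converter described above (quoted CSV on stdin, strict TSV on stdout). The pipe already gives you the conversion running in parallel with awk, and tee keeps a copy so you only ever convert once:

    # csv2tsv is a hypothetical stand-in for the converter described above
    csv2tsv < input.csv | tee input.tsv | awk -F'\t' '{ print $3 }'
    # later runs skip the conversion entirely
    awk -F'\t' '{ print $3 }' input.tsv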

    1. 2

      The best way to do this is to convert from awkward “Microsoft-ish quoted CSV” to “strict TSV” as a pipeline preprocessing stage (which can also run in parallel from the pipe consumer).

      My “oh god, CSV again, come on” pipeline starts with:

      xsv fmt --quote-always -t (echo -en '\x1C') input.csv

      using @burntsushi’s wonderful xsv. You have to deal with more ceremony around quoting, but at least it is unambiguous.
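
      For anyone consuming that downstream in awk, a rough sketch (using $( ) command substitution and a portable octal escape for the \x1C byte); the gsub is the extra quoting ceremony, since --quote-always leaves literal quotes on every field as far as awk is concerned:

      xsv fmt --quote-always -t "$(printf '\034')" input.csv \
        | awk -F '\034' '{ gsub(/^"|"$/, "", $2); print $2 }'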

      It is probably a pipe dream to hope clear terminology prevails though.

      I once specified CSV as an input format for a file from customers to try to be accommodating. It didn’t go well. I think I’ve had better luck with fixed-length files.

      1. 1

        It may vary with your data, but I just timed c2tsv as ~3X faster than xsv at selecting all columns and reformatting that World Cities Population test file.

        For awk, as per the article, unquoted fields with “out of the way” delimiters are probably more convenient than --quote-always, no?
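
        Quick sketch of that convenience, assuming c2tsv is a plain stdin-to-stdout filter (quoted CSV in, strict TSV out): no quote-stripping is needed, unlike with the --quote-always output upthread.

        # assumes c2tsv filters stdin to stdout, as described above
        c2tsv < input.csv | awk -F'\t' '{ print $2 }'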

        Anyway, it’s best to transform to the most CPU/prog-lang-friendly format possible, and best to do that only once. For compiled languages, that is binary, but c2tsv can be a useful preprocessor for a later-stage parse to binary. That is why it is in that nio package. You can almost think of such loaders as “compiling your data”. Then you just want good “debuggers” like nio print to debug your data. (EDIT: but, of course, this is “engineering”, not “customer management”, which is, as you suggest, a whole other can of worms.)

    2. 3

      I wrote this, thanks for sharing. csvquote is a handy little tool that uses a handy little trick of replacing the commas and newlines inside a quoted section with control characters that can later be reversed back out.

      It probably has use in formats besides CSV files.
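
      The usual round trip, if I remember the interface right, is csvquote as a sanitizing filter in front and csvquote -u to restore afterwards:

      # inside quoted fields, commas become \x1f (unit sep) and newlines \x1e (record sep),
      # so the naive -F, split is safe; -u swaps them back in whatever survives the pipeline
      csvquote input.csv | awk -F, '{ print $3 }' | csvquote -u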

      1. 1

        This is really clever - using the ASCII “record separator” and “unit separator” characters to escape \n and ‘,’. They’re both < 128, and bytes below 128 never appear inside UTF-8 multi-byte sequences, so this is UTF-8 safe.

        1. 3

          There are a whole bunch of unprintables < 32. The NUL byte (0) is not bad as a field separator, and emacs RMAIL has long used ASCII 1 as a delimiter, etc. I am usually a bit surprised when people choose ASCII chars whose names actually suggest their roles.

          Why, if you terminate all fields with \0 instead of “separating/delimiting” them with some final record separator, then you could even mmap the file read-only and random-access any field as a C string with no allocation (well, the kernel allocates pages for you upon mmap). Of course, you would need a (not bad to build, very fast to access) index (to rows and/or rows & cols) to make that format useful.

      2. 2

        I definitely enjoyed using tools from https://csvkit.readthedocs.io/en/latest/ in the past and would recommend them for all sorts of scenarios, including pre-processing before passing data to awk.

        1. 2

          There are many particular solutions and hacks, but for me the good approach is to use a common data model and separate 1) input filters, 2) transformations, and 3) output filters. Thanks to the common data model, you can reuse and combine the tools/filters in many ways. Maybe I should translate these to English: Předpověď počasí v terminálu (“Weather forecast in the terminal”) & Relační roury (“Relational pipes”); however, the pictures and machine translation should give you the idea.
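
          A minimal shell sketch of that split, taking strict TSV as the common data model; csvformat -T is the CSV-to-tab converter from the csvkit mentioned above (if I recall its flag correctly), and data.csv plus the filter condition are just placeholders:

          csvformat -T data.csv |                # 1) input filter: quoted CSV (placeholder data.csv) -> the common model (TSV)
            awk -F'\t' 'NR == 1 || $3 > 20' |    # 2) transformation: keep the header plus matching rows
            column -t -s "$(printf '\t')"        # 3) output filter: align columns for the terminal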