I’m wondering why people forgot about the unit/record separators in ASCII. They’re not as human readable/writeable, but they’re certainly less fragile.
In case you’re wondering what these are: ASCII codes 0x1F (“unit separator”, for fields) and 0x1E (“record separator”, for rows) were reserved for delimiting structured data, CSV-style tabular output included. The idea was that those codes, being non-printable control characters, would not appear in normal output and could therefore solve some of the reading problems in the OP, especially with regard to escaping delimiters.
It doesn’t solve the well-formedness problem, so you still have to parse in any case, instead of just blindly splitting on those values.
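To make that concrete, here is a minimal Python sketch, assuming fields are plain strings and every record should have the same number of fields. The names `dump` and `load` and the `width` parameter are made up for illustration. Splitting is trivial, but the reader still has to check what it got back.

```python
US = "\x1f"  # unit separator: goes between fields
RS = "\x1e"  # record separator: goes between rows


def dump(rows):
    """Join rows with RS and fields with US, refusing fields that contain either."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(row) for row in rows)


def load(text, width):
    """Split on the separators, then validate the shape instead of trusting it."""
    rows = [record.split(US) for record in text.split(RS)]
    for i, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"record {i} has {len(row)} fields, expected {width}")
    return rows


# Commas, quotes, and newlines need no escaping at all.
rows = [["id", "comment"], ["1", 'says "hi", then leaves\nfor good']]
encoded = dump(rows)
decoded = load(encoded, width=2)
```

The width check is the "you still have to parse" part: a truncated or corrupted file splits without complaint but produces ragged rows.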
If you presume all data is human readable, then you can try to blindly split. Even if not, you can reserve a character for escaping and simply double it when you need the character itself.
You don’t need to reserve a character for that. DLE is defined as “Data Link Escape”.
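A rough sketch of what DLE-based escaping could look like in Python. This is one possible convention, not something any standard prescribes for tabular data: a DLE (0x10) in front of a DLE, unit separator, or record separator means "take the next character literally". The function names are hypothetical.

```python
DLE, US, RS = "\x10", "\x1f", "\x1e"


def escape(field):
    # Prefix every DLE, US, or RS inside a field with DLE.
    return "".join(DLE + ch if ch in (DLE, US, RS) else ch for ch in field)


def encode(rows):
    return RS.join(US.join(escape(f) for f in row) for row in rows)


def decode(text):
    rows, row, field = [], [], []
    chars = iter(text)
    for ch in chars:
        if ch == DLE:
            # The character after DLE is data, whatever it is.
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling DLE at end of input")
            field.append(nxt)
        elif ch == US:
            row.append("".join(field))
            field = []
        elif ch == RS:
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
    # Flush the final field and record (no trailing RS is written by encode).
    row.append("".join(field))
    rows.append(row)
    return rows


tricky = [["a\x1fb", "c\x10d"], ["e\x1ef", ""]]
round_tripped = decode(encode(tricky))
```

Note that once escaping is in play, a blind `split` no longer works; the decoder has to walk the input character by character.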
I occasionally remember they exist, but they don’t really fit a sweet spot for me personally (neither does CSV, though). What I use is one of two things:
For simple tabular data I use TSV (tab-separated values). This makes things human-readable and also makes it easy to use traditional Unix command-line tools, and has the sole restriction that fields themselves cannot have embedded tabs (in TSV there’s no way to escape a tab, they’re simply not allowed in fields). I could use the ASCII record separators here, but it’d be more awkward to read/write the files, for the only advantage of being able to embed tabs in fields, which is something I rarely actually want.
For anything more complicated than basic tabular data where TSV is fine, then I go to a more full-fledged serialization format like XML or JSON, with well-defined escaping rules and support for things like hierarchical data.
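As a small illustration of the "well-defined escaping rules" point, here is a Python sketch using the standard `json` module, with one record per line. The record contents are invented for the example.

```python
import json

# A record with a tab, a comma, quotes, a newline, and a nested list.
record = {
    "id": 1,
    "note": 'tab\there, comma, "quotes"\nnewline',
    "tags": ["a", "b"],
}

# json.dumps escapes control characters, so the encoded record stays on one line.
line = json.dumps(record)
restored = json.loads(line)
```

Because the serializer escapes the tab and newline, the encoded text can be stored one record per line without any restriction on field contents.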
I don’t want to write my own CSV code, but I have to work with CSV producers who wrote their own. A parser that handles quoted text containing commas correctly doesn’t help if the producer doesn’t quote their commas.
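A quick sketch of that failure mode with Python's standard `csv` module. The sample data is invented. The parser does the right thing in both cases; the second input is simply ambiguous by the time it arrives.

```python
import csv
import io

# Producer quoted the embedded comma correctly.
good = 'name,city\n"Smith, Jane",Boston\n'
# Home-grown producer that never quotes anything.
bad = 'name,city\nSmith, Jane,Boston\n'


def parse(text):
    return list(csv.reader(io.StringIO(text)))


good_rows = parse(good)
bad_rows = parse(bad)

# About the best a consumer can do is notice that the row width is off.
header_width = len(bad_rows[0])
ragged = [row for row in bad_rows if len(row) != header_width]
```

Here the unquoted row comes back with three fields instead of two, and nothing in the data says which comma was the real delimiter.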
I’ve encountered errors like the ones mentioned here and I’ve never even rolled my own CSV code before.
It’s actually a pretty terrible “standard.”
There is an actual CSV standard, namely RFC 4180, so it doesn’t need the scare quotes. Whether producers and consumers follow the standard is a different matter.
Writing your own CSV parser isn’t that hard, though. This post pretty much tells you all you need to know; it’s extremely straightforward to write a parser that handles all of it. If you’ve ever written a basic lexer for a programming language with strings, you’ve done more than a CSV parser.
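For a sense of scale, here is a minimal Python sketch of such a parser, assuming RFC 4180-style rules: comma delimiter, double-quoted fields, doubled quotes as the escape, and CR, LF, or CRLF line endings. The name `parse_csv` is made up, and it is a sketch rather than a complete implementation (a file consisting of a single empty quoted field is dropped, for instance).

```python
def parse_csv(text):
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and newlines are data here
        elif ch == '"':
            if field:
                raise ValueError(f"quote inside unquoted field at offset {i}")
            in_quotes = True
        elif ch == ",":
            row.append("".join(field))
            field = []
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single line ending
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    if field or row:
        # Last record had no trailing newline.
        row.append("".join(field))
        rows.append(row)
    return rows


sample = 'name,note\r\n"Smith, Jane","said ""hi""\nthen left"\n'
parsed = parse_csv(sample)
```

Everything hard about this code is in the two `raise` lines: deciding what to do when the input breaks the rules.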
It’s straightforward if your users are cool with your parser barfing on their malformed inputs. Lexer users expect it to barf. Not CSV users.