1. 37

Sadly, there isn’t a “Nice things we chose not to have and then bodged together later” tag. :(

Previously submitted by @tedu a while ago.

  1. [Comment removed by author]

    1. 5

      ^V28 for me. C-q for Emacs folks, and the number varies based on what they set read-quoted-char-radix to.

      But why type this stuff?

      1. 3

        Alternatively, ^V^_ and ^V^^ in Vim. Don’t forget that all ASCII control codes can be typed with ctrl.

    2. 6

      A perverse benefit of comma and tab being common is that you’re quickly forced to deal with quoting, escaping, or encoding them. Without one, this will blow up somewhere down the line when a multi-byte Unicode codepoint includes one of these bytes.

      1. 5

        Without one, this will blow up somewhere down the line when a multi-byte Unicode codepoint includes one of these bytes.

        AIUI, UTF-8 won’t trip this since it is ASCII compatible. That is, any ASCII byte in valid UTF-8 encoded bytes corresponds to its ASCII codepoint, regardless of context. (All continuation bytes are at least 0x80 and all leading bytes for codepoints greater than U+007F are at least 0xC0, which makes the 0x1E and 0x1F separators safe.)

        UTF-16 or UTF-32 on the other hand…
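
        A quick Python sketch of why UTF-8 is safe here (the sample records are made up for illustration):

        ```python
        # Every byte below 0x80 in UTF-8 output is a literal ASCII character;
        # continuation and leading bytes of multi-byte sequences are all >= 0x80,
        # so splitting the encoded bytes on US (0x1F) can never cut a codepoint.
        records = ["naïve, text", "snowman \u2603", "tab\there"]
        blob = b"\x1f".join(s.encode("utf-8") for s in records)

        # Round-trips exactly: 0x1F never appears inside a multi-byte sequence.
        assert [b.decode("utf-8") for b in blob.split(b"\x1f")] == records
        ```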

        1. 3

          Parsing UTF-{16,32} byte-by-byte may trip on any ASCII character: e.g., U+012C LATIN CAPITAL LETTER I WITH BREVE and U+2C00 GLAGOLITIC CAPITAL LETTER AZU, as well as up to 510 other characters (in UTF-16; more in UTF-32), will clash with a comma.
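
          For instance, a minimal Python demonstration of the UTF-16 case:

          ```python
          # U+012C (LATIN CAPITAL LETTER I WITH BREVE) encoded as UTF-16BE
          # contains the byte 0x2C, which is an ASCII comma.
          data = "\u012c".encode("utf-16-be")
          assert data == b"\x01\x2c"
          assert b"," in data  # a byte-level CSV splitter would cut here
          ```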

          1. 2

            Parsing UTF-{16,32} byte by byte will not work regardless, making this case irrelevant.

            Even if you get lucky and the code handling it byte by byte doesn’t choke on half (UTF-16) to three quarters (UTF-32) of the bytes being NUL “string terminators”, it will still garble the output horribly when it hits the single byte being used as a separator and shifts the alignment of the rest of the text.

            If you’re treating commas in UTF-32 text as single bytes, you’ve already lost. You might as well be trying to parse a JPEG screenshot of a CSV.
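
            To make the alignment shift concrete, a small Python sketch:

            ```python
            # Splitting UTF-16LE bytes on the comma byte (0x2C) cuts a code
            # unit in half, misaligning everything after the split point.
            blob = "A\u012cB".encode("utf-16-le")  # b'A\x00,\x01B\x00'
            left, right = blob.split(b",", 1)
            print(left)  # b'A\x00' -- still decodes as "A"
            right.decode("utf-16-le")  # UnicodeDecodeError: truncated data
            ```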

            1. 1

              Parsing UTF-{16,32} byte by byte will not work regardless, making this case irrelevant.

              Yes, that was my point, and also that HT or comma has no advantage over FS/GS/RS/US in UTF-8 or any other Unicode encoding. Should’ve stated it more clearly, perhaps.

              1. 1

                Yes, that was my point, and also that HT or comma has no advantage over FS/GS/RS/US in UTF-8 or any other Unicode encoding. Should’ve stated it more clearly, perhaps.

                Just to be clear: UTF-8 does not have this problem, and parsing it byte by byte works fine. No ASCII byte can ever appear inside the encoding of any other UTF-8 character.

        2. 2

          Ugh, good point. :(

        3. 6

          Interesting! This is very pertinent to a thing I’m working on. I recently decided to use \x1F (31, Unit Separator) to encode a set of strings that could validly include commas, tabs, newlines, etc. into a single string. I could be pretty sure that \x1F wasn’t part of the data, or at least sanitize it out of the input. I was able to do this precisely because it isn’t a character a human could type or would be interested in. It’s a simple solution, works very elegantly, and there’s no dealing with all the flavors of quoted CSV.
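
          A minimal sketch of that approach in Python (the pack/unpack names and sample fields are mine, not from the original):

          ```python
          US = "\x1f"  # ASCII Unit Separator

          def pack(fields):
              # Sanitize stray separators out of the input, then join.
              return US.join(f.replace(US, "") for f in fields)

          def unpack(record):
              return record.split(US)

          row = pack(["Doe, Jane", "likes\ttabs", "multi\nline"])
          assert unpack(row) == ["Doe, Jane", "likes\ttabs", "multi\nline"]
          ```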

          1. 4

            and there’s no dealing with all the flavors of quoted CSV.

            Preach.

            I used to do quite a bit of “integration engineering” - which almost always means “take this input file (almost always CSV or tab-delimited) and do process X, line by line”. I’ve never seen a CSV that could be parsed the first time. I’ve tried yelling and pleading with people to use RFC 4180 because at least you have something to work against, but nothing.

            I suspect a big reason we see this happening is that generating a CSV is a low difficulty task, and Excel makes it easy to do manual/semi-automated work to the files. But if you ask for US/RS delimited files, or (heaven help us) a fixed-width file, you’ll end up with a programmer in the loop.
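
            For reference, RFC 4180-style quoting is what Python’s csv module emits by default:

            ```python
            import csv, io

            # Fields containing commas, quotes, or newlines are double-quoted,
            # with embedded double quotes doubled (RFC 4180 style).
            buf = io.StringIO()
            w = csv.writer(buf, lineterminator="\r\n")
            w.writerow(["plain", "has,comma", 'has "quotes"', "multi\nline"])
            print(buf.getvalue())
            # plain,"has,comma","has ""quotes""","multi
            # line"
            ```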

          2. 5

            So how do text editors display these characters? Space and newline, respectively? This sounds elegant, but it also makes human creation and editing of files troublesome. I think I prefer CSV because I can write it myself.

            1. 1

              I think the benefit accrues when you aren’t writing the files by hand. If you’re writing the file by hand you needn’t worry about escaping the delimiter because you just don’t use the delimiter in your data. But if you’re accepting input from somewhere else and generating the file programmatically, then you do need to worry about escaping the delimiter and, at the same time, you don’t need to worry about typing the characters.

            2. 1

              The tab-separated value format understood by Postgres and MySQL (a) handles newlines and tabs with escaping and (b) still ensures every row fits on one line.

              This is something I actually did a write up for, some months/years ago: http://dataprotocols.org/linear-tsv/#motivation
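
              The core of the format is just backslash-escaping the structural characters so every record stays on one physical line. A rough sketch (not the exact code from the write-up):

              ```python
              # Escape the characters that carry structure; everything else
              # passes through, so each record occupies one physical line.
              ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}

              def escape(field):
                  return "".join(ESCAPES.get(ch, ch) for ch in field)

              def format_row(fields):
                  return "\t".join(escape(f) for f in fields)

              row = format_row(["id", "note with\nnewline", "tab\there"])
              assert "\n" not in row  # the newline is now the two characters \n
              ```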

              1. 1

                The other ones which show up occasionally are STX, ETX, ENQ, ACK, NAK, etc. There’s a whole suite of things derived from old standards like ASTM-E-1381 which use these for sending ASCII messages over serial channels. Turns out that most of the stuff down at the bottom of ASCII had a good reason to exist…