    I haven’t yet spent much time on performance. I intend to profile it properly at some point, but for now, it’s “good enough”: on a par with using Go’s encoding/csv package directly, and much faster than the csv module in Python. It can process a gigabyte of complex CSV input in under five seconds on my laptop.
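
    For the curious, below is a minimal sketch of the kind of encoding/csv read loop being compared against (an illustration only, not GoAWK’s actual code):

    ```go
    // Minimal sketch of an encoding/csv read loop (illustration only).
    package main

    import (
        "encoding/csv"
        "fmt"
        "io"
        "strings"
    )

    // countFields reads every CSV record and tallies rows and fields.
    func countFields(r io.Reader) (rows, fields int, err error) {
        cr := csv.NewReader(r)
        cr.ReuseRecord = true // reuse the backing slice: fewer allocations per row
        for {
            rec, err := cr.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                return 0, 0, err
            }
            rows++
            fields += len(rec)
        }
        return rows, fields, nil
    }

    func main() {
        rows, fields, err := countFields(strings.NewReader("a,b,c\n1,2,3\n"))
        if err != nil {
            panic(err)
        }
        fmt.Println(rows, fields) // prints: 2 6
    }
    ```

    (ReuseRecord is the main allocation-saving knob when throughput matters.)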

    When did Go get faster than Python at CSV parsing? For a long time, Python was faster.

      I think I remember you saying that on another thread, and I was surprised (which is part of the reason I benchmarked against Python). I realize most of the good stuff in Python’s csv module is implemented in C, so I guess I shouldn’t be surprised either way. My guess is Python does more memory allocations, or the dynamic typing / object overhead is significant.

      In any case, the numbers in my benchmarks speak for themselves: Python takes 3x as long for both input and output. Feel free to try my benchmarks yourself – you’ll need a huge.csv file, as I haven’t checked it into the repo (or I can send you a link to the one I’m using if you want).

      I’d love to see numbers where Python was faster. It’s possible that was “back in the day”, when Go produced executables that ran 5-10x slower than they do today: https://benhoyt.com/writings/go-version-performance/

          Thanks! For what it’s worth, below is the best-of-3 time in seconds of Go’s encoding/csv.Reader using this script on a 200MB, 100,000-row CSV file with 286 columns:

          Go version,Time (s)

          The Go version the issue reporter was using was 1.7. CSV reading got significantly faster in Go 1.8 and 1.9, and then much faster again in Go 1.10. I think that’s mainly due to CL 72150, but I’m not sure. In any case, between encoding/csv and Go compiler improvements, Go’s CSV reading is 5-6 times as fast now as it was then.

            Wow! That’s a pretty remarkable improvement. It might be worth a blog post by itself. :-)

            The other traditionally “slow” package in Go was regexp. I remember my friend’s company was using a C-wrapper library to avoid the slow Go regexp package. I wonder what would happen if they retested it now.

              Feel free to write that – I’m blog-posted out for a bit. :-)

              Unfortunately regexp is still slow, though I did see a couple of CLs recently that might lead to some improvements, for example https://go-review.googlesource.com/c/go/+/355789

      Nice work, and the blog post was well written too. Good to know it was sponsored as well (by the University of Antwerp).

      For csv parsing, there’s also this stackoverflow thread: https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk

      Would the next step be support for xml, json, etc (like https://github.com/TomWright/dasel)? ;)

        I think there has been a bad trend of calling “RFC 4180 quoted CSV” just “CSV”, as if that were the “one true meaning”. Unquoted CSV is not “improper” if the data does not require more. RFC 4180 is pretty clear that it’s not some kind of “official standard” (like some other RFCs are), but just an attempt to wrangle one description from the chaos. When the ‘S’ actually means what it stands for, regular awk -F can work fine, as can the term “CSV”. Given the trend (& chaos), the way to go now may be to just always qualify: quoted CSV, RFC 4180 CSV, unquoted CSV, Excel CSV, etc.

        More substantively, in terms of benchmarking, the pipeline approach with csvquote that the article mentions (or c2tsv) may be less convenient, but it may also let you get better wall-clock time with multiple cores (depending on..lots). If you have space to save a processed copy, then you can avoid re-doing the extra work every time. Also, once it is in a well-delimited random access file, you can do non-indexed parallel processing on a big file just by using random access to skip to the next newline, like this.

        There may be an awk optimization to be had to check if stdin is mmapable to automate this. You’re basically on the road to a map-reduce mechanism wired into awk..somehow, probably with some built-in aggregator/reducers functions. And, of course, disk head seek self-competition can be a problem, but flash storage (or big enough /dev/shm even) is really common these days. As are lots of cores on laptops and better memory bandwidth to multiple cores than any single core.

          I’m not quite sure, but your first paragraph seems to be saying that RFC 4180 requires fields to be quoted, but the RFC definitely allows both “escaped” and “unescaped” fields (to use the wording of the RFC grammar). I agree there are CSV files that aren’t parseable by RFC 4180 rules, but these days they’re few and far between. In addition, GoAWK’s code uses the (effect of the) encoding/csv.Reader LazyQuotes=true and FieldsPerRecord=-1 options, making it a bit more forgiving.
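
          As a rough sketch, those forgiving settings look like this when using encoding/csv directly (GoAWK implements the equivalent behaviour in its own parser):

          ```go
          // Sketch of a "forgiving" CSV reader using encoding/csv options.
          package main

          import (
              "encoding/csv"
              "fmt"
              "strings"
          )

          func readForgiving(input string) ([][]string, error) {
              r := csv.NewReader(strings.NewReader(input))
              r.LazyQuotes = true    // tolerate stray quotes inside fields
              r.FieldsPerRecord = -1 // don't force a uniform column count
              return r.ReadAll()
          }

          func main() {
              // Second row has a bare quote and a different field count; both parse.
              recs, err := readForgiving("a,b,c\nx,y\"z\n")
              if err != nil {
                  panic(err)
              }
              fmt.Println(len(recs)) // prints: 2
          }
          ```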

          Interesting points about csvquote and multicore – I hadn’t considered that. (Though eking out the last bit of performance is not my biggest concern at the moment. For GoAWK’s CSV support, I wanted to get the semantics “right”, and I’ll focus on performance later.)

            Apologies for any unclarity. I am saying since most software lets you change the 2 delims, “CSV/TSV” terms are muddy. I get the need to refer to the specific. I even called my program “c2tsv”. So, I’m aware of all that, but I do not think there is “one true CSV”, except for the abstract umbrella term suggesting “mostly” split-parseable data with scare quotes to be clarified. Pretending there is a true “CSV” confuses more than helps conversations/innovation.

            Being “forgiving” is part of the problem, which is where my later text came from. It is actually the optionality of quoting, combined with its non-locality, that causes trouble with unindexed segmentation. With well-delimited rows and random access files, you can just jump to 1/nCPU of the way into the stat-knowable bytes and scan for a row delimiter (as done in my link). With optional quotes you may have to scan back to the beginning, and this may even be the common case, e.g. if there is no quote in the whole file.
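
            To make the segmentation concrete, here is a small sketch (my own illustration, with a hypothetical function name) of picking chunk start offsets by jumping to i/n of the file size and scanning to the next newline – sound only when rows are well delimited, i.e. quoted fields cannot contain raw newlines:

            ```go
            // Sketch: compute row-aligned chunk boundaries for parallel workers.
            package main

            import (
                "bytes"
                "fmt"
            )

            // chunkStarts returns up to n offsets into data; each offset after the
            // first lands just past a newline, so every worker starts on a row boundary.
            func chunkStarts(data []byte, n int) []int {
                starts := []int{0}
                for i := 1; i < n; i++ {
                    guess := len(data) * i / n
                    nl := bytes.IndexByte(data[guess:], '\n') // scan forward to a row delimiter
                    if nl < 0 {
                        break // no further boundary; fewer chunks than requested
                    }
                    start := guess + nl + 1
                    if start > starts[len(starts)-1] {
                        starts = append(starts, start)
                    }
                }
                return starts
            }

            func main() {
                data := []byte("r1,a\nr2,b\nr3,c\nr4,d\n")
                fmt.Println(chunkStarts(data, 2)) // prints: [0 15]
            }
            ```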

            You can build a [rowNumber, byteNumber] index, but then that touches all data. There may be “most of the time for me & my data” workarounds, but with “delimiter-escaped soundly split parseable CSV” (or maybe DECSV for short) there is no problem (other than bigness coming from many rows not a few giant ones and statistical regularity).

            csvquote just moves delims to “unlikely” chars. So, you’d be in trouble with JPG fields (but many Unix utils would choke on NUL bytes anyway…So, maybe out of scope).
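
            To illustrate the substitution idea (a sketch of my own; the stand-in bytes 0x1F/0x1E are choices for illustration – the real csvquote handles more cases):

            ```go
            // Sketch of the csvquote idea: mask delimiters inside quoted fields
            // with nonprinting stand-ins, split naively downstream, then restore.
            package main

            import (
                "fmt"
                "strings"
            )

            // sanitize replaces commas and newlines that occur inside quoted
            // fields with 0x1F and 0x1E respectively.
            func sanitize(s string) string {
                var b strings.Builder
                inQuotes := false
                for i := 0; i < len(s); i++ {
                    c := s[i]
                    switch {
                    case c == '"':
                        inQuotes = !inQuotes
                        b.WriteByte(c)
                    case inQuotes && c == ',':
                        b.WriteByte(0x1F) // unit separator stands in for comma
                    case inQuotes && c == '\n':
                        b.WriteByte(0x1E) // record separator stands in for newline
                    default:
                        b.WriteByte(c)
                    }
                }
                return b.String()
            }

            // restore is the inverse mapping.
            func restore(s string) string {
                s = strings.ReplaceAll(s, "\x1f", ",")
                return strings.ReplaceAll(s, "\x1e", "\n")
            }

            func main() {
                in := "alice,\"hi, there\"\n"
                out := sanitize(in)
                fmt.Println(strings.Count(out, ",")) // prints: 1 (only the real delimiter)
                fmt.Println(restore(out) == in)      // prints: true
            }
            ```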

            I suspect “C vs. DOS/Win FS” backslash wars or “Excel job security” may have inhibited DECSV adoption. You can pick a less colliding escape, but many end-user subpopulations are more likely to interpret "\n" in fields correctly than, say, "^An". I’ve not read through your format war back-links, but I’d expect these ideas arose there. Sorry, I did not intend to rehash all that, but it’s all related.

            Whenever there is a complex format like RFC 4180 CSV with a simpler-to-work-with alternative like DECSV, it seems better to have a lib/program convert to the simpler one. Besides the parallelism from a pipeline or segmentation, one could hope that it might someday become what is used commonly without conversion, with easier parsing by programmer-users without bugs/test suites/etc.