Are these benchmarks just timing a single invocation of the binary? Taking a single sample apiece seems less than ideal.
Cf. https://hackage.haskell.org/package/criterion https://github.com/bos/criterion http://www.serpentine.com/criterion/tutorial.html http://www.serpentine.com/blog/2014/08/08/criterion-1-0/
No. Average of 3: https://github.com/BurntSushi/xsv/blob/master/scripts/benchmark-basic
[Comment removed by author]
Which is why I used words like “rough”, “ballpark” and “unscientific.” I haven’t had the time or inclination to improve them yet.
As an aside, this is not a good article. It is technically accurate, and I agree with all the sentiment expressed. But you aren’t going to convince anyone to use better stats by linking an aggressive and hostile article like this. Honey vs vinegar, etc.
I take stats very seriously (I come from a molecular biology background, where proper stats are very important), and am all for encouraging developers to perform proper statistical tests. But this article is not really the most convincing argument to put forward. Just my opinion, but it seems needlessly hostile.
Well, I will defend CSV. It is not a horrible format. A horrible format is XLS and other spreadsheet formats. I get the craziest text formats from my local government and my business’s database provider, so I can’t help but ask: who is saying a 40 GB CSV is horrible?
Converting it is fine if you want, but I think everyone is still treating 1 GB as “Big Data,” and CSV can really handle a lot of data if you just up your RAM.
There are plenty of reasonable arguments both ways. One might say that CSV is bad because it is ill-specified or that RFC 4180 does not have wide adoption. Namely, many CSV parsers accept a superset of RFC 4180 and subtly differ in their semantics. Also, the CSV spec states that the text encoding is ASCII, which isn’t that helpful. And I’ve also heard people complain about Excel mangling CSV files, but I don’t know anything about that.
To your point (and in my experience), CSV works well in practice.
Meta point: I try to be upfront about motivation and limitations of my projects, and that’s all I was trying to do with this section of the README. It doesn’t mean that I’m endorsing the view that CSV is horrible, but it does mean that I recognize that as a reasonable criticism.
Nice tool! How does the index work? What is the data structure on disk?
The index is stupidly simple. It just records the byte offset in the CSV file at which each record starts. It always uses 8 bytes (big endian) to represent each integer, so the index has size N * 8 where N is the number of records. Jumping to a record is as simple as two file seek operations and telling the CSV parser to start reading.
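To make that concrete, here is a minimal sketch of what building such an index could look like. This is not xsv’s actual code (which handles quoted fields containing newlines); it assumes one record per line, and the function name and sample data are my own:

```rust
use std::io::{BufRead, Cursor};
use std::convert::TryInto;

/// Build a byte-offset index: for each record, store the byte position
/// at which it starts, encoded as a big-endian u64. The result has
/// size N * 8 for N records.
/// (Simplified: assumes one record per line, no quoted newlines.)
fn build_index(csv: &[u8]) -> Vec<u8> {
    let mut index = Vec::new();
    let mut pos: u64 = 0;
    for line in Cursor::new(csv).split(b'\n') {
        let line = line.unwrap();
        index.extend_from_slice(&pos.to_be_bytes());
        pos += line.len() as u64 + 1; // +1 for the newline byte
    }
    index
}

fn main() {
    let data = b"name,city\nalice,berlin\nbob,tokyo\n";
    let idx = build_index(data);
    // 3 records (header + 2 rows) * 8 bytes each
    assert_eq!(idx.len(), 3 * 8);
    // second record starts right after "name,city\n" (10 bytes)
    let off = u64::from_be_bytes(idx[8..16].try_into().unwrap());
    assert_eq!(off, 10);
    let offsets: Vec<u64> = idx
        .chunks(8)
        .map(|c| u64::from_be_bytes(c.try_into().unwrap()))
        .collect();
    println!("offsets: {:?}", offsets);
}
```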
N * 8
In theory, you could use the index with any CSV parser! You just have to seek the underlying reader first.
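A rough sketch of that “seek first, then parse” idea, using only the standard library (the function name and the hand-built index here are illustrative, not xsv’s API):

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::convert::TryInto;

/// Position any seekable reader at the start of the nth record,
/// using the N*8 big-endian byte-offset index. After this, any CSV
/// parser can start reading from the current position.
fn seek_to_record<R: Seek>(rdr: &mut R, index: &[u8], n: usize) -> std::io::Result<()> {
    let chunk = &index[n * 8..n * 8 + 8];
    let off = u64::from_be_bytes(chunk.try_into().unwrap());
    rdr.seek(SeekFrom::Start(off))?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let data = b"name,city\nalice,berlin\nbob,tokyo\n";
    // hand-built index: records start at bytes 0, 10, 23
    let mut index = Vec::new();
    for off in [0u64, 10, 23] {
        index.extend_from_slice(&off.to_be_bytes());
    }
    let mut rdr = Cursor::new(&data[..]);
    seek_to_record(&mut rdr, &index, 2)?;
    let mut rest = String::new();
    rdr.read_to_string(&mut rest)?;
    assert_eq!(rest, "bob,tokyo\n");
    println!("{}", rest.trim_end());
    Ok(())
}
```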
The code that writes the index is extremely simple. Even if you don’t know Rust, this should be pretty readable: https://github.com/BurntSushi/rust-csv/blob/master/src/index/mod.rs#L52-L69
Yes, the code is quite readable :)
If I understand it correctly, the index just gives the byte offset of each row in the CSV file, but it’s not sorted in any way? It is used to jump quickly to the nth row of the CSV file? Is there any other purpose?
It’s not sorted, no. Even this very simple index buys you a ton though. You can all of a sudden process CSV data in parallel because you can jump around to any portion of the data instantly. This greatly speeds up stat collection.
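The parallelism point can be sketched like so: split the file at offsets taken from the index and give each thread its own slice to scan. Everything here (the function name, the sample data, counting “berlin” rows as a stand-in for real stat collection) is illustrative, not xsv’s code:

```rust
use std::io::BufRead;
use std::thread;

/// Count rows containing `needle` by scanning disjoint chunks of the
/// file in parallel. `bounds` are (start, end) byte ranges whose
/// boundaries come from the record-offset index, so every chunk
/// starts exactly at a record boundary.
fn parallel_count(data: &'static [u8], bounds: &[(usize, usize)], needle: &'static str) -> usize {
    let handles: Vec<_> = bounds
        .iter()
        .map(|&(start, end)| {
            thread::spawn(move || {
                data[start..end]
                    .lines()
                    .filter(|l| l.as_ref().unwrap().contains(needle))
                    .count()
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // record offsets per the index: 0, 13, 23, 36
    let data: &'static [u8] = b"alice,berlin\nbob,tokyo\ncarol,berlin\ndave,paris\n";
    // split the file in two at the third record's offset
    let total = parallel_count(data, &[(0, 23), (23, data.len())], "berlin");
    assert_eq!(total, 2);
    println!("berlin rows: {}", total);
}
```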
With that said, a simple in memory hash index is used in the xsv join command. It just creates a map from join key to record number, and with the record number and the simpler index, you can jump around the data and join things pretty quickly. The code for building that index is here (still mostly readable, but my eyes are very comfortable with Rust at this point): https://github.com/BurntSushi/xsv/blob/master/src/cmd/join.rs#L334-L393
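The shape of that hash index can be sketched in a few lines: a map from join-key value to the numbers of the records containing it, which you then combine with the offset index to seek to matches. This is a simplified stand-in (naive comma splitting, my own function names), not the linked code:

```rust
use std::collections::HashMap;

/// In-memory hash join index: map each join-key value to the record
/// numbers that contain it. With a record number in hand, the
/// byte-offset index lets you seek straight to the full record.
fn build_join_index(rows: &[&str], key_col: usize) -> HashMap<String, Vec<usize>> {
    let mut idx: HashMap<String, Vec<usize>> = HashMap::new();
    for (rownum, row) in rows.iter().enumerate() {
        let key = row.split(',').nth(key_col).unwrap_or("").to_string();
        idx.entry(key).or_default().push(rownum);
    }
    idx
}

fn main() {
    let left = ["alice,berlin", "bob,tokyo", "carol,berlin"];
    let idx = build_join_index(&left, 1);
    // join against the right-hand table by probing the index
    let right = ["berlin,germany", "tokyo,japan"];
    let mut pairs = Vec::new();
    for row in &right {
        let key = row.split(',').next().unwrap();
        if let Some(matches) = idx.get(key) {
            for &r in matches {
                pairs.push((left[r], *row));
            }
        }
    }
    // berlin matches two left rows, tokyo matches one
    assert_eq!(pairs.len(), 3);
    println!("{:?}", pairs);
}
```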
Also, one of the keys to xsv’s speed (and its underlying CSV reader), is that it doesn’t actually force you to store an entire record in memory. You can iterate the CSV data at the field level, and it’s all safe because of Rust’s lifetime system. (The CSV parser does this by sharing a reference to its buffer. It’s also immutable so you can’t even screw up any internal state!) For example, this code will only allocate an array with the size of the join key, and it will completely skip over any intermediate allocation of the entire record.
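The borrow-only-what-you-need idea can be illustrated without the CSV parser at all; `str::split` already yields borrowed slices into the input buffer, so the only allocation is for the one field you keep. (The real zero-copy machinery lives in the csv crate; this naive comma split is just a stand-in.)

```rust
/// Pull one field out of a record by borrowing it from the input
/// buffer. The returned &str points into `record`: no allocation.
/// (Naive split; real CSV parsing must handle quoting.)
fn extract_key(record: &str, key_col: usize) -> Option<&str> {
    record.split(',').nth(key_col)
}

fn main() {
    let record = "alice,berlin,engineer,42";
    // Only this to_string() allocates, and only key-sized memory;
    // the rest of the record is never copied.
    let key = extract_key(record, 1).map(str::to_string);
    assert_eq!(key.as_deref(), Some("berlin"));
    println!("join key: {:?}", key);
}
```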
Understood. Thanks for the explanation!
With that said, a simple in memory hash index is used in the xsv join command.
How does the join command work if the hash index doesn’t fit in the computer’s RAM?
A more general question: I know you are skilled in several programming languages. Why did you choose Rust for this project instead of Go, Haskell or C?
It doesn’t. ;-) Note that it can still work if the full CSV data doesn’t fit in memory, only the join key itself has to fit into memory.
xsv tries to be as conservative with memory usage as possible, so if your CSV data exceeds usable memory, you’re still probably OK. But if you get to a point where a column or two can exceed all memory, then you’ll be in trouble. In practice, join keys are usually pretty small, so they’ll fit into memory even with a large number of rows.
xsv stops at the point where continuing would require maintaining sophisticated on-disk data structures. If the in memory hash index becomes a problem, I might consider moving that to disk like I did for the simpler index, but I think that’s probably as far as I’ll go.
I don’t think there’s a satisfying answer to this question. At the moment, I’m having a lot of fun with Rust, so a lot of the problems I have, I end up wanting to hit them with the Rust hammer.
Rust, like Haskell, is worth learning if only to have your mind bent by the borrow checker. Everything after that is gravy! :-)
Note that it can still work if the full CSV data doesn’t fit in memory, only the join key itself has to fit into memory.
xsv stops at the point where continuing would require maintaining sophisticated on-disk data structures.
It sounds like a reasonable tradeoff for most use cases.
And thanks for sharing the “why” of Rust!
One neat thing that you all might be interested in is that xsv’s command line interface is tested with QuickCheck (plus a healthy dose of standard unit tests). For example, here are some tests that check whether xsv sort is correct or not: https://github.com/BurntSushi/xsv/blob/master/tests/test_sort.rs
Using QuickCheck for this was really awesome because it revealed dozens of bugs related to how I handled corner cases in the CSV data, particularly if the handling of those corner cases was different in the presence of an index or not. (The presence of an index usually results in a completely different code path.)
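For readers unfamiliar with the idea, the spirit of a property-based sort test looks something like the sketch below. QuickCheck would generate the random inputs and shrink failures for you; here a tiny hand-rolled LCG stands in (std has no RNG), and std’s `sort` stands in for the command under test:

```rust
/// Stand-in for the command under test (`xsv sort` in real life).
fn sort_under_test(rows: &[String]) -> Vec<String> {
    let mut out = rows.to_vec();
    out.sort();
    out
}

/// The property: the output must be ordered, and it must be a
/// permutation of the input (checked by comparing sorted multisets).
fn prop_sorted(rows: &[String]) -> bool {
    let out = sort_under_test(rows);
    let ordered = out.windows(2).all(|w| w[0] <= w[1]);
    let mut expected = rows.to_vec();
    expected.sort();
    ordered && out == expected
}

/// Tiny deterministic LCG, a stand-in for QuickCheck's generators.
struct Lcg(u64);
impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

fn main() {
    let mut rng = Lcg(0xdead_beef);
    // 100 random cases, including empty and single-row inputs --
    // exactly the corner cases that tend to shake out bugs.
    for _ in 0..100 {
        let n = (rng.next() % 10) as usize;
        let rows: Vec<String> = (0..n).map(|_| format!("row{:02}", rng.next() % 50)).collect();
        assert!(prop_sorted(&rows));
    }
    println!("100 random cases passed");
}
```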