1. 11

    1. 5

      I’ve been playing with Arrow recently in what you might call a “data science” application and it’s been pretty decent. Basically I’ve got an ever-growing dataset of observations and model predictions for the thing that I’m (very slowly) writing a scientific paper about. The data lives in a PostgreSQL database, because that’s what I’ve got, and because the most recent portion of it is constantly being updated as the data comes in from different model runs.

      First, I wrote an endpoint on the server that would do a DB query and serve up the full resultset as JSON to the different graphing and analysis tools that I wrote, which would load it with Pandas read_json and do whatever.

      When that started getting too slow, I modified the server code to use DB cursors and to stream the result as JSON a few thousand rows at a time, instead of materializing the whole result and then sending it.

      When that started getting too slow, I wrote a new script that would fetch the JSON from the server and write it out to disk as a Parquet dataset, and then had my graphs and whatnot read from the Parquet. That way I can produce a dozen different graphs with a single transfer from the server. (This is also good for reproducibility: if I archive the dataset, anyone can run the same scripts against it later on.)

      And when that started getting slow and bandwidth-intensive, I switched from streaming JSON to streaming Arrow. Like the article says, Arrow and Parquet go well together. Parquet compresses down to a smaller size and is better for random access, but Arrow is streamable. It took me a little while to figure out how to drive PyArrow, how to get chunking and schema inference to mix, and a few other things, but in the end the code ended up pretty manageable. The lower serialization overhad about quadrupled the speed of syncing the data.