I’ve been playing with Arrow recently in what you might call a “data science” application and it’s been pretty decent. Basically I’ve got an ever-growing dataset of observations and model predictions for the thing that I’m (very slowly) writing a scientific paper about. The data lives in a PostgreSQL database, because that’s what I’ve got, and because the most recent portion of it is constantly being updated as the data comes in from different model runs.
First, I wrote an endpoint on the server that would do a DB query and serve up the full resultset as JSON to the different graphing and analysis tools that I wrote, which would load it with Pandas read_json and do whatever.
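For what it's worth, that first version was roughly this shape (a sketch, assuming Flask and psycopg2 on the server side; the query and column names are placeholders, not my actual schema):

    # First cut: run the query, materialize everything, dump it as one JSON blob.
    import json
    import flask
    import psycopg2

    app = flask.Flask(__name__)

    @app.route("/observations")
    def observations():
        conn = psycopg2.connect("dbname=paper")
        with conn, conn.cursor() as cur:
            cur.execute("SELECT run_id, ts, observed, predicted FROM results")
            cols = [d[0] for d in cur.description]
            rows = [dict(zip(cols, r)) for r in cur.fetchall()]  # whole resultset in RAM
        return flask.Response(json.dumps(rows, default=str), mimetype="application/json")

    # Client side: df = pandas.read_json("http://server/observations")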
When that started getting too slow, I modified the server code to use DB cursors and to stream the result as JSON a few thousand rows at a time, instead of materializing the whole result and then sending it.
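Concretely, that amounts to a named (server-side) cursor plus a generator response, something like this (again a sketch; Flask, psycopg2, and the column names are my stand-ins):

    # Streaming variant: a named cursor keeps the resultset on the Postgres
    # side, and the generator emits newline-delimited JSON a chunk at a time.
    import json
    import flask
    import psycopg2

    app = flask.Flask(__name__)

    @app.route("/observations")
    def observations():
        def generate():
            conn = psycopg2.connect("dbname=paper")
            with conn, conn.cursor(name="stream") as cur:  # named => server-side cursor
                cur.itersize = 5000                        # a few thousand rows per round trip
                cur.execute("SELECT run_id, ts, observed, predicted FROM results")
                cols = [d[0] for d in cur.description]
                for row in cur:
                    yield json.dumps(dict(zip(cols, row)), default=str) + "\n"
        return flask.Response(generate(), mimetype="application/x-ndjson")

On the client side, pandas.read_json(url, lines=True) reads the newline-delimited form directly.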
When that started getting too slow, I wrote a new script that would fetch the JSON from the server and write it out to disk as a Parquet dataset, and then had my graphs and whatnot read from the Parquet. That way I can produce a dozen different graphs with a single transfer from the server. (This is also good for reproducibility: if I archive the dataset, anyone can run the same scripts against it later on.)
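The sync script itself is small (sketch; the URL, paths, and partition column are placeholders):

    # Pull the feed once, write a partitioned Parquet dataset, and let every
    # graphing script read the local copy instead of hitting the server.
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.read_json("http://server/observations", lines=True)
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_to_dataset(table, root_path="data/observations", partition_cols=["run_id"])

    # In each analysis script:
    #   df = pq.read_table("data/observations").to_pandas()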
And when that started getting slow and bandwidth-intensive, I switched from streaming JSON to streaming Arrow. Like the article says, Arrow and Parquet go well together: Parquet compresses down to a smaller size and is better for random access, but Arrow is streamable. It took me a little while to figure out how to drive PyArrow, how to get chunking and schema inference to mix, and a few other things, but in the end the code ended up pretty manageable. The lower serialization overhead about quadrupled the speed of syncing the data.
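The non-obvious bit for me was that an Arrow IPC stream carries a single schema up front, so you can't let each chunk infer its own types; you pin the schema (or infer it once from the first chunk) and build every record batch against it. A minimal sketch of the shape that takes, with a made-up schema:

    # Stream record batches over one fixed schema; the reader on the other end
    # sees an ordinary Arrow IPC stream it can turn back into a table.
    import pyarrow as pa
    import pyarrow.ipc as ipc

    schema = pa.schema([
        ("run_id", pa.int64()),
        ("ts", pa.timestamp("us")),
        ("observed", pa.float64()),
        ("predicted", pa.float64()),
    ])

    def stream_batches(chunks, sink):
        """chunks: iterable of column-name -> list mappings, e.g. built from fetchmany()."""
        with ipc.new_stream(sink, schema) as writer:
            for chunk in chunks:
                writer.write_batch(pa.RecordBatch.from_pydict(chunk, schema=schema))

    # Client side:
    #   reader = ipc.open_stream(response.raw)
    #   df = reader.read_all().to_pandas()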