Author here! Happy to hear any feedback or answer any questions about our project.
How would you reconcile your project with the concept of “potato programming” as applied to databases? I have seen many systems that are unnecessarily slow and inefficient because they are focused on streaming data from a database into a general-purpose language rather than writing SQL in order to process the data inside the database.
Good question. IMO whether to process the data inside the database using SQL (or even a stored procedure), or to export the data and process it with a general-purpose language, really depends on the scenario. If you are still at the stage where your data science team is building a model, you might want to export the data and use Python to finish the job quick and dirty. On the other hand, if you are productionizing a system, you might want to push most of the computation down into SQL because that’s more efficient.
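To make the trade-off concrete, here is a minimal sketch (using a hypothetical toy `orders` table in an in-memory SQLite database, not the project's actual API) contrasting the two styles: streaming every row out and aggregating in Python versus pushing the aggregation down into SQL so only one row per group crosses the database boundary.

```python
import sqlite3
import statistics

# Hypothetical toy table; in practice this would be a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 20.0)],
)

# Exploration style: stream every row out, aggregate in Python.
rows = conn.execute("SELECT region, amount FROM orders").fetchall()
by_region = {}
for region, amount in rows:
    by_region.setdefault(region, []).append(amount)
py_result = {r: statistics.mean(v) for r, v in by_region.items()}

# Production style: push the aggregation down into SQL, so only
# one row per group is transferred out of the database.
sql_result = dict(
    conn.execute("SELECT region, AVG(amount) FROM orders GROUP BY region")
)

# Same answer either way; the difference is how much data moves.
assert py_result == sql_result
```

Both produce identical results; the pushdown version just moves far less data, which is what dominates once tables get large.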
Looks great! What I don’t understand is how you achieve the memory savings after everything has been loaded. If you construct the same dataframe with the same data, how can you use less memory in the end? Is the pandas/dask implementation holding on to objects that should be garbage collected? Or are these numbers peak memory usage during the load?
Yeah, we measured peak memory. It is very likely that lots of intermediate objects were held somewhere while Pandas loaded the data, although we haven’t investigated Pandas deeply.
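For anyone curious how peak (rather than final) memory differs, here is a small illustrative sketch using Python's standard `tracemalloc` module; the workload is a made-up stand-in, not our actual benchmark.

```python
import tracemalloc

tracemalloc.start()

# Stand-in for a dataframe load that builds temporary objects along the way.
data = [list(range(1000)) for _ in range(100)]
intermediate = [sum(row) for row in data]  # temporaries count toward the peak
del intermediate                           # freed, but already recorded

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak can exceed the final footprint because intermediates were
# alive partway through the load, then garbage collected.
assert peak >= current
```

This is why two loaders that end up with the same dataframe can still show very different peak numbers.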
Any plans on porting the loader to Julia and DataFrames.jl?
Thanks for the suggestion! We are indeed considering adding Julia support in the future.