
ConnectorX is the fastest tool for loading data from databases into dataframes in Python. You can get up to a 10x speedup and 3x lower memory consumption when loading data from databases into Pandas, simply by changing pandas.read_sql to connectorx.read_sql!

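For example, here is a minimal sketch of the swap (the connection string, query, and partition column are placeholders for your own):

    import connectorx as cx

    conn = "postgresql://user:password@localhost:5432/mydb"  # placeholder
    query = "SELECT * FROM lineitem"                         # placeholder

    # Drop-in replacement for pandas.read_sql(query, engine):
    # returns a Pandas dataframe directly.
    df = cx.read_sql(conn, query)

    # Optionally read in parallel by partitioning on an integer column.
    df = cx.read_sql(conn, query, partition_on="l_orderkey", partition_num=4)
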
Check out more information on our project here:

Github: https://github.com/sfu-db/connector-x

Blog: https://towardsdatascience.com/connectorx-the-fastest-way-to-load-data-from-databases-a65d4d4062d5

PyPI: https://pypi.org/project/connectorx/

crates.io: https://crates.io/crates/connectorx

Here are some questions we would love to hear your answers to:

  • Do you want us to cache query results on disk, so that loading the same table a second time is much faster, at the cost of some disk space?
  • Do you need to load data from different data sources and join / concat them into a single dataframe?
  • Other than relational databases, what other data sources do you want us to support?
  • Other than dataframes, what other destinations do you want us to support?
  • What other features do you want us to support?

If you have any other feedback or questions, please let us know!

  1.

    Author here! Happy to hear any feedback or answer any questions about our project.

    1.

      How would you reconcile your project with the concept of “potato programming” as applied to databases? I have seen many systems that are unnecessarily slow and inefficient because they focus on streaming data from a database into a general-purpose language rather than writing SQL to process the data inside the database.

      1.

        Good question. IMO whether to process the data inside the database using SQL (or even stored procedures), or to export the data and process it in a programming language, really depends on the scenario. If your data science team is still building a model, you might want to export the data and use Python to finish the job quick and dirty. On the other hand, once you are productionizing a system, you might want to push most of the computation down into SQL, because that is more efficient.
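
        As a rough sketch of the difference (the table and column names here are hypothetical):

            import connectorx as cx

            conn = "postgresql://user:password@localhost:5432/mydb"  # placeholder

            # Quick and dirty: pull the raw rows out and aggregate in Pandas.
            df = cx.read_sql(conn, "SELECT region, amount FROM sales")
            totals = df.groupby("region")["amount"].sum()

            # Production: push the aggregation down into the database,
            # so only the small result set crosses the wire.
            totals = cx.read_sql(
                conn,
                "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
            )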

    2.

      Any plans on porting the loader to Julia and DataFrames.jl?

      1.

        Thanks for the suggestion! Indeed, we are also considering adding Julia support in the future.

      2.

        Looks great! What I don’t understand is how you achieve the memory savings after everything has been loaded. If you construct the same dataframe with the same data, how can you use less memory in the end? Is the pandas/dask implementation holding on to objects that should be garbage collected? Or are these numbers peak memory usage during the load?

        1.

          Yeah, we measured peak memory. It is very likely that lots of intermediate objects were held somewhere while Pandas was loading the data, although we haven’t investigated Pandas deeply.
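
          One way to reproduce this kind of peak-memory comparison is a sketch like the following, using the third-party memory_profiler package (the connection string and query are placeholders):

              import connectorx as cx
              import pandas as pd
              from memory_profiler import memory_usage
              from sqlalchemy import create_engine

              conn = "postgresql://user:password@localhost:5432/mydb"  # placeholder
              query = "SELECT * FROM lineitem"                         # placeholder

              def load_pandas():
                  return pd.read_sql(query, create_engine(conn))

              def load_connectorx():
                  return cx.read_sql(conn, query)

              # max_usage=True returns the peak resident memory (in MiB) observed
              # while the function runs, so temporaries during the load count too.
              peak_pd = memory_usage(load_pandas, max_usage=True)
              peak_cx = memory_usage(load_connectorx, max_usage=True)
              print(f"pandas peak: {peak_pd:.0f} MiB, connectorx peak: {peak_cx:.0f} MiB")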