1. 3

  2. 4

    Why is this compared to Pandas? It would be more appropriate to compare it to Dask or Apache Drill. Pandas by design always needs to load the whole file into memory. From then on, all computations are fast and no data ever needs to be loaded from disk again. To illustrate this, another unfair comparison would be to load a table and then execute a million different queries on it. Pandas would be way faster here, and DuckDB would always need to read from disk again. Dask is a lazily executed, query-optimizing extension of Pandas. It will push the query down to the file reader and execute Pandas operations only on the needed parts. Apache Drill is a SQL engine for files (CSV, JSON, Parquet, etc.). It's very powerful regarding storage backends and flexibility. Both tools can easily be installed on a laptop and are able to scale to a cluster. While DuckDB sounds like a neat tool, it only works for a very narrow use case. If it is actually faster* than those tools, I would love to see its optimizations implemented in those tools.

    *Significantly faster for actual workloads. It generally doesn't matter if something takes 0.05s or 0.07s.
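
    To make the "load once, query many times" argument concrete, here is a rough standard-library sketch. It does not use Pandas, Dask, Drill, or DuckDB at all. It only models the two access patterns as I described them above: one strategy reads the file once and answers every query from memory, the other rescans the file for each query. The table layout (`id`, `group`, `value`) and all function names are made up for illustration, and whether a real engine actually rescans on every query is an assumption of this model, not something the sketch proves.

```python
import csv
import os
import tempfile


def write_sample_csv(path, n_rows=1000):
    """Write a small hypothetical table with columns id, group, value."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "group", "value"])
        for i in range(n_rows):
            writer.writerow([i, i % 10, i * 2])


class CountingReader:
    """Streams rows from the CSV and counts how many times the file is scanned."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def rows(self):
        self.reads += 1
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                yield {
                    "id": int(row["id"]),
                    "group": int(row["group"]),
                    "value": int(row["value"]),
                }


def sum_for_group(rows, group):
    """A stand-in 'query': sum of value for one group."""
    return sum(r["value"] for r in rows if r["group"] == group)


def run_demo(n_queries=50):
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "table.csv")
        write_sample_csv(path)

        # Strategy A (the Pandas-like pattern): one full load, then
        # every query runs against the in-memory copy.
        eager = CountingReader(path)
        table = list(eager.rows())
        in_memory = [sum_for_group(table, q % 10) for q in range(n_queries)]

        # Strategy B (assumed rescan-per-query pattern): each query
        # streams the file again and filters while reading.
        scanning = CountingReader(path)
        from_disk = [
            sum_for_group(scanning.rows(), q % 10) for q in range(n_queries)
        ]

    return in_memory, from_disk, eager.reads, scanning.reads


if __name__ == "__main__":
    a, b, eager_reads, scan_reads = run_demo()
    print("same answers:", a == b)
    print("file scans, load-once:", eager_reads)
    print("file scans, rescan-per-query:", scan_reads)
```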