Disappointed they didn’t consider SQLite. I think it’s an ideal format for this kind of thing - it’s compact, robust, massively widely-supported and lets you store multiple tables in the same binary file.
It does have one significant disadvantage: it doesn’t support a “date” type, so you would need to store your dates as unix timestamps or ISO strings and parse them back into dates when you load it.
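For what it’s worth, the ISO-string workaround is only a couple of lines. A rough sketch using just the Python standard library; the `events` table, its columns and the helper names are made up for illustration, not anything from the article:

```python
import sqlite3
from datetime import date


def save_events(conn, events):
    """Store (name, date) pairs, writing each date as an ISO 8601 string."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (name TEXT, happened_on TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [(name, day.isoformat()) for name, day in events],
    )
    conn.commit()


def load_events(conn):
    """Read the rows back and parse the ISO strings into date objects."""
    rows = conn.execute(
        "SELECT name, happened_on FROM events ORDER BY happened_on"
    ).fetchall()
    return [(name, date.fromisoformat(text)) for name, text in rows]


# Hypothetical data, in-memory database so nothing touches disk.
conn = sqlite3.connect(":memory:")
save_events(conn, [("launch", date(2023, 5, 1)), ("beta", date(2022, 11, 15))])
loaded = load_events(conn)
```

A nice side effect of ISO strings is that they sort chronologically as plain text, so the `ORDER BY` still does the right thing.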
JSON and CSV have the same problem though.
Interesting idea, I will add that to my article idea list. My suspicion is that Pandas will have terrible memory usage efficiency from loading it.
I think you know exactly what I’m going to link to: avoid potato programming and migrate your data transformations from Pandas to SQL!
Polars’ lazy mode is another alternative to Pandas’ imperative API. And there is a use case for imperative APIs, which is why Polars has both: https://pythonspeed.com/articles/polars-exploratory-data-analysis-vs-production/
Then use DuckDB instead: https://duckdb.org/2021/05/14/sql-on-pandas.html
If you really must use some separated text format, use TSV instead of CSV. A tab is so much less likely to occur in your dataset than a comma, which makes quoting and escaping a lot more friendly on the eyes.
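To make the quoting point concrete, here is a small sketch with Python’s standard `csv` module. The sample rows and the `dump` helper are invented for the example:

```python
import csv
import io

# Hypothetical rows; the address contains a comma but no tab.
rows = [["name", "address"], ["Ada", "12 Main St, Springfield"]]


def dump(rows, delimiter):
    """Serialize rows with the given delimiter and return the text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


as_csv = dump(rows, ",")   # the address has to be wrapped in quotes
as_tsv = dump(rows, "\t")  # no quoting needed for the same data

# Reading the TSV back gives the original rows.
round_trip = list(csv.reader(io.StringIO(as_tsv), delimiter="\t"))
```

With the comma delimiter the writer has to quote the address field; with tabs the same data goes out untouched.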
I still find it kinda strange that even ASCII has a Record Separator (ASCII code 30) defined, yet nobody uses it. Probably because nobody can type it on their keyboard.
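If anyone wants to try it, a toy version is tiny. This sketch pairs the Record Separator (code 30) with the Unit Separator (code 31) for fields; the `encode` and `decode` names are mine, and it assumes the data never contains either control character:

```python
RS = "\x1e"  # ASCII 30, Record Separator: goes between rows
US = "\x1f"  # ASCII 31, Unit Separator: goes between fields in a row


def encode(rows):
    """Join fields with US and rows with RS; no quoting or escaping."""
    return RS.join(US.join(fields) for fields in rows)


def decode(text):
    """Split on RS to get rows, then on US to get fields."""
    return [record.split(US) for record in text.split(RS)]


# Commas, tabs and newlines inside a field need no special handling.
rows = [["name", "note"], ["Ada", "likes commas, tabs\t and\nnewlines"]]
encoded = encode(rows)
decoded = decode(encoded)
```

The catch is the one already mentioned: you can’t type or easily see these characters in most editors, so the file stops being pleasant to edit by hand.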
I went in thinking “Umm parquet if you can?” and read the article, which settled primarily on “Use parquet most of the time unless you have a good reason not to, in which case use something that isn’t pandas until you need to use pandas” and then high fived the article.