1. 17

  2. 4

    Tools to query piles of flat files, like Trino if you’re self-hosting, AWS’s Athena or Redshift Spectrum, or (going by agent281’s comment) Azure Data Warehouse, can be really handy. A lot depends on what you’re trying to do, but queries that were going to scan a whole table anyway (or a whole partition) completed faster against files in S3 than on our well-provisioned prod DB server. Of course, the flip side is that it simply isn’t set up to seek around in the data, or to do writes (updates or even non-bulk appends).

    A term to search for about this stuff is “data lake,” but that obscures that It’s Just Files, perhaps lightly massaged to help efficient querying (partitioning or bucketing by date or other variables you often filter on, or tweaking the format).

    Also, even compressed text tends to be more compact than live tables in a database. Even with your DB’s compression on, it has to store secondary indexes, leave empty space in pages to efficiently write new rows, etc. And many tools including Presto/Trino and Athena read delimited text (CSV, TSV) or object-per-line JSON, all optionally gzipped. (Athena can query AWS application load balancer logs, for instance, which are more or less .tsv.gz.)
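
    As a quick illustration of the “It’s Just Files” point, gzipped TSV can be streamed row by row with nothing but the Python standard library (the file name below is made up):

```python
import csv
import gzip

def stream_tsv_gz(path):
    """Yield rows from a gzipped TSV without loading it all into memory."""
    # gzip.open in text mode decompresses transparently as we iterate
    with gzip.open(path, "rt", newline="") as f:
        yield from csv.reader(f, delimiter="\t")
```

    The same pattern works for object-per-line JSON: call json.loads on each line instead of using csv.reader.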

    I did end up using pyorc to convert the .tsv.gz’s I initially generated into ORC, a columnar format similar to Parquet. It was 2x denser than the .tsv.gz and had some side benefits, e.g. properly typed values and standardized handling of things like newlines and NULLs.

    But mostly I want to say that dumping the data into a more compact format was both easier than I expected and sacrificed less in ease of querying than I figured.

    1. 3

      Does anyone know if there is a good way to use parquet files in python without using a data frame? I have a number of pipelines that stream data to files, but I’m not a huge fan of pandas. (The lack of nullable ints (until recently) really turned me off of it.) Last time I checked it didn’t seem like there were any good options.

      1. 3

        https://github.com/uber/petastorm (not sure if it uses pandas under the hood or not)

        1. 1

          It looks like they are using pyarrow under the covers.

        2. 3

          PyArrow has a Table type. Would that work for you?


          1. 2

            I’ll check it out! We try to stream data in our pipelines to keep memory overhead low (as much as that’s possible in python) so I’ve been hesitant to use table based solutions. I know that it’s possible to stream pandas via HDF5, but again, I was a bit turned off because of other issues. So I’m admittedly a bit ignorant of the space as a result.

          2. 2

            ORC is a similar columnar compressed format and pyorc reads/writes it with a nice minimal interface.

            1. 1

              That’s the sort of interface that I was hoping for. It looks like ORC is a compatible data format for Azure Data Warehouse (er, Synapse Analytics) too. I’ll have to test it out. Thanks!

              1. 2

                Yep! For anyone else reading, you can query ORC via SQL with Presto/Trino if you’re self hosting, and if you’re in Amazon-land it works with AWS’s Athena (their Presto deployment) and Redshift (for import or via Spectrum).

            2. 2

              pyspark is a very heavy tool, but it’s got a much better type system than pandas. I think when you say “streaming” you would be satisfied by Spark’s laziness, but Spark also supports streaming from Kafka (queue-style message processing) or from Delta tables (streaming changes to a table).

            3. 2

              What size would the sqlite file be if it was also gzipped?
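
              One way to measure with just the standard library (the database here is a throwaway example, not the real one):

```python
import gzip
import os
import sqlite3
import tempfile

def gzipped_db_size(db_path):
    """Return (raw, gzipped) sizes in bytes for a SQLite file."""
    with open(db_path, "rb") as f:
        raw = f.read()
    return len(raw), len(gzip.compress(raw))

# Build a small throwaway database with repetitive text columns
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE repos (name TEXT, license TEXT)")
con.executemany("INSERT INTO repos VALUES (?, ?)",
                [("repo%d" % i, "mit") for i in range(1000)])
con.commit()
con.close()

raw, zipped = gzipped_db_size(path)
```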

              1. 7

                Parquet tooling can read the compressed file directly and decompress on the fly, though.

                1. 3

                  471MB zipped.

                  1. 1

                    Has anyone seriously investigated this? https://sqlite.org/zipvfs/doc/trunk/www/readme.wiki

                    1. 1

                      It’s not very popular due to the license and fees. There is an interesting project (sqlite-zstd, going by the log below) that should work really well here: since some columns hold tons of repeated values or JSON text, zstd’s dictionary-based compression does really well.

                      [2020-12-23T21:14:06Z INFO sqlite_zstd::transparent] Compressed 5228 rows with dictid=111. Total size of entries before: 91.97MB, afterwards: 1.41MB, (average: before=17.59kB, after=268B)
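
                      zstd’s dictionary mode isn’t in the Python standard library, but zlib’s preset-dictionary support shows the same idea: seed the compressor with bytes that look like your rows, and short, repetitive values shrink dramatically (the row and dictionary below are invented):

```python
import zlib

# Hand-crafted "dictionary" of bytes resembling the rows we expect;
# zstd trains such a dictionary automatically from sample data.
zdict = b'{"license": "mit", "language": "Python", "archived": false}'

# A short row sharing most of its content with the dictionary
row = b'{"license": "mit", "language": "Rust", "archived": false}'

plain = zlib.compress(row)

co = zlib.compressobj(zdict=zdict)
with_dict = co.compress(row) + co.flush()

# Decompression must be seeded with the same dictionary
do = zlib.decompressobj(zdict=zdict)
assert do.decompress(with_dict) == row
```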

                      1. 1

                        I would want to investigate using ZSTD (with a dictionary) and Parquet as well. Columnar formats are better for OLAP.

                      2. 1

                        Why would anybody investigate this when Parquet and ORC are specifically designed for OLAP use cases? What does SQLite add for a data scientist who wants to process the GitHub export data?

                      3. 1

                        Apples to oranges.

                      4. 2

                        To be fair to SQLite, some normalization of this schema would have helped a lot, such as creating tables for licenses and repo URLs and referring to those with foreign keys. That would have reduced those columns in the main table to “a few” bytes per row (SQLite uses varints so I can’t say exactly; probably 1 byte for the license, 2 or 3 for the URL?)
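
                        That normalization is straightforward in SQLite; a sketch with invented column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE licenses (
        id   INTEGER PRIMARY KEY,  -- small varint instead of repeated text
        name TEXT UNIQUE
    );
    CREATE TABLE repos (
        id         INTEGER PRIMARY KEY,
        url        TEXT,
        license_id INTEGER REFERENCES licenses(id)
    );
""")

def license_id(name):
    """Insert-or-fetch a license row so repos store only a small integer."""
    con.execute("INSERT OR IGNORE INTO licenses(name) VALUES (?)", (name,))
    return con.execute("SELECT id FROM licenses WHERE name = ?",
                       (name,)).fetchone()[0]

for url, lic in [("https://example.com/a", "mit"),
                 ("https://example.com/b", "mit"),
                 ("https://example.com/c", "apache-2.0")]:
    con.execute("INSERT INTO repos(url, license_id) VALUES (?, ?)",
                (url, license_id(lic)))
```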

                        1. 2

                          In row-based systems, that’s your only option. In column-based systems, you can normalize or rely on run-length-encoding. I do a lot of data engineering, and Parquet is great because I frequently don’t have control over the schema or I don’t have multi-table transactions.

                          1. 2

                            Normal forms were created in an era when disk space was hilariously expensive. You are right for an OLTP use case, even today. However, OLAP use cases favor flat schemas with no JOINs, because they are much simpler to query. The RCFile white paper investigated the best storage layout for OLAP and concluded that columnar formats win. A lot of effort has also gone into columnar optimizations (RLE, dictionary encoding, etc.) that make normalization largely irrelevant. Why split data across many tables when you can realize the same benefits without doing so?
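
                            Run-length encoding, one of those columnar optimizations, is why a sorted or low-cardinality column costs almost nothing to store; a toy version:

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]

# Six cells of a low-cardinality "license" column collapse to two pairs,
# roughly the saving that normalization would otherwise be used to get.
encoded = rle_encode(["mit", "mit", "mit", "mit", "apache-2.0", "apache-2.0"])
```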

                          2. 2

                            What program generated that little ASCII bar chart at the end of the post?