1. 30
  1. 8

    Hey @itamarst, I saw this paragraph on your about page: “My big picture goal: [..] fight climate change. Faster software means less CO₂ emissions; speeding up scientific computing allows us to speed up the creation of new solutions, from vaccines to more efficient energy grids.”

    Just as a piece of stranger-on-the-internet feedback: this blog post is getting shared among our data scientists, who work on controlling grid loads to match the available renewable energy supply, for example by optimizing EV charger schedulers and heat pump / home heating models for hundreds of thousands of EU homes.

    So, you’re having exactly the kind of impact you’re hoping for, thanks for sharing your knowledge!

    1. 4

      Ooh, thank you for letting me know!

    2. 4

      Never heard of the Parquet data format before but it looks pretty interesting. Anyone ever used it in practice?

      1. 10

        I use Parquet at work all the time, though largely through Delta tables, so most of the complexity is abstracted away. The main selling points talked about online are the row-columnar layout (row group boundaries can be tuned to e.g. store images), page compression, and dictionary encodings. But what I found really useful after using it for a while is its excellent support for nested schemas and arrays, implemented by so-called “definition” and “repetition” levels. Dremel, on which Parquet is based, was built specifically as an “interactive ad-hoc query system for analysis of read-only nested data”. At first everyone on the team wanted to normalize the tables to avoid those complex types, but we quickly found that ingesting heavily nested JSON without flattening improved both the ingestion experience and the reader experience, without hurting the performance of either. There are other similar formats, like ORC, but Parquet is actively developed (they’re adding first-class Bloom filters right now!), and Apache Spark is well integrated with it.
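
        For a rough idea of what that looks like in practice, here’s a minimal sketch using pyarrow (file name and records are made up for illustration): nested structs and lists go straight into a Parquet file with no flattening, and readers can still project individual top-level columns.

        ```python
        import pyarrow as pa
        import pyarrow.parquet as pq

        # Heavily nested JSON-like records: structs and lists are stored natively
        # via definition/repetition levels, so no flattening is needed.
        records = [
            {"id": 1, "meta": {"site": "NL-001", "tags": ["solar", "ev"]},
             "readings": [{"ts": 1, "kwh": 0.4}, {"ts": 2, "kwh": 0.7}]},
            {"id": 2, "meta": {"site": "DE-042", "tags": []},
             "readings": []},
        ]

        table = pa.Table.from_pylist(records)     # schema is inferred: structs + lists
        pq.write_table(table, "readings.parquet")

        # Readers can project just the columns they care about.
        subset = pq.read_table("readings.parquet", columns=["id", "meta"])
        print(subset.schema)
        ```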

        1. 8

          I use it daily. But to me, the format seems a bit chaotic, since different readers may support different feature sets. For example, when I tried writing a Parquet file with categorical values and delta encodings, plus zstd compression, the file was only readable by my Go program, not by pandas, Dremio, or Clickhouse :(

          Edit: this is a really great article outlining some of the features in parquet: https://wesmckinney.com/blog/python-parquet-multithreading/
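
          For what it’s worth, this is roughly how such a file gets produced, sketched here with a recent pyarrow rather than Go, and with made-up column names and paths. The combination is valid Parquet, but whether a given reader can open the result depends on which encodings and codecs it implements.

          ```python
          import pyarrow as pa
          import pyarrow.parquet as pq

          table = pa.table({"ts": [1, 2, 3], "category": ["a", "b", "a"]})

          # Delta encoding plus zstd compression: perfectly valid Parquet, but not
          # every reader supports this combination of features.
          pq.write_table(
              table,
              "events.parquet",
              compression="zstd",
              use_dictionary=False,  # required when setting explicit column encodings
              column_encoding={"ts": "DELTA_BINARY_PACKED"},
          )
          ```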

          1. 6

            I have; I used it with Spark and AWS Athena, and it works quite well.

            Its main selling points (compared to something like compressed JSON):

            • It’s column-based. That has a lot of advantages, but to me the main one is that readers can read only a subset of columns, and it will cost about the same as if the file contained only those columns. It also enables some efficient compression of numeric columns.

            • Files are split into blocks, and the start of each block is stored in the file footer. This means that if a file is split into 10 blocks, 10 workers can read it concurrently without needing any synchronization.

            • Predicate pushdown. Each block contains some metadata about the values of its columns (e.g. min/max). This lets readers skip blocks that can’t possibly matter to them: for example, if you are looking for rows with age > 10 and a block’s metadata says the max age is 9, you can skip that block entirely (sketched below).

            • Self-describing: the schema is stored in the file, so the file is self-sufficient.

            It’s my go-to for data warehousing.

            I wouldn’t want to be implementing a reader though!
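
            To make the column-subset and predicate-pushdown points concrete, here’s a minimal sketch with pyarrow’s dataset API (file and column names are hypothetical): the reader fetches only the requested columns and uses the per-block min/max statistics to skip blocks that can’t match the filter.

            ```python
            import pyarrow.dataset as ds

            # Hypothetical Parquet file containing 'name' and 'age' columns, among others.
            dataset = ds.dataset("people.parquet", format="parquet")

            # Only the two requested columns are read, and blocks (row groups) whose
            # max(age) <= 10 are skipped based on the footer statistics.
            table = dataset.to_table(columns=["name", "age"], filter=ds.field("age") > 10)
            print(table.num_rows)
            ```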

            1. 3

              “Anyone ever used it in practice?”

              I know you’re probably asking for real details, but it’s also worth saying it’s the de facto standard “big data” file format. Basically all modern “big data” systems support importing Parquet as a first-class format, or directly use Parquet as their preferred storage format (usually in an external blob store like S3).
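
              As a trivial illustration of how ubiquitous the support is (file name and data made up), pandas can round-trip Parquet out of the box once pyarrow or fastparquet is installed:

              ```python
              import pandas as pd

              df = pd.DataFrame({"name": ["a", "b", "c"], "age": [8, 12, 30]})

              # Write to Parquet and read back only one column.
              df.to_parquet("people.parquet")
              ages = pd.read_parquet("people.parquet", columns=["age"])
              print(ages)
              ```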