1. 18
  1. 5

    Nice post! I’m kind of curious whether pandas.read_csv(...) is faster when you specify the column types via dtype={...}. I guess I’d expect it to be, but I’m not sure by how much. Didn’t know about PyArrow either.
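
    Something like this is what I have in mind (untested; the column names are made up):

    ```python
    import pandas as pd

    # Hypothetical columns; swap in your own schema.
    # Explicit dtypes let pandas skip per-column type inference.
    df = pd.read_csv(
        "data.csv",
        dtype={"user_id": "int64", "score": "float64", "label": "category"},
    )

    # The PyArrow-backed parser (pandas >= 1.4) would be another thing to benchmark:
    df = pd.read_csv("data.csv", engine="pyarrow")
    ```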

    1. 2

      I’m a huge fan of your work, Itamar! Thanks for yet again providing useful bits of information! As an aside: is it some sort of oversight or implicit bias that Itamar can practically always post his own work, while someone like Adam Johnson^1 gets a comment saying it’s not okay?

      1. 2

        I am literally going to use this at work right now, specifically the pyarrow bit!

        1. 1

          We’ve switched to using Feather files instead of Parquet at work, and they’re snappy as hell. I highly recommend them!
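
          If anyone wants to try it, here’s a minimal sketch with pandas (the DataFrame and path are placeholders; writing Feather requires pyarrow installed):

          ```python
          import pandas as pd

          df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

          # Round-trip through Feather.
          df.to_feather("data.feather")
          df2 = pd.read_feather("data.feather")
          ```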

          1. 2

            I tried Feather, and for this example it was slower. But presumably it depends a lot on data structure…

            https://arrow.apache.org/faq/ has some notes on choosing a format that I found informative.

            1. 3

              Having worked on both implementations: Parquet can be “slower” when it has to do a lot of decompression and the disk pages aren’t cached. Feather feels snappier when the disk cache is hot, since there’s no decoding, just one giant mmap.
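
              A minimal sketch of the mmap path, assuming an uncompressed Feather file at a made-up path:

              ```python
              import pyarrow.feather as feather

              # With memory_map=True, an uncompressed Feather file can be read
              # zero-copy straight out of the OS page cache.
              table = feather.read_table("data.feather", memory_map=True)
              ```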

              1. 1

                I found some more recent performance comparisons between Parquet, Feather, and other formats, and apparently LZ4-compressed Feather files are somehow even faster?! I’m curious if anyone around here has any idea why that might be.
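
                For reference, a sketch of what I mean by LZ4-compressed (compression is a write_feather option in pyarrow; the path is made up):

                ```python
                import pandas as pd
                import pyarrow.feather as feather

                df = pd.DataFrame({"a": range(1000)})

                # Feather V2 supports per-column compression; "lz4" and "zstd" are the options.
                feather.write_feather(df, "data.lz4.feather", compression="lz4")
                ```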

                1. 2

                  In general you should benchmark with your own particular data. Given varying compressibility of different data, disk I/O bandwidth, access to parallelism, read patterns, etc., it’s hard to generalize (among reasonably designed file formats, anyway). CSV is not reasonably designed, so it’s easy to reject it on the first pass.
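
                  A minimal benchmark sketch along those lines (swap in your own DataFrame, and take the best of several runs to separate cold-cache from hot-cache behavior):

                  ```python
                  import time
                  import pandas as pd

                  # Placeholder data; results depend heavily on your data's shape and compressibility.
                  df = pd.DataFrame({"a": range(1_000_000)})
                  df.to_parquet("bench.parquet")
                  df.to_feather("bench.feather")

                  for name, reader, path in [
                      ("parquet", pd.read_parquet, "bench.parquet"),
                      ("feather", pd.read_feather, "bench.feather"),
                  ]:
                      start = time.perf_counter()
                      reader(path)
                      print(f"{name}: {time.perf_counter() - start:.3f}s")
                  ```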