1. 10

  2. 3

    This is an important topic for me. Thank you for this great writeup.

    I have also used HDF5 to store some electrophysiological data, and I am currently using HDF5 to store genomic variant data. HDF5 seemed like a great way to store larger-than-memory data sets (I think it’s used a lot in the experimental physics community): you can access sections of the data, and so on. However, there are some gotchas I ran into (like this). My new project doesn’t involve extremely large files, at most several GB, but I’ve experienced weird corruption issues on occasion. It was also unclear to me how to do parallel writes, even though having data in separate “folders” in the HDF5 file would seem to make it parallel-writable.

    I’m glad someone took the time to investigate HDF5 thoroughly. I wonder whether there is a response from the experimental physics community, which I think uses it extensively.

    PS. For those who would like a tl;dr: it’s a little hidden in the article, but the author suggests storing the data as NumPy arrays or bare binary arrays, memory-mapping them for access, and using .json documents to carry the metadata.
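    A minimal sketch of that approach (the file names and array contents here are made up for illustration):

```python
import json
import numpy as np

# Write: a bare binary array plus a JSON sidecar carrying the metadata.
data = np.arange(12, dtype=np.float64).reshape(3, 4)
data.tofile("data.bin")
with open("data.json", "w") as f:
    json.dump({"dtype": "float64", "shape": data.shape}, f)

# Read: memory-map the binary file using the stored metadata.
with open("data.json") as f:
    meta = json.load(f)
arr = np.memmap("data.bin", dtype=meta["dtype"],
                mode="r", shape=tuple(meta["shape"]))
row = arr[1]  # lazy slice: only the touched pages are read from disk
```

    Because np.memmap only pages in the portions you touch, slicing a small region never loads the whole file.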

    1. 1

      Many filesystems do allow you to attach metadata to directories via extended attributes.
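      On Linux, for example, Python exposes this via os.setxattr / os.getxattr; a small sketch (the directory name and attribute key are hypothetical, and this requires a filesystem with user-xattr support):

```python
import os

os.makedirs("session_dir", exist_ok=True)  # hypothetical directory

# Attach metadata to the directory as an extended attribute.
# Unprivileged processes must use the "user." namespace on Linux.
os.setxattr("session_dir", "user.sample_rate", b"30000")

# Read it back later.
value = os.getxattr("session_dir", "user.sample_rate")
```

      Note that os.setxattr is only available on Linux, and not every filesystem supports extended attributes, which is part of why sidecar metadata files remain popular.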

      I wonder how different the story would have been if there had been a HDF5 implementation in a safe language available. In particular that would presumably have eliminated the crashing-and-corrupting-files problems.

      1. 1

        It still wouldn’t solve the speed issues (see the bit about memmap and numpy arrays) because they seem to be intrinsic to the format.

        1. 2

          Actually it sounds like the native HDF5 client library offers the mmap functionality, it’s just that the h5py wrapper doesn’t support it.

          Interaction with OS caching could be an issue but probably not the 100x issue that mmap is. IMO it should be the OS' responsibility to update its caching to handle application access patterns; if HDF5 took off I’m sure we’d see good OS support for it.

          1. 2

            After I posted this article last night, a colleague of mine double-checked some of the claims in the blog post. It looks like the blog author is incorrect about the 100x issue. Here is a trivially modified version of his notebook that illustrates the fix:

            https://gist.github.com/d7a06cc1e09054fabd3d

            1. 1

              It looks like the numpy one is also loading all of the data. I’m curious how fast it is when it doesn’t. Probably not significantly faster, but removing one significant op from h5py and not from numpy doesn’t make for a good comparison.

                1. 2

                  So I ran a benchmark with more %%timeit iterations and an additional experiment showing performance when you want to compute an aggregate over the dataset (just np.sum in this case).

                  In this case, np.memmap took 276 ms on a 10,000 x 10,000 matrix, while h5py took 4.68 s.

                  So for doing small slices, h5py and np.memmap are comparable. However, if you need a significant portion of the data (even if not all at once), I think np.memmap is pretty much always going to come out ahead. I would be interested to see what random access of the dataset looks like with each, or a plot of access time vs proportion of dataset sliced. However, I stopped here because it was enough to convince me that memmap is much faster for my usual use-cases.

                  https://gist.github.com/emallson/0e9ff54c14c85ba486e3

                  There is a table of final timing results at the bottom.
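                  A rough sketch of that kind of comparison (matrix size reduced from 10,000 x 10,000 so it runs quickly; the file names are made up, and absolute timings will vary with machine and OS cache state):

```python
import timeit
import numpy as np

# Build a test matrix and store it as a bare binary file.
n = 1000  # reduced from 10,000 for a quick run
data = np.random.rand(n, n)
data.tofile("bench.bin")

mm = np.memmap("bench.bin", dtype="float64", mode="r", shape=(n, n))

# Aggregate over the whole dataset, as in the np.sum experiment.
memmap_time = timeit.timeit(lambda: np.sum(mm), number=10)

# If h5py is installed, do the same through an HDF5 dataset.
try:
    import h5py
    with h5py.File("bench.h5", "w") as f:
        f.create_dataset("data", data=data)
    h5f = h5py.File("bench.h5", "r")
    dset = h5f["data"]
    h5py_time = timeit.timeit(lambda: np.sum(dset[:]), number=10)
    h5f.close()
except ImportError:
    h5py_time = None
```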