1. 27
  1.  

  2. 2

    I don’t disagree with the advice here, but in general you are probably better off just using SQLite as your data file format. It’s a great replacement for open().
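A minimal sketch of what "SQLite instead of open()" could look like, using Python's standard `sqlite3` module. The table names, the `save_document`/`load_document` helpers, and the use of `PRAGMA user_version` as a format version are illustrative assumptions, not something prescribed by the article or the comment above.

```python
import sqlite3

FORMAT_VERSION = 1  # hypothetical version number for this made-up document format


def save_document(path, title, notes):
    """Store a tiny 'document' in a SQLite file instead of a hand-rolled format."""
    con = sqlite3.connect(path)
    try:
        # SQLite reserves a header slot for an application-defined version number.
        con.execute(f"PRAGMA user_version = {FORMAT_VERSION}")
        with con:  # one transaction: commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)"
            )
            con.execute(
                "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)"
            )
            con.execute("INSERT OR REPLACE INTO meta VALUES ('title', ?)", (title,))
            con.execute("DELETE FROM notes")
            con.executemany(
                "INSERT INTO notes (body) VALUES (?)", [(n,) for n in notes]
            )
    finally:
        con.close()


def load_document(path):
    """Read the document back; returns (version, title, notes)."""
    con = sqlite3.connect(path)
    try:
        version = con.execute("PRAGMA user_version").fetchone()[0]
        title = con.execute(
            "SELECT value FROM meta WHERE key = 'title'"
        ).fetchone()[0]
        notes = [row[0] for row in con.execute("SELECT body FROM notes ORDER BY id")]
        return version, title, notes
    finally:
        con.close()
```

Saving is transactional, so a crash mid-write should leave the previous contents intact, which is one of the things a hand-written format would otherwise have to handle itself.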

    1. 2

      I disagree: a database is not always the solution, especially in fields like genomics, where importing into SQLite takes forever given the size of the datasets. Instead, binary file formats have saved the day, for example BigWig and BigBed.

      1. 1

        I agree that SQLite is not perfect for everything. I said IN GENERAL; there are definitely cases where SQLite is not the right answer, and it sounds like yours may be one of them. So I agree with your disagreement: SQLite is not always the solution, but that’s exactly what I said. :)

        From my perspective, it’s probably worth starting off in SQLite so you can ignore file format issues and focus on your actual problem, and only investigate alternatives, such as your own file format, when you hit roadblocks SQLite can’t help you out of. In my experience, those cases are few and far between. Multi-terabyte files are definitely not a great use case for SQLite.

    2. 2

      Interesting read for anyone looking for relevant advice or new aspects when designing your own binary file format.

      Some historical points might be slightly less relevant today, e.g. detecting CR/LF conversions instead of using checksums, but most of the advice given is very solid, especially on versioning and on common pitfalls such as dumping memory structures to disk.

      I would add that it is a good idea to look into modern serialization formats such as Protocol Buffers or FlatBuffers to avoid reinventing the wheel. I currently plan on combining a simple header with one of those for my next open source project.
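A minimal sketch of the "simple header plus serialized payload" idea, combining the points mentioned above: a magic number that also detects CR/LF conversion, an explicit version field, explicit little-endian packing rather than dumping an in-memory struct, and a checksum. Everything here is hypothetical (the `DEMO` magic, the field layout, the `pack_file`/`unpack_file` names); the payload is treated as opaque bytes, which in practice could be the output of Protocol Buffers or FlatBuffers.

```python
import struct
import zlib

# Hypothetical magic: a non-ASCII byte, a name, then CR LF, Ctrl-Z, LF.
# A text-mode transfer that rewrites line endings will corrupt it.
MAGIC = b"\x89DEMO\r\n\x1a\n"
VERSION = 1

# Explicit little-endian layout with no padding:
# 9-byte magic, u16 version, u32 payload length.
HEADER = struct.Struct("<9sHI")
TRAILER = struct.Struct("<I")  # CRC-32 over header + payload


def pack_file(payload: bytes) -> bytes:
    """Wrap an opaque payload in a header and a trailing checksum."""
    header = HEADER.pack(MAGIC, VERSION, len(payload))
    return header + payload + TRAILER.pack(zlib.crc32(header + payload))


def unpack_file(blob: bytes) -> bytes:
    """Validate the wrapper and return the payload, or raise ValueError."""
    if len(blob) < HEADER.size + TRAILER.size:
        raise ValueError("file too short")
    magic, version, length = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic (wrong file type or line endings converted)")
    if version > VERSION:
        raise ValueError(f"unsupported version {version}")
    end = HEADER.size + length
    if len(blob) != end + TRAILER.size:
        raise ValueError("length field does not match file size")
    (crc,) = TRAILER.unpack_from(blob, end)
    if zlib.crc32(blob[:end]) != crc:
        raise ValueError("checksum mismatch")
    return blob[HEADER.size:end]
```

A reader that checks the magic first can report "this file went through a text-mode transfer" separately from general corruption, which the checksum alone would not distinguish.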

      1. 1

        Pretty solid. I was recently tasked with reverse engineering a binary file format, and if the creator had used half of these hints I’d have been a happier person :P

        1. 1

          I’m in a similar boat: the file I’m dealing with uses exactly none of these, aside from values being consistently little-endian. Of course, the article was written a decade or so after this format was designed…