1. 52
  1. 22

    Does anyone, anywhere ever get taught how to design a file format? It seems a giant blind spot that people seldom talk about, unless like this person they end up needing to parse or emit a particularly hairy one.

    A while ago I discovered RIFF and was just like “why are we not using this everywhere?”

    1. 19

      In university it came as a natural side effect of OS design (file systems and IPC) and network communication (device-independent exchange). That’s enough for a start; after that you go by cautionary tales and try to figure out which ones apply in your particular context. You’ll be hard pressed to find universal truths here to be ‘taught’. Overgeneralise and you create a poor file system (zip); overspecialise and it’s not a format, it’s code.

      The latter might be a bit surprising, but take ZSTD in dictionary mode. You can noticeably increase information density by training it on case-specific data. The resulting dictionary needs to go somewhere, as it is not necessarily part of the bitstream, and the decoding stage needs to know about it. Do you preshare it and embed it in the application, or put it in the bitstream? Both can be argued; both have far-reaching consequences.

      The master level for file formats, if you need something to study, I’d say is media container formats, e.g. MKV. You have multiple data streams of different sizes; some are relevant and some are to be skipped, and it is the consumer that decides. Seeking is often important, and the reference frames may be at highly variable offsets. There are streaming / timing components, as your spinning-disk media with a 30 GB file has considerable seek times and rarely enough bandwidth or cache. The files are shared in contexts that easily introduce partial corruption that accumulates over time; a bit flip in a subtitle stream shouldn’t make the entire file unplayable, and so on.

      RIFF, for example, is a TLV (tag-length-value) format. These are fraught with dangers. It is also the design everyone comes up with independently, and it goes by many, many names. I won’t spoil or go into it all here; part of the value is the journey. Follow RIFF to EXIF to XMP and see how the rationale expands to “make sense” when you suddenly have a JPEG image with a Base64-encoded, XML-indexed JPEG image inside of it as part of a metadata block. Look at presentations by Ange Albertini (e.g. Funky File Formats: https://www.youtube.com/watch?v=hdCs6bPM4is ), Meredith Patterson (Science of Insecurity: https://www.youtube.com/watch?v=3kEfedtQVOY ) and Travis Goodspeed (Packets in Packets: https://www.youtube.com/watch?v=euMHlV6MNqs ).
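      Concretely, a RIFF-style TLV stream can be walked in a few lines. This is only a sketch of the chunk layout (4-byte ASCII tag, 4-byte little-endian length, payload, padding to even offsets), using made-up tags rather than real RIFF data:

```python
# Sketch of a defensive RIFF-style TLV chunk walker (illustrative, not a
# full RIFF parser: real RIFF starts with a "RIFF" header and form type).
import struct

def walk_chunks(data: bytes):
    """Yield (tag, payload) pairs; stop cleanly on truncated input."""
    offset = 0
    while offset + 8 <= len(data):
        tag, length = struct.unpack_from("<4sI", data, offset)
        offset += 8
        if offset + length > len(data):
            break  # truncated chunk: a defensive parser must never over-read
        yield tag.decode("ascii"), data[offset:offset + length]
        offset += length + (length & 1)  # RIFF pads chunks to even sizes

# Build a toy two-chunk blob (3-byte payload needs 1 pad byte) and walk it.
blob = (b"abcd" + struct.pack("<I", 3) + b"xyz\x00"
        + b"efgh" + struct.pack("<I", 2) + b"hi")
chunks = list(walk_chunks(blob))
```

      The length check before slicing is the point: a TLV parser that trusts the declared length is exactly the kind of thing the linked talks exploit.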

      1. 18

        Being a self-taught programmer, I think the study of file formats is underrated. I only learned file format parsing to help me write post-exploitation malware that targets the ELF file format. I also used to hack on ClamAV for a job, and there I learned how to parse arbitrary file formats more defensively, such that malware cannot target the AV itself.

        I’m in the process of writing a proprietary file format right this very moment for ${DAYJOB}. The prior version of the file format was incredibly poorly designed, rigid, and impossible to extend in the future. I’m grateful for the lessons ELF and ClamAV taught me, otherwise I’d likely end up making the same mistakes.

        1. 15

          There’s a field of IT security called “langsec” that’s basically trying to tell people how to design file formats that are easier to write secure parsers for. But it’s not widely known and as far as I can tell usually not considered when designing new formats.

          I think this talk gives a good introduction: https://www.youtube.com/watch?v=3kEfedtQVOY

          1. 10

            The laziest answer is don’t bother and let Sqlite be your on-disk file format. Then you also get interop with any geek wanting to mess about with your data, basically free.

            It’s certainly not ideal in some situations, but it’s probably a good sane default for most situations.

            sqlite links about it: https://sqlite.org/affcase1.html and https://sqlite.org/fasterthanfs.html

            That said, I agree it would be great to have nice docs about various tradeoffs in designing file formats. So far the best we seem to have are gotcha posts like this one.
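            For anyone who hasn’t tried it, the SQLite approach is only a few lines with the Python standard library. Everything here (the table, the `application_id` value, the file name) is made up for illustration:

```python
# Sketch of "SQLite as your on-disk file format": the document is just a
# database file, tagged and inspectable by any generic SQLite tool.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "drawing.doc")  # hypothetical extension
db = sqlite3.connect(path)
db.execute("PRAGMA application_id = 0x4D594150")  # brand the file as ours
db.execute("CREATE TABLE shapes (id INTEGER PRIMARY KEY,"
           " kind TEXT, x REAL, y REAL)")
db.execute("INSERT INTO shapes (kind, x, y) VALUES ('circle', 1.0, 2.0)")
db.commit()
db.close()

# Any SQLite tool (or geek) can now open the "document" and poke at it.
db = sqlite3.connect(path)
rows = db.execute("SELECT kind, x, y FROM shapes").fetchall()
```

            Setting `PRAGMA application_id` is what `sqlite3_analyzer` and `file(1)` use to recognise a database as belonging to a particular application rather than being a generic `.db`.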

            1. 3

              Or CBOR, flatbuffers/capnproto/etc, just any existing solid serialization format. If storing just “regular” data. Things like multimedia come with special requirements that might make reusing these formats difficult.

              1. 2

                There are three use cases that make designing a file format difficult:

                • Save on one platform / architecture, load on another (portability).
                • Save on one version of your program, load on a newer one (backwards compatibility).
                • Save on one version of your program, load and modify on an older one (forwards compatibility).

                Of these, SQLite completely fixes the portability problem by defining platform- and architecture-agnostic data types. It transforms the other two from file format design problems into schema design problems.

                Backwards compatibility is fairly simple to preserve in both cases: read the old file / database and write out the new version. It may be slightly easier to provide a schema migration query in SQLite than to maintain both the old reader and the new writer for a custom file format, but you’re also likely to end up with a more complex schema for a SQLite-based format than for something custom.

                SQLite can also help a bit with forwards compatibility. In custom formats this is normally implemented by storing the version of the program that created the file and requiring older versions to preserve record types they don’t understand, so that a newer version can detect that the file was modified by a program that didn’t know what those records meant and fix up any changes. Foreign key constraints and similar may let SQLite avoid some of this, but it remains a non-trivial problem.
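                As a sketch of the backwards-compatibility side with SQLite, one common pattern (my illustration, not from the comment above) stores a format version in `PRAGMA user_version` and applies any missing migrations on open; the schema here is invented:

```python
# Versioned open: run pending schema migrations up to the current format.
import sqlite3

MIGRATIONS = {
    1: "CREATE TABLE doc (key TEXT PRIMARY KEY, value TEXT)",
    2: "ALTER TABLE doc ADD COLUMN modified INTEGER DEFAULT 0",
}

def open_doc(path: str) -> sqlite3.Connection:
    db = sqlite3.connect(path)
    version = db.execute("PRAGMA user_version").fetchone()[0]  # 0 for new files
    for v in sorted(MIGRATIONS):
        if v > version:  # older file: apply the missing migrations in order
            db.execute(MIGRATIONS[v])
            db.execute(f"PRAGMA user_version = {v}")
    db.commit()
    return db

db = open_doc(":memory:")
cols = [row[1] for row in db.execute("PRAGMA table_info(doc)")]
```

                Forwards compatibility is still the hard part, as the comment says: an old program opening a new file won’t know what the extra tables or columns mean.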

              2. 10

                Excellent point — same thing goes for network protocols, though they’re less common.

                I learned a lot from RFC 3117, “On The Design Of Application Protocols” when I read it about 20 years ago. It’s part of the specs for an obsolete protocol called BEEP, but it’s pretty high level and goes into depth on topics like how to frame variable length records, which is relevant to file formats as well. Overall it’s one of the best-written RFCs I’ve seen, and I highly recommend it.
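                To make the framing topic concrete: the simplest of the strategies RFC 3117 compares is length-prefixing each record, which works identically for files and streams. A minimal sketch (the 4-byte big-endian prefix is an arbitrary choice on my part):

```python
# Length-prefixed record framing: each record is preceded by a 4-byte
# big-endian byte count. Illustrative only.
import struct

def write_records(records) -> bytes:
    out = bytearray()
    for rec in records:
        out += struct.pack(">I", len(rec)) + rec
    return bytes(out)

def read_records(data: bytes):
    records, offset = [], 0
    while offset + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        if offset + length > len(data):
            raise ValueError("truncated record")  # fail loudly on corruption
        records.append(data[offset:offset + length])
        offset += length
    return records

framed = write_records([b"hello", b"", b"world!"])
```

                The trade-off the RFC walks through: length prefixes make readers simple and binary-safe, but unlike delimiter framing you can’t start writing a record before you know its size.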

                1. 5

                  IFF, the inspiration for RIFF, was used everywhere on the Amiga, more or less.

                  1. 2

                    Its ubiquity also had the advantage that you could open iffparse.library to walk through any IFF-based format instead of writing your own (buggy) format parser.

                  2. 4

                    I had the same question when I learned about the structure of ASN.1, which is probably only used to store cryptographic data in certificates (maybe there are some other uses, but I haven’t seen any) but could probably be used anywhere (it’s also a TLV structure).

                    1. 7

                      ASN.1 is used heavily in telecommunications, and is used for SNMP and LDAP. Having implemented more of SNMP than I care to remember and worked with some low-level telecoms stuff, I can say ASN.1 gives me nightmares. I know the reasons for it, but it’s definitely more complicated than it seems…
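                      Part of why it’s more complicated than it seems: in BER/DER even the length field is variable-length. A sketch decoding just a DER tag/length header (definite lengths and single-byte tags only, which a real parser cannot assume):

```python
# Decode the tag and length octets of a DER element (simplified sketch:
# single-byte tags, definite lengths only).
def der_header(data: bytes):
    tag = data[0]
    first = data[1]
    if first < 0x80:           # short form: length fits in one byte
        return tag, first, 2
    n = first & 0x7F           # long form: the next n bytes hold the length
    length = int.from_bytes(data[2:2 + n], "big")
    return tag, length, 2 + n

# 0x30 = SEQUENCE; 0x82 0x01 0x00 = long-form encoding of length 256.
tag, length, header_len = der_header(bytes([0x30, 0x82, 0x01, 0x00]))
```

                      DER at least forbids the indefinite lengths and redundant encodings that BER allows; handling those is where implementations historically went wrong.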

                    2. 3

                      I don’t think “how to design a file format” is often taught but I’ve been taught many examples of file and packet formats with critiques about what parts were good or bad.

                      RIFF itself may not be common but its ideas are; PNG most notably. Also BEEP/BXXP, a now dead 2000-era packet format. But these days human readable delimited formats like JSON and XML are more in fashion.

                      The reality is that no product succeeds or fails on the quality of its data formats. Their fate is determined by other forces, and then whatever formats they use are what we are stuck with.

                    3. 3

                      Its history is equally soiled: https://www.youtube.com/watch?v=uNXCd2EATSo

                      1. 2

                        Is the self-extracting portion of zip files actually machine code? Does that mean that ZIP is an executable file format? Isn’t this awful for portability?

                        1. 3

                          Normally ZIP files don’t contain the self-extracting portion, it’s only used when someone wants to build an “installer”, or distribute the ZIP file to someone who might not have a ZIP unpacker.

                          And yes, most likely the machine code will only run on one OS, but I don’t think that’s an issue. If the user tries to unpack a self-extracting ZIP archive on an unsupported OS, it’s normally possible to use a ZIP unpacking application that works on that OS to unpack it just as if it were a normal ZIP archive.

                        2. 2

                          You might think this is nonsense, but you have to remember that PKZIP comes from the era of floppy disks, when reading an entire ZIP file’s contents and writing out a brand-new ZIP file could be an extremely slow process. The ability to delete a file just by updating the central directory, or to add a file by reading the existing central directory, appending the new data, then writing a new central directory, was a desirable feature.
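                          Python’s zipfile module still exposes this append behaviour: opening an existing archive in mode "a" leaves the existing member data in place and rewrites only the tail of the file (the new entry plus a fresh central directory). A small sketch:

```python
# Demonstrate ZIP's append-friendly layout: adding a member grows the
# archive without rewriting the existing entries' data.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "first")
size_after_first = len(buf.getvalue())

with zipfile.ZipFile(buf, "a") as z:   # append: earlier member data stays put
    z.writestr("b.txt", "second")

with zipfile.ZipFile(buf) as z:
    names = z.namelist()
```

                          The new entry is written where the old central directory started, then a new central directory goes at the end, which is exactly the incremental update described above.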

                          Pretty much all databases do this sort of incremental modification. Deleting a (large) record just unhooks it from the directory/tree; its space will be reclaimed later. In an append-only database like CouchDB or CouchStore, all changes result in appending new data and then a new directory (aka B-tree root.)

                          Of course Zip probably isn’t paying attention to durability — modifying a file in place is fraught with peril, since it can corrupt the file if interrupted by a crash/panic/power failure. Don’t try this at home!