1. 16
  1.  

  2. 1

    I love this blog post. I like the narrative of an exploratory journey, the dead-ends and head scratching, and the humbleness. And I like how the author didn’t just stop at the format - they kept going and going into parsing the XML then the language itself. That kind of end-to-end depth is rarely shown in blog posts.

    Just some advice for future data spelunkers in the future, and I’m sure the blog post author already knows this. When they said:

    Scanning through the file with xxd, I did not find any visible structure…The first simple thing to try was to just attempt to find a ZIP file by going through the bytes of the file one by one. After all, it was very plausible that the file starts with some Apple header followed by some compressed data.

    Most file formats have “magic headers”, some prefix of one or two or four bytes that “uniquely” (lol if only) identify the format. Here a nice reference to get started with: https://en.wikipedia.org/wiki/List_of_file_signatures. So when I saw the hex output I was thinking “hmm f74d….hmm 78da”. And 78da happens to be a magic header for ZLIB with best compression.

    But if you have a big set of magic headers, how do you efficiently find them in a blob of bytes from the start? I faced a similar problem at work a while ago, trying to find JPEG’s in a stream of bytes that can’t be loaded into memory and may only be scanned once, and settled on using a Rabin-Karp string search for multiple prefixes. [1] [2]. Rabin-Karp is easy to implement. I’m sure there are other interesting solutions.

    The disadvantage of “searching for prefixes from the start then stopping when you find one” is just because you find a prefix doesn’t mean that format is there, you need to do the hard work of attempting to deserialize it, and if it fails remember that that (prefix, byte pointer) point is not interesting and skip it. This is the advantage of the author’s approach, eventually you need to get in there and ZLIB-decompress from some starting point.

    [1] Rabin-Karp single pattern implementation: https://algs4.cs.princeton.edu/53substring/RabinKarp.java.html

    [2] For Rabin-Karp multiple pattern search, if multiple patterns do not all share the same length you hash on the shortest length pattern then on hash collisions iterate through patterns whose hash matches, e.g. https://github.com/glkz/rabinkarp/blob/master/rabinkarp.go