1. 14

  2. 4

    It’s an OK approach for an introductory demo, but it contains a number of incorrect statements.

    PDF is not a text-based format; calling it one is misleading. PDF is a binary format. Here’s what the spec says about it:

    PDF files are represented as sequences of 8-bit binary bytes. A PDF file is designed to be portable across all platforms and operating systems. The binary representation is intended to be generated, transported, and consumed directly, without translation between native character sets, end-of-line representations, or other conventions used on various platforms.

    On streams. Streams are weird in PDF. There is a stream end marker (the endstream keyword), but the extent of a stream is specified by the /Length entry of its dictionary. So to read a stream, one first needs to parse the dictionary, get the length, read the stream start marker, and read exactly that many bytes. Then check that the stream end marker is present, and error out if it is not. On one hand, this makes the format a bit more robust. It also makes it possible to include a literal endstream in the data (contrary to what slide 100 states). On the other hand, it makes writing a PDF parser a bit awkward. A parser can only be built in one particular way: straight from binary to semantic representation. It’s impossible to write a correct standalone tokenizer for PDF, for example, because the tokenizer can’t know where a stream ends without the parsed dictionary. It’s also impossible to use the approach taken by other binary formats that have a rigid structure. In BMP, for example, the size field sits at a constant offset and has a constant width. This is not the case in PDF.
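
The stream-reading steps above can be sketched roughly as follows. This is only an illustration, not a real parser: `read_stream` is a hypothetical helper, and it assumes the dictionary has already been parsed and that /Length was a direct integer (in real files it may be an indirect reference that has to be resolved first).

```python
from __future__ import annotations


def read_stream(data: bytes, pos: int, length: int) -> tuple[bytes, int]:
    """Read a stream body whose `stream` keyword starts at data[pos].

    `length` is assumed to be the already-resolved /Length value.
    Returns the body and the offset just past `endstream`.
    """
    if not data.startswith(b"stream", pos):
        raise ValueError("expected 'stream' keyword")
    pos += len(b"stream")

    # The keyword is followed by CRLF or a lone LF before the data starts.
    if data.startswith(b"\r\n", pos):
        pos += 2
    elif data.startswith(b"\n", pos):
        pos += 1
    else:
        raise ValueError("expected end-of-line after 'stream'")

    # Take exactly `length` bytes; the content is never scanned for keywords.
    body = data[pos:pos + length]
    if len(body) != length:
        raise ValueError("stream is shorter than /Length")
    pos += length

    # An optional end-of-line may sit between the data and `endstream`.
    if data.startswith(b"\r\n", pos):
        pos += 2
    elif data[pos:pos + 1] in (b"\r", b"\n"):
        pos += 1

    if not data.startswith(b"endstream", pos):
        raise ValueError("missing 'endstream' after /Length bytes")
    return body, pos + len(b"endstream")


# The payload contains a literal "endstream"; /Length makes that harmless.
payload = b"abc endstream xyz"
raw = b"stream\n" + payload + b"\nendstream\nendobj"
body, end = read_stream(raw, 0, len(payload))
assert body == payload

# A naive scan for the end marker would cut the data short.
naive = raw[len(b"stream\n"):raw.index(b"endstream")]
assert naive == b"abc "
```

The length has to come in as an argument, which is the awkward part: the byte-level reader cannot do its job until the dictionary has been interpreted.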