1. 3

  2. 6

    PDF is the only electronic document format that fully supports redaction. […] There’s really no model for redaction of HTML-based web content.

    This seems disingenuous to me. I’ll grant them that you don’t often see redactions in HTML, but I don’t see why you couldn’t have something like:

    <span class="redacted harm-to-ongoing-matter">██████████</span>

    where there is a .redacted CSS rule to black out the span, a .harm-to-ongoing-matter rule to display some text clarifying which kind of redaction this is, and every character in the span is U+2588 FULL BLOCK to give roughly the same effect in browsers that don’t support CSS. Or maybe it would be better for screen readers if you used “redacted” as the text of the span. Either way, I don’t think it’s true that PDF has some insurmountable advantage over HTML here.
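
    For what it’s worth, generating that markup is simple. Here is a minimal Python sketch, assuming the class names from the example above; `redact_span` and its `use_blocks` flag are hypothetical names I made up for illustration. The point is that the secret text never reaches the output, only its length does (and not even that if you use the word “redacted” instead).

    ```python
    from html import escape

    FULL_BLOCK = "\u2588"  # U+2588 FULL BLOCK


    def redact_span(secret, reason, use_blocks=True):
        """Build a redaction span; `secret` itself is never written to the output."""
        # Blocks preserve the visual width of the original; the plain word
        # "redacted" may read better in a screen reader and hides the length.
        placeholder = FULL_BLOCK * len(secret) if use_blocks else "redacted"
        classes = escape(f"redacted {reason}", quote=True)
        return f'<span class="{classes}">{placeholder}</span>'


    html = redact_span("John Smith", "harm-to-ongoing-matter")
    plain = redact_span("John Smith", "harm-to-ongoing-matter", use_blocks=False)
    ```

    Note that the block version still leaks how long the redacted text was, which is the same trade-off a black bar in a PDF makes.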

    People have been conditioned to see these ugly, court-formatted PDFs as more official, so that’s a reason to keep using PDF, or at least to keep it as an option. But from a technical perspective, I don’t trust an HTML file served from the DOJ website over HTTPS any differently than I trust a PDF from the same source. And the client-side malleability of the display of HTML pages is a huge advantage ergonomically.

    1. 1

      I think what they’re aiming at is the fundamental difference between HTML describing content and PDFs describing pages, though. Even without redaction, it’s impossible with HTML to guarantee a certain look to the final document, because fonts, text rendering, and layout differ between (otherwise compliant) implementations.

      I imagine PDF over HTML makes it easier to do, for example, a side-by-side comparison between redacted and unredacted versions, and also communicate findings within a group. (“Take a look at page 9, the third paragraph…”)

      EDIT: I will take issue with the authenticity argument of the article. Freezing a document in any format should be done with a digital signature. There’s really no guarantee a PDF isn’t edited unless it’s signed, and PDF has no advantage over other formats in that regard. I think, ideally, Mueller would deliver a signed PDF, and Barr would deliver a signed and redacted PDF, with some sort of cryptographic hash of the original contained within.
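
      To make the “hash of the original contained within” part concrete, here is a rough Python sketch using only the standard library. The byte strings are stand-ins for real files, and `manifest` and `matches_original` are hypothetical names. This only shows the digest half; an actual signature over the manifest would need a key pair and a signing library, which I’m leaving out.

      ```python
      import hashlib
      import hmac


      def sha256_hex(data):
          """Hex-encoded SHA-256 digest of a byte string."""
          return hashlib.sha256(data).hexdigest()


      # Stand-ins for the real documents (assumed, not actual file contents).
      original = b"full text of the unredacted report"
      redacted = b"full text of the report with passages removed"

      # What the redacted release could carry alongside its own signature.
      manifest = {
          "redacted_sha256": sha256_hex(redacted),
          "original_sha256": sha256_hex(original),
      }


      def matches_original(candidate, manifest):
          """True if `candidate` is byte-for-byte the document the manifest commits to."""
          return hmac.compare_digest(sha256_hex(candidate), manifest["original_sha256"])
      ```

      Anyone who later obtains the unredacted file could then check it against the published digest without the redacted release revealing anything beyond that digest.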

      1. 2

        I guess I don’t understand why one would assume that “describing pages” is more important than “describing content” in this context, though. The Mueller report is effectively a single stream of text (with the occasional header and footnote), not a complicated layout like a magazine. I would argue that the accessibility benefits of HTML outweigh the consistent-look benefits of PDFs in this case (and in many other cases).

        Being able to compare two versions visually is definitely an advantage of PDF, but comparing two versions programmatically would be a lot easier with HTML. And the ability to specify exact locations visually in PDFs is nice, but remember that in HTML you have hyperlinks at your disposal—you can send a link to your collaborators and not have to worry about counting paragraphs! A properly marked up document would even allow you to say, “this span of text is page 7, line 9 of the canonical PDF version.”
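
        As a sketch of what “comparing programmatically” could look like, here is a small Python example using only the standard library. The two HTML snippets are invented, and `TextExtractor` and `visible_words` are hypothetical helper names; a real comparison would need to handle whitespace, footnotes, and so on more carefully.

        ```python
        import difflib
        from html.parser import HTMLParser


        class TextExtractor(HTMLParser):
            """Collects the text nodes of a document and ignores the tags."""

            def __init__(self):
                super().__init__()
                self.chunks = []

            def handle_data(self, data):
                self.chunks.append(data)


        def visible_words(markup):
            """Return the words of the text content, in document order."""
            parser = TextExtractor()
            parser.feed(markup)
            parser.close()
            return "".join(parser.chunks).split()


        # Invented sample versions of the same paragraph.
        old = '<p>The witness met <span class="redacted">██████</span> in June.</p>'
        new = '<p>The witness met <span class="redacted">██████</span> in July.</p>'

        old_words = visible_words(old)
        new_words = visible_words(new)

        # Keep only the stretches where the two word sequences differ.
        matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
        changes = [
            (tag, old_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]
        ```

        Here `changes` ends up holding a single replacement of “June.” with “July.”, and the redaction span is treated like any other word.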

        In an ideal world, I agree with what you say about signatures, but what if this report had been digitally signed? How would you verify that the key really belonged to who it said it belonged to… by looking at a fingerprint on the same site you downloaded the PDF from anyway? Maybe some kind of Extended Validation TLS certificate is the best we can do right now, given the current state of PKI.

    2. 3

      but the redaction software available to DoJ (see below) is fully effective at redacting born-digital PDF files,

      I’ve heard that before…