So does anyone know if they deliberately make the document non-searchable or is that just an accident?
It’s probably a scanned document that was originally faxed after being printed and snail mailed using Fedex (because USPS sucks) to the person in the next cubicle.
The PDF contains images, not text.
Those are pictures of the pages embedded in a PDF. In the past I have used pdftk to split a similar file into individual pages, converted those to bitmap images, and then run them through gocr to turn them back into text. It wasn’t 100% accurate, but it beat retyping the PDF.
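For anyone who wants to try the same thing, here is a rough sketch of that split, convert, OCR pipeline. It assumes pdftk, pdftoppm (from poppler-utils) and gocr are installed and on the PATH. The comment above doesn't say which tool did the bitmap conversion, so pdftoppm is my guess at a stand-in, and the function names and the 300 dpi setting are just illustrative.

```python
import subprocess
from pathlib import Path


def burst_command(pdf, workdir):
    # pdftk "burst" writes one single-page PDF per page of the input.
    return ["pdftk", str(pdf), "burst", "output",
            str(Path(workdir) / "page_%03d.pdf")]


def page_commands(page_pdf):
    # Assumption: pdftoppm does the PDF-to-bitmap step (the original
    # comment does not name a tool). -gray makes it write a .pgm file,
    # -singlefile keeps the output name free of a page-number suffix.
    stem = Path(page_pdf).with_suffix("")
    return [
        ["pdftoppm", "-r", "300", "-gray", "-singlefile",
         str(page_pdf), str(stem)],
        ["gocr", "-i", f"{stem}.pgm", "-o", f"{stem}.txt"],
    ]


def ocr_pdf(pdf, workdir="ocr_work"):
    # Split the PDF, OCR each page, and return the joined text.
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(burst_command(pdf, workdir), check=True)
    texts = []
    for page_pdf in sorted(workdir.glob("page_*.pdf")):
        for cmd in page_commands(page_pdf):
            subprocess.run(cmd, check=True)
        texts.append(page_pdf.with_suffix(".txt").read_text(errors="replace"))
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` should hand back whatever gocr could make of the pages. Expect to proofread the result.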
I myself use an old copy of Acrobat for that. Amusing anecdote: I once OCRed a scan of a doctor’s typed notes, though, and a lot of it got mangled into a captcha-like form. I guess a doctor’s writing will always be impossible to decode regardless of format.
Well yes, I knew that, and I don’t really need the text. I was just wondering if anyone knew whether this was standard practice for congressional committees, or had any theories on why it is the case.
It’s generally standard practice in government now to use scanned copies of documents rather than the documents themselves. That way, if there’s redacted text, there’s no way to get at it. Before scanned copies became the norm, you’d sometimes see documents that were redacted with a black box, where you could still highlight and copy/paste the text underneath. The black box was just a layer on top of the text, similar to layers in Photoshop/Gimp.
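To make the layering point concrete, here is a toy illustration. The page content below is a hand-written fragment in PDF content-stream syntax, not taken from any real document, and the regex is a crude stand-in for what a real text extractor does. The text is drawn first, then a filled black rectangle is painted over it, but the text operand is still sitting there in the stream.

```python
import re

# Hypothetical page content: draw a line of text, then paint a black
# rectangle over the same area. Nothing removes the text itself.
PAGE_CONTENT = b"""
BT /F1 12 Tf 72 700 Td (Witness name: EXAMPLE PERSON) Tj ET
0 0 0 rg
70 696 220 16 re f
"""


def text_operands(content):
    # Collect the string operand of every Tj (show text) operator.
    # Drawing operators such as "re" and "f" are simply ignored.
    found = re.findall(rb"\((.*?)\)\s*Tj", content)
    return [s.decode("latin-1") for s in found]


print(text_operands(PAGE_CONTENT))
```

A scanned page has no text operators at all, only an image, so there is nothing left underneath the box to pull out.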