I’ve recently had reasons to extract text from PDFs, and it’s a really hard thing to do right. I haven’t turned to OCR yet, but I might have to.
I recently tried to do something simple (“some pages start with a name”), and even getting the first line from each page of an 800-page PDF was a hassle.
I hadn’t worked with PDFs in as long as I could remember, and I’d assumed there would be some kind of regex-like way to search them.
The PDF spec authors tried to overcome this issue by adding what is called “tagged PDF”: if the PDF creation application knows about words, paragraphs, etc., it can include tagging information about them in the PDF.
The way it works is that special instructions are inserted into the content streams that make up a page, and these instructions mark paragraphs, headers, etc. (similar to how XHTML encloses content in a tag). The structural information itself is not stored in the content streams but as standard PDF objects.
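Concretely, those special instructions are the marked-content operators BDC and EMC, which bracket the text belonging to a structure element. A simplified sketch of what that looks like inside a content stream (the font name and MCID value here are made up):

```
/P << /MCID 0 >> BDC      % begin marked content: a paragraph, marked-content ID 0
  BT                      % begin a text object
    /F1 12 Tf             % select font resource F1 at 12pt
    (Hello, tagged world) Tj
  ET                      % end the text object
EMC                       % end marked content
```

The structure tree stored elsewhere in the file (as regular PDF objects) then maps MCID 0 back to a /P structure element, which is how an extractor can recover the paragraph boundaries.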
If a PDF text extraction application knows about tagged PDF, the quality of text extraction should be near perfect.
Any GitHub / FLOSS project recommendation?
Most Linux distros package the Xpdf tools, including pdftotext and pdfimages. I’ve used them; they worked OK for my use case, but YMMV.
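For the “first line of each page” task, this could help: pdftotext (e.g. `pdftotext input.pdf out.txt`) separates pages with a form-feed character, so once you have the text, pulling the first line per page is a few lines of Python. A sketch, with made-up sample text standing in for real pdftotext output:

```python
def first_lines(text: str) -> list[str]:
    # pdftotext emits a form feed (\x0c) between pages
    pages = text.split("\x0c")
    # skip blank pages; strip leading newlines, then take the first line
    return [page.lstrip("\n").split("\n", 1)[0]
            for page in pages if page.strip()]

# Fake two-page extraction result, just to show the shape of the data
sample = "Alice Example\nsome body text\n\x0cBob Example\nmore text\n"
print(first_lines(sample))  # ['Alice Example', 'Bob Example']
```

Whether the first visual line actually comes first in the extracted text depends on the PDF’s internal ordering, which is exactly the hassle described above; the `-layout` flag to pdftotext sometimes helps.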
Once you have a glimpse of the eldritch horror that is the ISO PDF spec, your expectations will be reasonably low. Many real-world PDFs don’t even follow that spec, which may explain why it doesn’t attempt to provide a “method for validating conformance”.
Poppler does a bunch of useful things with PDF files; I used it to extract images from a PDF just this week. It’s also what I’m using for a personal project.
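For image extraction specifically, Poppler ships a pdfimages tool; a typical invocation (the filenames here are placeholders):

```shell
# Extract embedded images from input.pdf, writing files prefixed "img"
# -all keeps each image in its native format (JPEG, PNG, ...) where possible
pdfimages -all input.pdf img

# Just list the images and their properties without extracting
pdfimages -list input.pdf
```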