I'm trying to extract some entries from a PDF, but its formatting makes it hard to parse like a normal document. The text has no consistent positioning, so each entry is laid out differently and I can't find a pattern to rely on. I only want the entry name and the info on the right, not the field name or description.

[Image: the PDF I'm trying to extract data from, in its original format]

I've tried looking for headers and layout info with the PyPDF2 module, but the PDF doesn't seem to contain any metadata beyond basic author info.

My idea was to use the Google Cloud Vision API to transcribe the text, but that raises the problem of positioning: I'd still need to work out where each piece of transcribed text sits on the page.
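To make the positioning problem concrete, here is a minimal sketch of what I imagine doing once I have the OCR output. It assumes the words have already been flattened into `(text, x_left, y_center)` tuples taken from their bounding boxes; that tuple format, the function names, and the `y_tol` and `x_split` values are all hypothetical and would need tuning for the real PDF. It does not call the Cloud Vision API itself.

```python
from typing import List, Tuple

# Hypothetical flattened OCR output: (text, x_left, y_center) in page units.
Word = Tuple[str, float, float]


def group_rows(words: List[Word], y_tol: float = 5.0) -> List[List[Word]]:
    """Cluster words into rows by vertical position.

    A word joins the current row if its y is within y_tol of the first
    word placed in that row; otherwise it starts a new row.
    """
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Order each row left to right so the text reads naturally.
    return [sorted(row, key=lambda w: w[1]) for row in rows]


def right_column(rows: List[List[Word]], x_split: float) -> List[str]:
    """Keep only the text at or to the right of x_split, one string per row."""
    result = []
    for row in rows:
        texts = [text for text, x, _ in row if x >= x_split]
        if texts:
            result.append(" ".join(texts))
    return result


# Made-up sample data standing in for real OCR output.
sample = [
    ("Name", 10, 100), ("Widget", 300, 102), ("A", 350, 101),
    ("Description", 10, 140), ("Blue", 300, 143),
]
values = right_column(group_rows(sample), x_split=200)
```

With the made-up sample, `values` holds the right-hand text of each row and the left-hand field names are dropped. The weak point is that a fixed `x_split` only works if the right-hand column starts at roughly the same x on every page, which is exactly what I'm unsure about with this PDF.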

Does anyone know of a better approach, or failing that, how to handle the positioning with the Cloud Vision API?
