pdfminer doesn't extract data from filled-out pdf form


I'm trying to use pdfminer to extract the filled-out contents in a pdf form. The instructions for accessing the pdf are:

  1. Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
  2. Click "Create Report" next to the fourth report from the top (i.e., Banking Organization Systemic Risk Report (FR Y-15))
  3. Click "Your request for a financial report is ready"

To extract the contents in blue, I copied code from this post:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = 'FRY15_1073757_20160630.PDF'
with open(filename, 'rb') as fp:
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    # Resolve the AcroForm dictionary and take its Fields array
    fields = resolve1(doc.catalog['AcroForm'])['Fields']

    for i in fields:
        field = resolve1(i)
        # 'T' holds the field name, 'V' the field value
        name, value = field.get('T'), field.get('V')
        print('{0}: {1}'.format(name, value))

This didn't extract the data fields as expected: nothing was printed. I tried the same code on another PDF and it worked, so I suspect the failure might have to do with the security settings of the first PDF.

For the second PDF, on which the code worked, the security settings show "Allowed" for all actions. I also tried pdfminer's pdf2txt.py, but the filled-out data in the fields of the original PDF form (which is what I want) was not in the converted text file; only the "flat", non-fillable part of the PDF was converted. Interestingly, if I use Adobe Reader's Save As Text to convert the PDF to a text file, the fillable part is in the converted text file. This is the workaround I've been using.

Any idea how I can extract data directly from the pdf form? Thanks.


There is 1 answer

mkl On

I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.

Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.

While this expectation is often fulfilled, it only covers a special case: form fields are arranged as a tree structure with the Fields array as its root. Your sample document, for example, contains a large tree:

[Image: Fields tree]

Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
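
A minimal sketch of such a descent in Python, not tested against the FR Y-15 file itself. It assumes each intermediate node lists its children under 'Kids', that a kid without a 'T' entry is a widget annotation rather than a field, and that a child without its own 'V' inherits the parent's value. The names walk_fields, to_text and extract_form_values are made up for illustration, and the latin-1 decoding is only an approximation of PDFDocEncoding.

```python
def to_text(value):
    """Best-effort conversion of a PDF string or name object to str."""
    # pdfminer name objects (e.g. checkbox states) carry their text in .name
    value = getattr(value, 'name', value)
    if isinstance(value, bytes):
        if value.startswith(b'\xfe\xff'):  # UTF-16BE with byte order mark
            return value[2:].decode('utf-16-be', 'replace')
        return value.decode('latin-1')  # approximation of PDFDocEncoding
    return None if value is None else str(value)


def walk_fields(fields, resolve, parent_name='', parent_value=None):
    """Yield (fully qualified name, value) for every terminal field."""
    for ref in resolve(fields) or []:
        field = resolve(ref)
        partial = to_text(resolve(field.get('T')))
        if partial is None:
            continue  # no partial name: treat as a widget, not a field
        name = parent_name + '.' + partial if parent_name else partial
        value = resolve(field.get('V'))
        if value is None:
            value = parent_value  # assumed inheritance from the parent
        kids = [resolve(k) for k in resolve(field.get('Kids')) or []]
        child_fields = [k for k in kids if 'T' in k]
        if child_fields:
            # Intermediate node: descend instead of reporting it
            yield from walk_fields(child_fields, resolve, name, value)
        else:
            yield name, to_text(value)


def extract_form_values(path):
    """Read a PDF with pdfminer and return {field name: value}."""
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog['AcroForm'])
        # Consume the generator while the file is still open
        return dict(walk_fields(acroform['Fields'], resolve1))
```

Calling extract_form_values('FRY15_1073757_20160630.PDF') would then return a dictionary keyed by dotted field names. The resolver is passed in as an argument so that the tree walk can also be tried on plain dictionaries without pdfminer installed.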