Is there a way to parse the form fields of signed PDFs e.g. using Python or Java and write them to a CSV?


I would like to parse form fields, for example checkboxes, from signed PDFs. I have already tried several Python libraries (PyPDF2, pikepdf and even pdfminer), but I only get the text out and not the form fields. If someone has an approach for parsing form fields from signed PDFs, it would be my salvation. I'm already thinking about trying OCR, but it seems very complicated to me and there might be an easier way.

Does anyone have an idea how I can parse the form fields out of a signed PDF?

Thanks in advance!


There are 2 answers

3
Jorj McKie

You can extract (but also manipulate) Form Fields with PyMuPDF - whether signed or not:

import fitz  # the PyMuPDF package

doc = fitz.open("your.pdf")
for page in doc:  # iterate over pages
    print()
    print(f"Form fields on page {page.number}")
    for field in page.widgets():  # iterate over form fields on the page
        print(f"field type '{field.field_type_string}', value '{field.field_value}'")
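
Since the question also asks about writing the fields to a CSV, here is a minimal sketch that builds on the loop above using Python's standard csv module. The helper names iter_form_fields and write_fields_csv and the column layout are my own choices for illustration, not part of PyMuPDF, and I have not tested this against signed files specifically.

```python
import csv
import io
from typing import Iterable, Iterator, TextIO, Tuple

# (page number, field name, field type, field value) -- hypothetical row layout
FieldRow = Tuple[int, str, str, str]


def iter_form_fields(pdf_path: str) -> Iterator[FieldRow]:
    """Yield one row per form field (widget) found in the PDF."""
    import fitz  # PyMuPDF; imported lazily so the CSV helper works without it

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for field in page.widgets():
                value = field.field_value
                yield (
                    page.number,
                    field.field_name or "",
                    field.field_type_string,
                    "" if value is None else str(value),
                )
    finally:
        doc.close()


def write_fields_csv(rows: Iterable[FieldRow], out: TextIO) -> int:
    """Write a header plus one CSV line per row; return the number of rows."""
    writer = csv.writer(out)
    writer.writerow(["page", "field_name", "field_type", "field_value"])
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

To use it, open the target file with open("fields.csv", "w", newline="", encoding="utf-8") and call write_fields_csv(iter_form_fields("your.pdf"), f). Keeping the CSV writer separate from the extraction step means the same writer can be reused with rows coming from another library.
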
2
Joris Schellekens

disclaimer: I am the author of borb, the library used in this answer.

It's unclear what exactly you want. I see two possibilities:

  1. You want to extract information from the form fields in the PDF
  2. Your PDF is signed and then scanned, you want to extract an image of the signature

Either option is possible using borb.

If you want to extract information from the form fields, I would recommend you look at section 4.4 of the examples repository. I'll post the example here for the sake of completeness.

import typing

from borb.pdf import Document
from borb.pdf import PDF


def main():

    # open the document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)
    assert doc is not None

    # read the form field values by field name
    print("Name: %s" % doc.get_page(0).get_form_field_value("name"))
    print("Firstname: %s" % doc.get_page(0).get_form_field_value("firstname"))
    print("Country: %s" % doc.get_page(0).get_form_field_value("country"))


if __name__ == "__main__":
    main()

This example reads an input PDF, and then fetches the values of the form fields.

You can also do more low-level manipulations. borb represents the PDF as a JSON-like data structure (nested arrays, dictionaries and primitives), so you can get at the information relatively easily.
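
To give an idea of what walking such a nested structure can look like, here is a rough sketch in plain Python. It does not use borb's own classes: the acro_form dictionary is made-up sample data, walk_fields is a hypothetical helper, and I am assuming the parsed objects can be treated like ordinary dicts and lists. The keys T (partial field name), V (value), FT (field type) and Kids (child fields or widgets) are the ones the PDF specification uses for form fields. Inherited values are ignored for simplicity.

```python
from typing import Any, Dict, Iterator, Tuple


def walk_fields(node: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (fully qualified name, value) for each terminal field under node."""
    name = node.get("T")
    full = f"{prefix}.{name}" if prefix and name else (name or prefix)
    # Kids without a T entry are widget annotations, not child fields
    kids = [kid for kid in node.get("Kids", []) if "T" in kid]
    if kids:
        for kid in kids:
            yield from walk_fields(kid, full)
    elif full:
        yield full, node.get("V")


# Made-up sample data shaped like an AcroForm field tree
acro_form = {
    "Fields": [
        {
            "T": "person",
            "Kids": [
                {"T": "name", "FT": "Tx", "V": "Ada"},
                {"T": "agree", "FT": "Btn", "V": "Yes", "Kids": [{"Subtype": "Widget"}]},
            ],
        },
        {"T": "country", "FT": "Ch", "V": "Belgium"},
    ]
}

result = [pair for field in acro_form["Fields"] for pair in walk_fields(field)]
print(result)
```
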

If you want to apply OCR to a PDF, I would recommend yet another example in the examples repository. This time in section 7.2.

import typing
from pathlib import Path

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup


def main():

    # set up everything for OCR
    # (point this at the directory holding your own tesseract data files)
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    ocr_listener: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)

    # read the Document, passing the OCR listener to the loader
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [ocr_listener])

    assert doc is not None

    # store the Document
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)


if __name__ == "__main__":
    main()