Extracting field labels and details from IRS XFA/AcroForm using Python


I am working with IRS (U.S. Internal Revenue Service) forms, which are PDFs built as XFA forms or AcroForms. I want to extract not only the field names but also the corresponding field labels, that is, the text that tells the user what to enter in each field.

I understand that libraries such as PyPDF2 and Aspose-PDF can be used to extract form field details in Python. However, these libraries seem to only provide the field names (like "f1_01"), and I haven't found a way to extract the corresponding field labels (the text displayed to the user on the form, such as "First Name") using these libraries.

For instance, in a form with a field labeled "First Name" that corresponds to "f1_01", I want to map "First Name" to "f1_01".

Could anyone suggest a method, or a different Python library, that can extract this information from an IRS form? I would greatly appreciate any pointers in the right direction. Note that I cannot use iText due to license constraints.

Link to IRS form: https://www.irs.gov/pub/irs-pdf/f1065sk3.pdf

Thank you!

Here is the PyPDF2 code that I tried:

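Code1:
import PyPDF2 as pypdf

def findInDict(needle, haystack):
    """Recursively search nested dictionaries for the key `needle`."""
    for key in haystack.keys():
        try:
            value = haystack[key]
        except Exception:
            # Skip entries that cannot be resolved.
            continue
        if key == needle:
            return value
        if isinstance(value, dict):
            x = findInDict(needle, value)
            if x is not None:
                return x
    return None

with open("input.pdf", "rb") as pdfobject:
    pdf = pypdf.PdfReader(pdfobject)
    xfa = findInDict('/XFA', pdf.resolved_objects)
    # When /XFA is an array, it alternates packet names and streams.
    # Index 7 is the stream dumped here; its position can differ between PDFs.
    xml = xfa[7].get_object().get_data()

with open('output.xml', 'w', encoding='utf-8') as f:
    f.write(xml.decode('utf-8'))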
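One possible direction, given the XML written to output.xml above: in an XFA template packet, a field's visible or accessible label usually sits inside the <field> element itself, in <assist><speak>, <assist><toolTip>, or <caption>. The sketch below walks such XML with the standard library and maps each field name to the first of those texts it finds. The function name field_labels and the SAMPLE document are made up for illustration, and it is an assumption that the dumped packet is the template and that the form actually fills in these elements.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the XML namespace, e.g. '{ns}field' -> 'field'."""
    return tag.rsplit("}", 1)[-1]

def field_labels(xml_text):
    """Map XFA field names to label text found inside each <field>.

    Preference order (an assumption, adjust as needed):
    <speak>, then <toolTip>, then <caption>.
    """
    root = ET.fromstring(xml_text)
    labels = {}
    for elem in root.iter():
        if local(elem.tag) != "field" or "name" not in elem.attrib:
            continue
        label = None
        for wanted in ("speak", "toolTip", "caption"):
            for node in elem.iter():
                if local(node.tag) == wanted:
                    # Join nested text and collapse whitespace.
                    text = " ".join("".join(node.itertext()).split())
                    if text:
                        label = text
                        break
            if label:
                break
        labels[elem.attrib["name"]] = label
    return labels

# Invented sample that mimics the shape of an XFA template packet.
SAMPLE = """<template xmlns="http://www.xfa.org/schema/xfa-template/3.0/">
  <subform name="topmostSubform">
    <field name="f1_01">
      <assist><speak>First name</speak></assist>
    </field>
    <field name="f1_02">
      <caption><value><text>Last name</text></value></caption>
    </field>
  </subform>
</template>"""

if __name__ == "__main__":
    for name, label in field_labels(SAMPLE).items():
        print(name, "->", label)
```

For a real form, the contents of output.xml would be passed in instead of SAMPLE. Fields whose label is drawn as separate static text on the page, rather than stored in the field, would come back as None with this approach.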

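Code2:
from PyPDF2 import PdfReader

def scan_fields(path):
    """Print the name of every form field in the PDF."""
    pdf = PdfReader(path)
    # get_fields() returns None when the PDF has no form fields.
    fields = pdf.get_fields() or {}
    for key in fields:
        print(key)

scan_fields('input.pdf')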
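On the AcroForm side, the PDF specification defines an optional /TU entry on each field dictionary: an alternate, user-facing field name that viewers typically show as a tooltip. If the form populates it, it may be close to the label being asked for. The sketch below reads /TU from a mapping shaped like the one get_fields() returns; the helper name labels_from_fields and the sample data are hypothetical, and whether a given IRS form fills in /TU is an assumption to verify.

```python
def labels_from_fields(fields):
    """Map each field name to its /TU (alternate name) entry, or None.

    `fields` is any mapping of field name -> dict-like field object,
    such as the result of PyPDF2's PdfReader.get_fields().
    """
    labels = {}
    for name, field in fields.items():
        tu = field.get("/TU")
        labels[name] = str(tu) if tu is not None else None
    return labels

# Invented stand-in for get_fields() output, so the sketch runs without a PDF.
sample_fields = {
    "f1_01": {"/T": "f1_01", "/FT": "/Tx", "/TU": "First name"},
    "f1_02": {"/T": "f1_02", "/FT": "/Tx"},  # no /TU entry
}

if __name__ == "__main__":
    for name, label in labels_from_fields(sample_fields).items():
        print(name, "->", label)
```

With a real file this would be called as labels_from_fields(PdfReader('input.pdf').get_fields() or {}). Fields without a /TU entry map to None, which signals that the label has to be recovered some other way.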

Here is the Aspose-PDF code that I tried:

import aspose.pdf as ap

license = ap.License()
license.set_license("Aspose.TotalProductFamily.lic")

pdfDocument = ap.Document("input.pdf")

# Print the names and value of every form field
for formField in pdfDocument.form.fields:
    # Inspect names and values as needed
    print(f"Partial Field Name : {formField.partial_name}, Full Field Name : {formField.full_name}, Value : {str(formField.value)}")
