I need to extract only specific text from invoice PDF files using Python and store the output in particular Excel columns. The PDF files all have different structures but contain the same fields.
I tried to solve it, but I am not able to extract only the specific values.
Sample PDF file:
I need to extract Invoice ID, Issue Date, Subject and Amount Due from the whole PDF file.
Script I have used so far:
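import re
import PyPDF2

# Read the first page of the PDF and extract its text.
# Note: PdfFileReader/getPage/extractText are the older PyPDF2 names;
# PyPDF2 3.x renamed them to PdfReader, pages[0] and extract_text().
with open('test.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    page = pdf_reader.getPage(0)
    text = str(page.extractText())

# This only finds double-quoted substrings, not the invoice fields.
quotes = re.findall(r'"[^"]*"', text)
print(quotes)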
Your PDF is easy to work with because it has form fields, so you can read the data directly from them.
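A minimal sketch of reading the form fields, assuming a PyPDF2 release that provides `PdfReader` and `get_form_text_fields()` (older releases expose the same thing as `PdfFileReader.getFormTextFields()`). The field names in `FIELD_NAMES` are hypothetical, since the real names depend on how the PDF was authored; print the returned dictionary once to see what your file actually contains.

```python
# Hypothetical mapping from output column to the PDF's form field name.
# Replace the values with the names found in your own PDF.
FIELD_NAMES = {
    "Invoice ID": "invoice_id",
    "Issue Date": "issue_date",
    "Subject": "subject",
    "Amount Due": "amount_due",
}


def read_form_fields(path):
    """Return {field name: value} for the text form fields of a PDF."""
    # Imported here so pick_fields() below works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return reader.get_form_text_fields() or {}


def pick_fields(fields, mapping=FIELD_NAMES):
    """Build one output row; missing or empty fields become None."""
    row = {}
    for column, name in mapping.items():
        value = fields.get(name)
        if isinstance(value, str) and value.strip():
            row[column] = value.strip()
        else:
            row[column] = None
    return row
```

With a real file this would be called as `pick_fields(read_form_fields('test.pdf'))`. A row where every value is `None` suggests that the PDF has no usable form fields and that text extraction is needed instead.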
EDIT: I combined the data you requested (from here: How to extract only specific text from PDF file using python) into a small script with three ways of parsing the PDF (one for each of your three PDFs). The problem is that your PDFs differ a lot and each package works better on some PDFs than on others, so I think you have to combine them. The idea is to try each function in turn until one returns a result. I hope this is a good start for you. You may have to change the regexes if you have more PDF variants, and you may want to store all regexes (per field) in a list and reuse them across the functions, so that you end up with three parsing functions and four lists of regexes used by two of those functions.
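The "try each function until one gets a result" idea, with one list of regexes per field, could be sketched roughly as follows. This is not the original script: the patterns in `PATTERNS` are guesses at how the labels appear in the extracted text, and `extractors` is a hypothetical list of functions that each take a file path and return the page text using one PDF package.

```python
import re

# One list of candidate regexes per field (assumed label formats).
PATTERNS = {
    "Invoice ID": [r"Invoice ID\s*:?\s*(\S+)", r"Invoice\s*#\s*(\S+)"],
    "Issue Date": [r"Issue Date\s*:?\s*(\d{1,2}/\d{1,2}/\d{4})"],
    "Subject": [r"Subject\s*:?\s*(.+)"],
    "Amount Due": [r"Amount Due\s*:?\s*([$€£]?\s*[\d.,]+)"],
}


def match_fields(text, patterns=PATTERNS):
    """Apply each field's regexes in order; keep the first non-empty match."""
    row = {}
    for field, regexes in patterns.items():
        row[field] = None
        for regex in regexes:
            m = re.search(regex, text)
            if m and m.group(1).strip():
                row[field] = m.group(1).strip()
                break
    return row


def count_found(row):
    """Number of fields that have a value."""
    return sum(value is not None for value in row.values())


def parse_invoice(path, extractors, patterns=PATTERNS):
    """Try each extractor until one yields a row with every field filled."""
    best = {field: None for field in patterns}
    for extract in extractors:
        try:
            text = extract(path) or ""
        except Exception:
            continue  # this parser cannot handle the file; try the next one
        row = match_fields(text, patterns)
        if count_found(row) == len(patterns):
            return row
        if count_found(row) > count_found(best):
            best = row  # remember the most complete partial result
    return best
```

Each returned row is a plain dictionary keyed by column name, so the rows for all PDFs can be collected in a list and written to a spreadsheet with whichever Excel library you prefer.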