I have a Python 3 script that extracts text from PDFs using the PyPDF2 module. I am searching for numerical values, for example 534,000.00. However, when I extract the text from the PDF and put it into a string, the number looks like 534,000.00, with the three zeros following the comma as regular zeros (0) but the two after the decimal point as 0̸0̸.
Am I missing something here?
When I copied the 534,000.00 from the PDF into this form, it looked like 534,OOO.00. I'm not sure what is going on.
Sample code for just one PDF:

    import os
    import re

    import PyPDF2

    # file_path is the directory containing the PDFs (defined elsewhere)
    for file in os.listdir(file_path):
        if file[-7:] == "303.PDF":
            with open(file_path + file, 'rb') as pdfobj:
                pdfReader = PyPDF2.PdfFileReader(pdfobj, strict=False)
                num_pages = pdfReader.numPages
                text = ""
                count = 0
                while count < num_pages:
                    pageobj = pdfReader.getPage(count)
                    text += pageobj.extractText()
                    count += 1
            # prints nothing
            if re.search('534,000.00', text):
                print("found it")
            # finds it correctly
            if re.search('534,OOO.00', text):
                print("found it")
Your OOO's are three letters O, as in Object, not the digit zero (0). Don't ask me why someone would use letters instead of digits on purpose, or whether some text-recognition program decided these had to be letters...
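If you want to check this yourself, a small sketch along these lines lists what each character in the extracted string actually is. The helper name describe_chars is made up for illustration, and the sample string is simply the one copied from the question, not real output from your PDF:

```python
import unicodedata

def describe_chars(s):
    """Return (character, code point, Unicode name) for each character in s."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in s
    ]

# Sample string taken from the question; substitute a slice of your own text.
for row in describe_chars("534,OOO.00"):
    print(row)
```

A letter shows up as LATIN CAPITAL LETTER O and a real zero as DIGIT ZERO, so the mix becomes obvious at a glance.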
You could use [0O] in all your regexes to match both.
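As a rough sketch of that idea, assuming you start from the number written with ordinary digits, you could generate the tolerant pattern instead of typing [0O] by hand. The function name zero_tolerant_pattern and the sample text are hypothetical; everything else is the standard re module:

```python
import re

def zero_tolerant_pattern(number_text):
    """Build a regex matching number_text with either 0 or O in each zero position."""
    parts = []
    for ch in number_text:
        if ch in "0O":
            # Accept the digit zero or the capital letter O here.
            parts.append("[0O]")
        else:
            # Escape everything else so '.' matches a literal dot only.
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

pattern = zero_tolerant_pattern("534,000.00")

# Hypothetical extracted text containing letters where zeros were expected.
extracted = "Total: 534,OOO.00"
if pattern.search(extracted):
    print("found it")
```

Escaping the other characters also means the dot no longer matches any character, which the plain '534,000.00' pattern would allow.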