Here is my code. It already scrapes the area I want, but part of the output seems to be encoded. I've tried switching to a different encoding, for example encoding='Latin1', but with no luck. I wanted to see if anyone has any suggestions.
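import tabula

pdf_path = "sample.pdf"

# With relative_area=True, area is [top, left, bottom, right] in percent of the page.
# read_pdf returns a list of DataFrames; [0] takes the first table found.
tables = tabula.read_pdf(
    pdf_path,
    pages="2",
    encoding='utf-8',
    relative_area=True,
    relative_columns=True,
    area=[37, 10, 100, 40],
)[0]

print(tables)
tables.to_csv("output.csv", index=False)
tables.to_excel("Exceltest1.xlsx")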
Example output:
Title: {M{$eamgbrteitrle}}
"Date: {N{\*omv 6er, c2h0a2n3t_signed_date_es\_:signer1:date"
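One way to check whether this is really an encoding problem is to test if the garbled string is two separate strings interleaved character by character, which can happen when a PDF has two text runs drawn on top of each other. The sketch below is only a diagnostic under that assumption: `split_interleaved` is a hypothetical helper, and the guess `"Nov 6, 2023"` is a guess at the visible date, not something confirmed from the PDF.

```python
def split_interleaved(garbled, guess):
    """Remove `guess` from `garbled` as an in-order subsequence.

    Returns the leftover characters, or None if `guess` does not fit.
    """
    leftover = []
    i = 0  # position of the next character of `guess` still to be matched
    for ch in garbled:
        if i < len(guess) and ch == guess[i]:
            i += 1
        else:
            leftover.append(ch)
    return "".join(leftover) if i == len(guess) else None


# The part of the "Date:" line after the label, copied from the output above.
garbled = r"{N{\*omv 6er, c2h0a2n3t_signed_date_es\_:signer1:date"

# Assumed visible value; replace with whatever the page actually shows.
print(split_interleaved(garbled, "Nov 6, 2023"))
```

If the leftover reads like a coherent string (here it looks like a form-field placeholder), the problem is more likely overlapping text in the PDF than a wrong `encoding` argument, and changing the encoding would not be expected to help.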
I've tried working with Beautiful Soup but haven't had much luck. I also tried the following to look at the extracted text directly, and the text looks the same. Maybe I should try OCR?
from pdfminer.high_level import extract_text

# Dump all text pdfminer can extract from the PDF.
text = extract_text("sample.pdf")
print(text)
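For the OCR idea, a minimal sketch could render the page to an image and run OCR on the same region that was passed to tabula. This assumes the third-party packages `pdf2image` and `pytesseract` are installed, along with the Poppler and Tesseract binaries they depend on. `relative_area_to_box` and `ocr_region` are hypothetical helper names, and whether OCR gives cleaner text depends on what is actually visible on the page.

```python
def relative_area_to_box(area, width, height):
    """Convert [top, left, bottom, right] percentages to a PIL crop box.

    PIL expects (left, upper, right, lower) in pixels.
    """
    top, left, bottom, right = area
    return (
        round(width * left / 100),
        round(height * top / 100),
        round(width * right / 100),
        round(height * bottom / 100),
    )


def ocr_region(pdf_path, page, area, dpi=300):
    """Render one page of the PDF and OCR only the given relative area."""
    # Imported here so the helper above works without these packages.
    # Both need external binaries (Poppler for pdf2image, Tesseract for pytesseract).
    from pdf2image import convert_from_path
    import pytesseract

    image = convert_from_path(pdf_path, dpi=dpi, first_page=page, last_page=page)[0]
    box = relative_area_to_box(area, image.width, image.height)
    return pytesseract.image_to_string(image.crop(box))


# Example call, using the same page and area as the tabula code:
# print(ocr_region("sample.pdf", page=2, area=[37, 10, 100, 40]))
```

OCR only sees what is rendered, so it would skip any text that is present in the file but not visible on the page.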