Scraping a table from a PDF using tabula: getting random, seemingly encoded output


Here is my code. It already scrapes the area I want, but part of the output appears to be encoded. I've tried different encodings, for example encoding='Latin1', with no luck. Does anyone have any suggestions?

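import tabula

pdf_path = "sample.pdf"

# read_pdf returns a list of DataFrames; [0] takes the first table found
table = tabula.read_pdf(
    pdf_path,
    pages="2",
    encoding="utf-8",
    relative_area=True,      # area values are percentages of the page size
    relative_columns=True,
    area=[37, 10, 100, 40],  # top, left, bottom, right
)[0]
print(table)

table.to_csv("output.csv", index=False)
table.to_excel("Exceltest1.xlsx")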

Example output:

Title: {M{$eamgbrteitrle}}

"Date: {N{*omv 6er, c2h0a2n3t_signed_date_es_:signer1:date"
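One possible reading of this output is that it is not an encoding problem at all, but two strings interleaved character by character, as if a visible value and a template placeholder sit on top of each other in the PDF. The sketch below is only a way to check that reading on the two sample strings (with the backslash escapes dropped). The helper name split_interleaved and the guessed visible values "Member" and "Nov 6, 2023" are my assumptions, not something taken from the PDF.

```python
def split_interleaved(garbled, visible):
    """Remove `visible` from `garbled` as an in-order subsequence.

    Returns the leftover characters, or None if `visible` does not
    occur in `garbled` as a subsequence. Matching is greedy (first hit wins).
    """
    leftover = []
    i = 0  # index of the next character of `visible` still to be matched
    for ch in garbled:
        if i < len(visible) and ch == visible[i]:
            i += 1
        else:
            leftover.append(ch)
    return "".join(leftover) if i == len(visible) else None


# The guessed visible values are assumptions made for illustration.
print(split_interleaved("{M{$eamgbrteitrle}}", "Member"))
# -> {{$agrtitle}}
print(split_interleaved(
    "{N{*omv 6er, c2h0a2n3t_signed_date_es_:signer1:date", "Nov 6, 2023"))
# -> {{*merchant_signed_date_es_:signer1:date
```

If the leftovers look like placeholder tags, as they do here, changing the encoding argument is unlikely to help, because the characters themselves are being read correctly and only their order is mixed.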

I've tried working with Beautiful Soup but haven't had much luck. I also extracted the raw text with pdfminer (code below) and it looks the same, so the problem seems to be in the PDF's text layer rather than in tabula. Should I try OCR instead?

from pdfminer.high_level import extract_text

# Dump the raw text layer of the whole PDF for comparison with tabula's output
text = extract_text("sample.pdf")
print(text)
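Since pdfminer shows the same text, one further check before falling back to OCR would be to look at where each character sits on the page. This is a rough sketch, assuming pdfminer.six is installed. The function names char_boxes and overlapping_neighbours and the 0.5 overlap threshold are made up for illustration. Note that pdfminer's page_numbers is zero-based, so page "2" in tabula would be page_number=1 here.

```python
def char_boxes(pdf_path, page_number=0):
    """Collect (char, x0, y0, x1, y1) for every glyph on one page.

    pdfminer is imported lazily so the pure-Python helper below
    can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    boxes = []
    for page in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        boxes.append((obj.get_text(), *obj.bbox))
    return boxes


def overlapping_neighbours(boxes, min_overlap=0.5):
    """Return pairs of consecutive glyphs whose boxes overlap horizontally.

    Overlap is measured as a fraction of the narrower glyph's width;
    the 0.5 default is an arbitrary threshold chosen for this sketch.
    """
    hits = []
    for a, b in zip(boxes, boxes[1:]):
        c1, x0a, y0a, x1a, y1a = a
        c2, x0b, y0b, x1b, y1b = b
        same_line = min(y1a, y1b) - max(y0a, y0b) > 0
        shared = min(x1a, x1b) - max(x0a, x0b)
        narrower = min(x1a - x0a, x1b - x0b)
        if same_line and narrower > 0 and shared / narrower >= min_overlap:
            hits.append((c1, c2))
    return hits


# Hand-made boxes: "N" and "{" are stacked, "o" sits further to the right.
sample = [("N", 0, 0, 6, 10), ("{", 1, 0, 5, 10), ("o", 12, 0, 18, 10)]
print(overlapping_neighbours(sample))
```

On the real file this would be called as overlapping_neighbours(char_boxes("sample.pdf", page_number=1)). Many hits inside the garbled fields would suggest two text objects drawn on top of each other, in which case OCR on a rendered image of the page, which only sees the visible layer, may be a reasonable next step.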