Scraping table from a PDF using tabula. Getting random encoded output

Asked by Gallo1999 on 10 November 2023 at 00:32 (35 views)

Here is my code. It already scrapes the area I want, but part of the output seems to be encoded. I've tried different encodings, for example encoding='Latin1', but with no luck. Does anyone have any suggestions?

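import tabula

pdf_path = "sample.pdf"

# read_pdf returns a list of DataFrames; [0] keeps the first table found
tables = tabula.read_pdf(
    pdf_path,
    pages="2",
    encoding='utf-8',
    relative_area=True,      # area values are percentages of the page
    relative_columns=True,
    area=[37, 10, 100, 40],  # top, left, bottom, right
)[0]
print(tables)

tables.to_csv("output.csv", index=False)
tables.to_excel("Exceltest1.xlsx")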

Example output:

Title: {M{$eamgbrteitrle}}

"Date: {N{\*omv 6er, c2h0a2n3t_signed_date_es\_:signer1:date"

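One possible reading of the example output, offered as a guess rather than a confirmed diagnosis: the Date value looks like two strings interleaved character by character, not like a wrong encoding. That can happen when a PDF has a second text layer sitting on top of the visible text. The sketch below checks this idea on the Date line only. `split_interleaved` is a hypothetical helper, and the visible date "Nov 6, 2023" is an assumption; it is not confirmed anywhere in the post.

```python
from typing import Tuple


def split_interleaved(garbled: str, guess: str) -> Tuple[bool, str]:
    """Greedily remove `guess` from `garbled` as an in-order subsequence.

    Returns (found, leftover): `found` is True when every character of
    `guess` was matched in order; `leftover` is everything not matched.
    """
    leftover = []
    i = 0  # index of the next character of `guess` still to be matched
    for ch in garbled:
        if i < len(guess) and ch == guess[i]:
            i += 1
        else:
            leftover.append(ch)
    return i == len(guess), "".join(leftover)


# Date value from the example output, written without backslash escapes.
garbled_date = "{N{*omv 6er, c2h0a2n3t_signed_date_es_:signer1:date"

# ASSUMPTION: the date visible on the page is "Nov 6, 2023".
found, leftover = split_interleaved(garbled_date, "Nov 6, 2023")
print(found)     # True
print(leftover)  # {{*merchant_signed_date_es_:signer1:date
```

If that guess is right, the leftover resembles a form or e-signature placeholder tag, which would point to overlapping text in the PDF itself. In that case changing `encoding=` would not be expected to help.
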
I've tried working with Beautiful Soup but haven't had much luck. I also extracted the text with pdfminer (code below) to see what it looks like, and it comes out garbled in the same way. Should I try OCR instead?

from pdfminer.high_level import extract_text

# Dump the raw text layer to compare it with tabula's output
text = extract_text("sample.pdf")
print(text)
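Before reaching for OCR, it may be worth checking whether the page contains characters stacked on top of each other, since pdfminer and tabula would both read such a text layer the same way. This is a minimal sketch, assuming pdfminer.six is installed. `chars_on_page` and `overlapping_pairs` are hypothetical helper names, and the 0.5 overlap threshold is an arbitrary choice.

```python
from typing import List, Tuple

# (character, x0, y0, x1, y1) in PDF points
CharBox = Tuple[str, float, float, float, float]


def chars_on_page(pdf_path: str, page_index: int) -> List[CharBox]:
    """Collect every character and its bounding box from one page (0-based)."""
    # Imported lazily so overlapping_pairs() can be used without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    boxes: List[CharBox] = []
    for page in extract_pages(pdf_path, page_numbers=[page_index]):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        boxes.append((obj.get_text(), *obj.bbox))
    return boxes


def overlapping_pairs(
    boxes: List[CharBox], min_overlap: float = 0.5
) -> List[Tuple[str, str]]:
    """Return character pairs whose boxes intersect by at least
    `min_overlap` of the smaller box's area."""
    pairs = []
    for a in range(len(boxes)):
        ca, ax0, ay0, ax1, ay1 = boxes[a]
        for b in range(a + 1, len(boxes)):
            cb, bx0, by0, bx1, by1 = boxes[b]
            width = min(ax1, bx1) - max(ax0, bx0)
            height = min(ay1, by1) - max(ay0, by0)
            if width <= 0 or height <= 0:
                continue  # the boxes do not intersect
            smaller = min((ax1 - ax0) * (ay1 - ay0), (bx1 - bx0) * (by1 - by0))
            if smaller > 0 and width * height >= min_overlap * smaller:
                pairs.append((ca, cb))
    return pairs


# Usage (page 2 of the PDF is index 1):
# pairs = overlapping_pairs(chars_on_page("sample.pdf", 1))
# print(len(pairs), pairs[:20])
```

Many overlapping pairs would hint at stacked text layers. OCR on a rendered image of the page would then only see whatever is actually drawn visibly, which may or may not be the text that is wanted.
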


There are 0 answers
