Two columns of PDF are coming as one while trying to read it using tabula-py

78 views Asked by Kartikey At 16 December 2023 at 10:23

I am trying to read an input PDF using tabula-py. Below is the script I am using:

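import tabula

pdf_path = '/Users/kartikeysinha/Desktop/local-git-repository/TestPDF.pdf'
# guess=False turns off tabula's automatic detection of the table area
tables_ = tabula.read_pdf(pdf_path, pages='all', guess=False)
df = tables_[0]

# Use the first extracted row as the column header
df.columns = df.iloc[0]
column_names = df.columns
print(column_names)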

I am getting the following output as column names:

Index([             'App ID Xref',                        nan,
                           'Date',                   'Broker',
       'Sub Broker Borrower Name',              'Description',
                              nan,                   'Amount',
                           'Rate',                  'Upfront',
                            'GST'],
      dtype='object', name=0)

As the output shows, the first element, 'App ID Xref', is actually two different columns in my PDF (the same goes for 'Sub Broker' and 'Borrower Name'). Is there any way to work around this without changing the structure of the PDF I have received?

Below is the attached image of the column header in the PDF.
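For the merged columns, one thing that may be worth trying is tabula-py's `columns` option, which takes a list of x-coordinates (in PDF points) at which to split columns instead of relying on automatic detection. The sketch below is only an illustration: the helper names `column_boundaries` and `read_with_columns` are made up here, and the coordinates in the usage comment are placeholders that would have to be measured from the actual PDF.

```python
def column_boundaries(edges):
    """Return the split positions as sorted, de-duplicated floats (PDF points)."""
    return sorted({float(x) for x in edges})


def read_with_columns(pdf_path, edges, pages="all"):
    """Read tables while forcing column splits at the given x-coordinates.

    Hypothetical helper: it only wraps tabula.read_pdf with its `columns` option.
    """
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        guess=False,   # do not auto-detect the table area
        stream=True,   # whitespace-based extraction, which honours `columns`
        columns=column_boundaries(edges),
    )


# Usage (placeholder coordinates, NOT taken from the real PDF):
# tables = read_with_columns(pdf_path, [60, 95, 140, 200])
```

If 'App ID' and 'Xref' sit very close together, a boundary placed between them should, in principle, keep them apart; the exact positions need some trial and error.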

Another small issue: column names such as Settlement Date, Total Loan Amount, Common Rate... are only partially captured. How can I fix it?
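One possible cause of the truncated names is that the header wraps over more than one line in the PDF, so tabula returns it as several rows and `df.iloc[0]` only picks up the first. Assuming that is the case here, a minimal sketch for joining the header rows column by column could look like this. The function name `merge_header_rows` and the sample rows are illustrative, not taken from the real file.

```python
import math


def _is_blank(cell):
    """True for None, NaN, or an empty/whitespace-only string."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip() == ""


def merge_header_rows(rows):
    """Join the non-blank cells of each column, top to bottom, with a space."""
    merged = []
    for column_cells in zip(*rows):
        parts = [str(c).strip() for c in column_cells if not _is_blank(c)]
        merged.append(" ".join(parts))
    return merged


# Made-up example of a header that was split across two extracted rows
header_rows = [
    ["Settlement", "Total Loan", float("nan"), "Broker"],
    ["Date", "Amount", "Rate", None],
]
print(merge_header_rows(header_rows))
# ['Settlement Date', 'Total Loan Amount', 'Rate', 'Broker']
```

With a DataFrame, the same idea would be `df.columns = merge_header_rows(df.iloc[:2].values.tolist())` followed by `df = df.iloc[2:].reset_index(drop=True)`, where the number of header rows (2 here) is an assumption to check against the extracted data.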

Any help would be great, Thanks & Regards

