Two PDF columns are read as one column when using tabula-py


I am trying to read an input PDF using tabula-py. Below is the script I am using:

import tabula

pdf_path = '/Users/kartikeysinha/Desktop/local-git-repository/TestPDF.pdf'
tables_ = tabula.read_pdf(pdf_path, pages='all', guess=False)
df = tables_[0]

# Use the first extracted row as the column names
df.columns = df.iloc[0]
column_names = df.columns
print(column_names)

And I am getting the following output as the column names:

Index([             'App ID Xref',                        nan,
                           'Date',                   'Broker',
       'Sub Broker Borrower Name',              'Description',
                              nan,                   'Amount',
                           'Rate',                  'Upfront',
                            'GST'],
      dtype='object', name=0)

In the output, the first element, 'App ID Xref', actually corresponds to two different columns in my PDF. The same is true of 'Sub Broker Borrower Name', which should be Sub Broker and Borrower Name. Is there any way I can work around this without changing the structure of the PDF I have received?
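One direction that may be worth trying is giving tabula-py explicit column boundaries through the `columns` option of `tabula.read_pdf`, together with `stream=True` and `guess=False`. The sketch below is only an illustration: the helper names and the boundary values are hypothetical, and the real x-coordinates would have to be measured in the actual PDF.

```python
def build_read_options(column_boundaries):
    """Keyword arguments for tabula.read_pdf with explicit column splits."""
    return {
        "pages": "all",
        "stream": True,   # whitespace-based extraction, no ruling lines needed
        "guess": False,   # do not let tabula guess the table area
        # x-coordinates of the column boundaries, left to right
        "columns": sorted(float(x) for x in column_boundaries),
        # keep the header row as ordinary data so it can be inspected
        "pandas_options": {"header": None},
    }


def read_with_boundaries(pdf_path, column_boundaries):
    """Read every page of the PDF using the given column boundaries."""
    # Imported lazily so build_read_options works without tabula-py/Java.
    import tabula

    return tabula.read_pdf(pdf_path, **build_read_options(column_boundaries))


# Placeholder values only (assumption): replace with coordinates from your PDF.
HYPOTHETICAL_BOUNDARIES = [60.0, 95.0, 140.0]
```

Calling `read_with_boundaries(pdf_path, HYPOTHETICAL_BOUNDARIES)` would return a list of DataFrames, as in the script above. Whether this separates the merged header depends on how the text is laid out in the file.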

The attached image shows the column header row in the PDF.

Another small issue: column names such as Settlement Date, Total Loan Amount, Common Rate, and so on contain only part of their names. How can I fix this?
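A possible cause, which is only an assumption here, is that those headers wrap onto two lines in the PDF, so tabula returns them as two separate rows and only one of them ends up as the column name. If so, the header rows could be joined column by column. The sketch below uses plain Python lists and made-up sample rows; `is_blank` and `merge_header_rows` are hypothetical helper names.

```python
import math


def is_blank(cell):
    """True for None, NaN, or whitespace-only text."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return str(cell).strip() == ""


def merge_header_rows(header_rows):
    """Join the cells of each column from top to bottom, skipping blanks."""
    merged = []
    for column_cells in zip(*header_rows):
        parts = [str(c).strip() for c in column_cells if not is_blank(c)]
        merged.append(" ".join(parts))
    return merged


# Made-up rows shaped like a header that wraps onto two lines.
rows = [
    ["Settlement", "Total Loan", float("nan")],
    ["Date", "Amount", "Rate"],
]
print(merge_header_rows(rows))
```

With a DataFrame, the rows could come from something like `df.iloc[:2].values.tolist()`, after which the merged names are assigned to `df.columns` and the header rows dropped with `df.iloc[2:]`. The number of header rows (two here) is a guess and should be checked against the extracted data.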

Any help would be great. Thanks and regards.
