I am trying to read an input PDF using tabula-py. Below is the script I am using:
import tabula
pdf_path = '/Users/kartikeysinha/Desktop/local-git-repository/TestPDF.pdf'
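# read_pdf returns a list of DataFrames, one per detected table
tables_ = tabula.read_pdf(pdf_path, pages='all', guess=False)
df = tables_[0]
# Use the first extracted row as the column names
df.columns = df.iloc[0]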
column_names = df.columns
print(column_names)
I am getting the following output as column names:
Index([ 'App ID Xref', nan,
'Date', 'Broker',
'Sub Broker Borrower Name', 'Description',
nan, 'Amount',
'Rate', 'Upfront',
'GST'],
dtype='object', name=0)
However, as you can see in the output, the first element, App ID Xref, is actually two different columns in my PDF (the same goes for Sub Broker and Borrower Name).
Is there any way I can work around this without changing the structure of the PDF I have received?
Below is the attached image of the column header in the PDF:
Another small issue: column names such as Settlement Date, Total Loan Amount, Common Rate... contain only part of their names. How can I fix this?
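To make the partial-name issue concrete, here is a minimal sketch of the result I am after. It assumes the header text is split across the first two extracted rows, which I have not confirmed is how tabula lays it out. merge_header_rows is a hypothetical helper of my own, not part of tabula-py or pandas, and the fragments in header_rows are made up for illustration.

```python
import math


def merge_header_rows(rows):
    """Join header fragments column by column, skipping empty or NaN cells."""
    merged = []
    for cells in zip(*rows):
        parts = []
        for cell in cells:
            # Skip missing values: None or float NaN (what pandas uses for empty cells)
            if cell is None or (isinstance(cell, float) and math.isnan(cell)):
                continue
            text = str(cell).strip()
            if text:
                parts.append(text)
        merged.append(" ".join(parts))
    return merged


# Made-up fragments; in the real script these could come from
# df.iloc[:2].values.tolist() if the header really spans two rows (assumption)
header_rows = [
    ["Settlement", "Total Loan", "Common"],
    ["Date", "Amount", "Rate"],
]
print(merge_header_rows(header_rows))
```

If something like this is the right direction, the merged list could be assigned to df.columns and the header rows dropped afterwards.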
Any help would be great. Thanks & Regards