I have around 970 PDF files with the same format, and I want to extract the table from each of them. After some research I am able to extract the table using tabula's `area` argument. Unfortunately, the area parameters are not the same for each PDF, so I cannot simply iterate over the files. If anyone can help me automate finding the area arguments for each PDF, it would be a great help.
As you can see in the image, I have to use `area`, otherwise the junk in the header is also parsed. Here is the script I can run successfully for the first PDF, but I need to extract from 970 files, which is not feasible manually. Please help!
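```python
"""
@author: Jiku-tlenova
"""
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
import re
import PyPDF2 as rdpdf
import tabula

path = "/codes/python/"
os.chdir(path)
from convert_pdf_to_txt import convert_pdf_to_txt  # local helper module
os.getcwd()
pa = "s/"
os.chdir(path + pa)
files = os.listdir(".")
# tabula area in points: [top, left, bottom, right], tuned by hand for the first PDF
ar = [187.65, 66.35, 606.7, 723.11]
tablist = []
for file in files:
    i = 0
    pgnum = 2; endval = 0
    weind = re.findall(r"\d+", file)  # raw string so \d is not treated as an escape
    print(file)
    reader = rdpdf.PdfFileReader(file)  # named PdfReader in PyPDF2 3.x
    while endval == 0:
        # the table starts on page 2, so read page i + 2
        # (spreadsheet=True dropped: it is the deprecated alias of lattice=True)
        table0 = tabula.read_pdf(file, pages=i + 2, multiple_tables=False, lattice=True, area=ar)  # pandas_options={'header': 'infer'}
        if isinstance(table0, list):  # newer tabula-py versions may return a list
            table0 = table0[0]
        table0 = table0.dropna(how="all", axis=1)
        # formatting headers: the header text is split over the first two rows
        head = (table0.iloc[0, :] + table0.iloc[1, :]).T
        table0.columns = head
        table0 = table0.drop([0, 1])
        table0 = table0.iloc[:-1]  # delete last row - not needed
        mys = table0[table0.columns[-1]]
        val = mys.isnull().all()
        if val:  # last column empty: this was the final page of the table
            endval = 1
        tablist.append(table0)
        i = i + 1
```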
Finally managed to do it myself. I basically took the code from R and called it through a wrapper. It seems the R support community is much more active on Stack Overflow than the Python one. Thanks.
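For anyone who wants to stay in Python, one possible way to find the `area` per file is to let a layout library detect the table's bounding box and convert it to tabula's `[top, left, bottom, right]` order. The sketch below is not the R-based solution mentioned above; it swaps in `pdfplumber` (a separate third-party library) as an assumption, and the helper names `bbox_to_tabula_area` and `find_table_area` and the `margin` parameter are made up for illustration. It assumes the table has ruling lines that `pdfplumber` can detect and that the page is not rotated, so check the result against a hand-measured area on a few files first.

```python
def bbox_to_tabula_area(bbox, margin=2.0):
    """Convert an (x0, top, x1, bottom) box in PDF points, measured from
    the top-left of the page, into tabula's [top, left, bottom, right]."""
    x0, top, x1, bottom = bbox
    # Pad the box slightly so border lines are not clipped; clamp at the page edge.
    return [max(top - margin, 0.0), max(x0 - margin, 0.0),
            bottom + margin, x1 + margin]


def find_table_area(pdf_path, page_number, margin=2.0):
    """Return a tabula-style area for the largest table detected on a
    1-based page number, or None if no table is found."""
    import pdfplumber  # third-party; imported here so the converter above has no dependencies

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number - 1]  # pdf.pages is a 0-indexed list
        tables = page.find_tables()
        if not tables:
            return None
        # Assumption: the data table is the largest detected table on the page.
        largest = max(
            tables,
            key=lambda t: (t.bbox[2] - t.bbox[0]) * (t.bbox[3] - t.bbox[1]),
        )
        return bbox_to_tabula_area(largest.bbox, margin)
```

In the loop from the question, `ar` could then be computed per file and page, for example `ar = find_table_area(file, i + 2)`, falling back to the hand-tuned values when it returns `None`.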