Extracting tables from thousands of PDFs using the tabula area argument

Question


Asked by Pri on 30 September 2020 at 12:27 · 274 views

I have around 970 PDF files in the same format and I want to extract the table from each of them. After some research I am able to extract a table using tabula's area argument. Unfortunately, the area parameters are not the same for each PDF, so I cannot simply iterate over the files with one fixed area. If anyone can help me automate finding the area arguments for each PDF, it would be a great help.

As you can see in the image, I have to use area; otherwise the junk in the header is also parsed. Here is the script that runs successfully for the first PDF, but I need to extract from 970 files, which is not feasible manually. Please help!

```python
"""
@author: Jiku-tlenova
"""
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import PyPDF2 as rdpdf
import tabula

path = "/codes/python/"
os.chdir(path)
from convert_pdf_to_txt import convert_pdf_to_txt  # local helper module

os.getcwd()
pa = "s/"
os.chdir(path + pa)

files = os.listdir(".")
# Table region in points: [top, left, bottom, right]
ar = [187.65, 66.35, 606.7, 723.11]

tablist = []

for file in files:
    i = 0
    endval = 0
    weind = re.findall(r"\d+", file)  # digits found in the file name
    print(file)
    reader = rdpdf.PdfFileReader(file)
    while endval == 0:
        # Tables start on page 2; lattice=True replaces the deprecated spreadsheet=True
        table0 = tabula.read_pdf(file, pages=i + 2, multiple_tables=False,
                                 lattice=True, area=ar)  # pandas_options={'header': 'infer'}
        if isinstance(table0, list):  # newer tabula-py versions return a list of DataFrames
            table0 = table0[0]
        table0 = table0.dropna(how="all", axis=1)

        # Format headers: combine the first two rows into one header row
        head = (table0.iloc[0, :] + table0.iloc[1, :]).T
        table0.columns = head
        table0 = table0.drop([0, 1])
        table0 = table0.iloc[:-1]  # delete last row - not needed
        mys = table0[table0.columns[-1]]
        val = mys.isnull().all()

        # Stop once the last column of the page is entirely empty
        if val:
            endval = 1
        tablist.append(table0)
        i = i + 1
```
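
The header step in the script above adds rows 0 and 1 with `+`, which yields NaN for any column where one of the two cells is empty. A minimal sketch of a NaN-safe version follows. The helper name `merge_header_rows` and the `sep` parameter are hypothetical and not part of the original script, and joining with a space is an assumption about the desired header text.

```python
import pandas as pd


def merge_header_rows(table: pd.DataFrame, n_rows: int = 2, sep: str = " ") -> pd.DataFrame:
    """Join the first n_rows rows of each column into a column name.

    Empty (NaN/None) header cells are skipped instead of turning the
    whole column name into NaN. The header rows are dropped from the body.
    """
    header = [
        sep.join(str(v) for v in table.iloc[:n_rows, j] if pd.notna(v)).strip()
        for j in range(table.shape[1])
    ]
    body = table.iloc[n_rows:].reset_index(drop=True)
    body.columns = header
    return body


if __name__ == "__main__":
    # Toy table: two header rows, one data row
    raw = pd.DataFrame(
        [["Week", "Sales", None], ["no.", None, "Total"], ["1", "10", "20"]]
    )
    print(merge_header_rows(raw))
```

Passing `sep=""` reproduces the plain concatenation used in the script, minus the NaN problem.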

Original Q&A

There is 1 answer

**Pri** · Answer 1 · 2020-10-08T05:25:30+00:00


I was finally able to do it myself. Basically, I took code from R and called it through a wrapper. The R support community seems to be much more active on Stack Overflow than the Python one. Thanks.
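
The answer does not include the code it refers to. As one possible Python-only alternative (not the R-based solution mentioned above), the sketch below asks tabula for JSON output without an `area`, picks the largest detected table, and returns its bounding box as an `area` list. It assumes that each table in tabula's JSON output carries `top`, `left`, `width` and `height` keys, that the real table is larger than the header junk, and that lattice detection finds it at all. The function names `largest_table_area` and `detect_area` and the `margin` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def largest_table_area(tables: Sequence[Dict], margin: float = 2.0) -> Optional[List[float]]:
    """Return [top, left, bottom, right] of the largest table, or None if none found.

    `tables` is assumed to be tabula's JSON output: one dict per table with
    "top", "left", "width" and "height" given in points.
    """
    if not tables:
        return None
    biggest = max(tables, key=lambda t: t["width"] * t["height"])
    top, left = biggest["top"], biggest["left"]
    return [
        max(top - margin, 0.0),            # top, padded but not above the page
        max(left - margin, 0.0),           # left
        top + biggest["height"] + margin,  # bottom
        left + biggest["width"] + margin,  # right
    ]


def detect_area(pdf_path: str, page: int = 2) -> Optional[List[float]]:
    """Detect the table area on one page of one PDF (requires tabula-py and Java)."""
    import tabula  # imported here so largest_table_area works without tabula installed

    tables = tabula.read_pdf(pdf_path, pages=page, lattice=True, output_format="json")
    return largest_table_area(tables)


if __name__ == "__main__":
    # Synthetic detection result: a small header box and the real table
    detected = [
        {"top": 10.0, "left": 50.0, "width": 100.0, "height": 20.0},
        {"top": 100.0, "left": 50.0, "width": 400.0, "height": 300.0},
    ]
    print(largest_table_area(detected))
```

In a loop over the files, the result of `detect_area(file)` could be passed as `area=` to `tabula.read_pdf`, falling back to a fixed area or skipping the file when it returns `None`.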
