I need to extract data from a single-page PDF file, which has the following structure:
The number of subcolumns may vary from column to column, as may the number of rows. Some of the columns may also have missing (empty) data.
*For clarity, the sub-subcolumns are omitted from the structure (each subcolumn always has 3 sub-subcolumns).
The code I use is this:
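import tabula.io as tb
import pandas as pd

def toPDFPag2(pathPDF, nPag, pathxlsx):
    # With multiple_tables=True, read_pdf returns a list of DataFrames
    table = tb.read_pdf(pathPDF, pages=nPag, multiple_tables=True)
    # Stack all extracted tables into a single DataFrame
    df = pd.concat(table)
    #df.to_excel(pathxlsx, sheet_name='Sheet 1')
    return df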
It usually works and only needs a bit of manual formatting afterwards.
However, sometimes it fails to extract all of the data: it returns only one row, or it misses rows. What can I do to fix this?
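One direction that might be worth trying, sketched below, is to run the extraction in both of tabula-py's modes (`lattice=True`, which relies on ruling lines between cells, and `stream=True`, which relies on whitespace) and keep whichever result contains more rows. This is only a sketch under assumptions: the helper names `count_rows`, `pick_best` and `extract_best` are hypothetical, and "more rows means a better extraction" is a heuristic that may not hold for every file.

```python
def count_rows(tables):
    """Total number of rows across a list of extracted tables."""
    return sum(len(t) for t in tables)


def pick_best(candidates):
    """Return the (label, tables) pair whose tables hold the most rows.

    On a tie, the first candidate in the list wins.
    """
    return max(candidates, key=lambda item: count_rows(item[1]))


def extract_best(path_pdf, page=1):
    # Imported here so the helpers above work without tabula-py or Java.
    import pandas as pd
    import tabula.io as tb

    candidates = [
        # lattice mode: use the lines drawn between cells to find the grid
        ("lattice", tb.read_pdf(path_pdf, pages=page,
                                multiple_tables=True, lattice=True)),
        # stream mode: use the whitespace between columns instead
        ("stream", tb.read_pdf(path_pdf, pages=page,
                               multiple_tables=True, stream=True)),
    ]
    label, tables = pick_best(candidates)
    if not tables:
        raise ValueError("no tables were detected in either mode")
    # Assumption: concatenating the detected tables is still what is wanted
    return label, pd.concat(tables, ignore_index=True)
```

Calling `extract_best("file.pdf")` (the file name is a placeholder) would return the label of the mode that was kept together with the combined DataFrame, which at least shows whether the missing rows depend on the extraction mode.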