I am stuck again with PDF scraping: some of the values I extract end up under columns they do not belong to. Ultimately I want to obtain a CSV file, but first I want to extract the data into pandas DataFrames in Python. Here are the first and second PDF files from which I want to obtain the data.
Here is a code example, tested on the second PDF provided above. This code is entirely based on a previous Stack Overflow answer by H. José Medeiros Filho, who helped me a lot:
# Thanks to H. José Medeiros Filho for his help! This chunk of code is entirely from him.
import tabula
import pandas

dfs_list = tabula.read_pdf("prices_cars_2017.pdf",
                           pages="4-605",
                           lattice=True,
                           stream=True,
                           pandas_options={'header': None},
                           multiple_tables=True)
df1 = pandas.DataFrame([])
i = 4  # page 4
df_old = None  # if something breaks you might want to check this

for df in dfs_list:
    if i != 4:
        df = df[df[1] != 'MARCA']  # I've tested and know that around page 150 a column will contain this value
    # I've tested and know that some pages have 'VALOR' in a row, so I'm dropping all rows up to and including that one
    idx_valor = df.apply(lambda x: x.eq('VALOR').any(), axis=1).idxmax()
    if idx_valor > 0:
        df = df.iloc[idx_valor + 1:].reset_index(drop=True)
    df = df.dropna(axis=1, how='all')
    df.loc[:, 'page'] = f'Page: {i}'
    if df.shape[1] > 12:  # you can remove this if; I just added it to understand the data better
        print(f'Failed page {i}')
        break
    df.columns = range(1, len(df.columns) + 1)  # columns can't have different labels across pages
    df1 = pandas.concat([df1, df], ignore_index=True)
    df_old = df
    i += 1
header_rows = df1.iloc[2:3, :].astype(str).replace('nan', '')
print('Header')
display(header_rows)  # running in a Jupyter notebook, so display() is available
print('df1')
df1.columns = header_rows.to_numpy().tolist()[0]  # changing the columns to the header values
df1 = df1.iloc[4:, :]  # dropping the header rows
df1
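For reference, once df1 has the right column names, my plan is simply to dump it to CSV with pandas (the output filename below is just a placeholder):

# Write the cleaned DataFrame to CSV; "prices_cars_2017.csv" is only a placeholder name
df1.to_csv("prices_cars_2017.csv", index=False)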
Could anyone help me with this? I am really stuck: the formatting of these files is quite similar, but it still differs from one file to the other.