PDF scraping, tabula py - columns do not correspond with "true" values of PDF file

83 views Asked by At

I get stuck again with PDF scraping and observe that columns do not correspond to some of the values that I obtain for those columns. Basically, I want to obtain a CSV file, but first I want to extract data in Data Frames on Python. Here are the first and second pdf files from which I want to obtain data.

Here is a code example tested on the second PDF provided above. This code is totally inspired by H. José Medeiros Filho in a previous post on Stack Overflow, in which he helped me a lot:

    # Thanks to H.José Medeiros Filho for his help! This chunk code is entirely from him.

   dfs_list = tabula.read_pdf("prices_cars_2017.pdf", 
                              pages="4-605",
                              lattice=True,
                              stream=True,
                              pandas_options={'header': None},
                              multiple_tables=True)
    
    df1 = pandas.DataFrame([])
    
    i=4 #page 4
    df_old = None # if breaks you might want to check it
    for df in dfs_list:
        if i != 4:
            df = df[df[1] != 'MARCA'] # I`ve tested and know page 150 somethin will have a column with this data
            idx_valor = df.apply(lambda x: x.eq('VALOR').any(), axis=1).idxmax() #I`ve tested and know that some columns will have VALOR in row. So I`m droping all rows before with Valor Included
            if idx_valor > 0:
                df = df.iloc[idx_valor+1:].reset_index(drop=True)
        df = df.dropna(axis=1, how='all')
        df.loc[:,'page'] = f'Page: {i}'
        if df.shape[1] > 12: #You can remove this if, is just did it to understand the data better
            print(f'Faile page {i}')
            break
        df.columns = range(1, len(df.columns) + 1) # Cols cant have different number
        df1 = pandas.concat([df1, df], ignore_index=True)
        df_old = df
        i += 1
    header_rows = df1.iloc[2:3, :].astype(str).replace('nan','')
    print('Header')
    display(header_rows)
    print('df1')
    df1.columns = header_rows.to_numpy().tolist()[0] #changing cols to headers
    df1 = df1.iloc[4:, :] #droping rows
    df1

Could anyone help me with this? I am really stuck, as the formatting of these files are quite similar, but differ among them.

0

There are 0 answers