What is the best approach to extract only some columns of the tables

54 views Asked by At

I' writing a program. Users can upload PDF files. Every file has kind of a table. Every table has columns like(product name, unit, price), some files might have more than those three columns. I need to save data from three columns of every pdf file to my database. My problem is I don't know which approach is the best for my project. I googled a lot and didn't find answers for my problem.

So the 3 PDFs I'm working with.

IMG: PDF1 PDF2 PDF3

Pdfplumber approach showed me, tables are nested.

pdf1 = [[['WARZYWA', None, None], ['Nazwa produktu:', 'j.m.', 'Cena Netto:'], ['BAKŁAŻAN', 'kg', '8,51'], ['BROKUŁ 500G', 'szt', '5,29'], ['BRUKSELKA', 'kg', '0,00'], ['BURAK', 'kg', '2,19'], ['BURAK ŻÓŁTY', 'kg', '11,50'], ['BOCZNIAK', 'kg', '19,55'], ['BOTWINA PĘCZEK', 'szt', '3,80'], ['CEBULA', 'kg', '1,61'], ['CEBULA CHALLOT', 'kg', '11,50'], ['CEBULA CZERWONA', 'kg', '2,99'], ['CUKINIA KAL. 14', 'kg', '9,20'], ['CHRZAN KORZEŃ', 'kg', '16,10'], ['CYKORIA CZERWONA 250G', 'szt', '0,00'], ['CYKORIA ZIELONA 500G', 'szt', '6,56'], ['CZOSNEK', 'kg', '23,00'], ['CZOSNEK', 'szt', '1,84'], ['CZOSNEK CZARNY 100G', 'szt', '9,20'], ['DYNIA', 'kg', '0,00'], ['DYNIA HOKKAIDO', 'kg', '0,00'], ['DYNIA MAKARONOWA', 'kg', '0,00'], ['DYNIA PIŻMOWA', 'kg', '5,75'], ['FASOLKA SZPARAGOWA ZIELONA', 'kg', '19,55'], ['FASOLKA SZPARAGOWA ŻÓŁTA', 'kg', '0,00'], ['GROSZEK CUKROWY 250G', 'szt', '12,65'], ['GRZYBY SHIMEJI 150G', 'szt', '7,25'], ['IMBIR', 'kg', '12,65'], ['KABACZEK', 'kg', '0,00'], ['KALAREPA', 'szt', '2,59'], ['KALAFIOR KAL. 6', 'szt', '7,82'], ['KAPUSTA BIAŁA', 'kg', '2,76'], ['KAPUSTA CZERWONA', 'kg', '3,22'], ['KAPUSTA MŁODA', 'szt', '5,75'], ['KAPUSTA PAK CHOI 2SZT', 'szt', '7,48'], ['KAPUSTA PEKIŃSKA MŁODA', 'kg', '5,75'], ['KARCZOCHY', 'szt', '7,13'], ['KOPER', 'kg', '26,45'], ['KOPER WŁOSKI', 'kg', '14,49']]] 
pdf2 = [['PRODUKT', 'Kraj poch.', 'JEDNOSTKA', 'CENA NETTO', 'VAT %', 'CENA BRUTTO', None], ['NOWOŚCI', None, None, None, None, None, None], ['CUKINIA ŻÓŁTA IMPORT', 'Holandia', 'kg', '8,50 zł', '0', '8,50 zł', ''], [None, None, None, None, None, None, ''], ['DONICZKA TYMIANEK CYTRYNOWY', 'Polska', 'szt.', '9,60 zł', '8', '10,37 zł', ''], [None, None, None, None, None, None, ''], ['DONICZKA WERBENA CYTRYNOWA', 'Polska', 'szt.', '11,90 zł', '0', '11,90 zł', ''], [None, None, None, None, None, None, ''], ['MIKRO PĘDY CZOSNKU 50SZT.', 'Polska', 'op.', '8,90 zł', '0', '8,90 zł', ''], [None, None, None, None, None, None, ''], ['PUREE Z MIRABELKI 1 KG', 'Francja', 'op.', '49,72 zł', '0', '49,72 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI BIAŁE 500G PL', 'Polska', 'szt.', '17,65 zł', '0', '17,65 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI FIOLETOWE KG', 'Holandia', 'kg', '24,06 zł', '0', '24,06 zł', ''], [None, None, None, None, None, None, ''], ['SZPARAGI ZIELONE 500G PL', 'Polska', 'szt.', '18,44 zł', '0', '18,44 zł', ''], [None, None, None, None, None, None, ''], ['WARZYWA', None, None, None, None, None, None], ['BAKŁAŻAN', 'Holandia', 'kg', '10,27 zł', '0', '10,27 zł', ''], [None, None, None, None, None, None, ''], ['BOTWINA PĘCZEK', 'Polska', 'kg.', '15,24 zł', '0', '15,24 zł', ''], [None, None, None, None, None, None, ''], ['BROKUŁ 500G', 'Hiszpania', 'szt.', '8,02 zł', '0', '8,02 zł', ''], [None, None, None, None, None, None, ''], ['BROKUŁ KG', 'Hiszpania', 'kg', '16,04 zł', '0', '16,04 zł', ''], [None, None, None, None, None, None, ''], ['BRUKIEW W', 'ielka Brytani', 'a kg', '4,01 zł', '0', '4,01 zł', ''], [None, None, None, None, None, None, ''], ['BURAK CHIOGGIA', 'Holandia', 'kg.', '14,44 zł', '0', '14,44 zł', ''], [None, None, None, None, None, None, ''], ['BURAK PL', 'Polska', 'kg', '3,69 zł', '0', '3,69 zł', ''], [None, None, None, None, None, None, ''], ['BURAK ŻÓŁTY', 'Holandia', 'kg.', '14,44 zł', '0', '14,44 zł', ''], [None, None, None, None, None, None, ''], ['BURAKI CAŁE GOTOWANE 500G', 'Francja', 'op.', '4,34 zł', '0', '4,34 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA 75-105 IMP', 'Holandia', 'kg', '2,65 zł', '0', '2,65 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA CZERWONA IMP', 'Holandia', 'kg', '4,34 zł', '0', '4,34 zł', ''], [None, None, None, None, None, None, ''], ['CEBULA SHALLOT', 'Francja', 'kg', '16,85 zł', '0', '16,85 zł', ''], [None, None, None, None, None, None, ''], ['CEBULKA PERŁOWA SREBRNA', 'Holandia', 'kg', '19,25 zł', '0', '19,25 zł', ''], [None, None, None, None, None, None, ''], ['CHRZAN IMP', 'Niemcy', 'kg.', '24,06 zł', '0', '24,06 zł', ''], [None, None, None, None, None, None, ''], ['CUKINIA ZIELONA IMP', 'Hiszpania', 'kg', '8,82 zł', '0', '8,82 zł', ''], [None, None, None, None, None, None, ''], ['CYKORIA CZERWONA', 'Holandia', 'kg', '48,12 zł', '0', '48,12 zł', ''], [None, None, None, None, None, None, ''], ['CYKORIA ZIELONA 500G', 'Holandia', 'op.', '7,38 zł', '0', '7,38 zł', ''], [None, None, None, None, None, None, ''], ['CZOSNEK SZT', 'Polska', 'szt.', '2,89 zł', '0', '2,89 zł', ''], [None, None, None, None, None, None, ''], ['DYNIA PIŻMOWA', 'Hiszpania', 'kg.', '5,13 zł', '0', '5,13 zł', ''], [None, None, None, None, None, None, ''], ['FASOLKA SZEROKA', 'Maroko', 'kg', '25,66 zł', '0', '25,66 zł', ''], [None, None, None, None, None, None, ''], ['FASOLKA SZPARAGOWA ZIELONA IMP', 'Senegal', 'kg', '25,66 zł', '0', '25,66 zł', ''], [None, None, None, None, None, None, ''], ['GALANGAL', 'Tajlandia', 'kg', '153,97 zł', '0', '153,97 zł', ''], [None, None, None, None, None, None, '']] 
pdf3 - IndexError: list index out of range # but I can extract-text from this file. 

This is simple function I wrote that works with data above

# data is nested list I got from pdfplumber extract_table() 
def get_data_by_column_header(data, column_header):
    header_row = data[0]
    print(header_row)
    try:
        column_index = header_row.index(column_header)
    except ValueError:
        return []  
    column_data = [row[column_index] for row in data[1:]]
    return column_data

column_header = 'PRODUKT'
result = get_data_by_column_header(data2, column_header)
print(result)

And it kinda works but I can't figure out how to write a program that will parse from every PDF user upload. Should I check in a nested list for the keywords and then parse by index? But in pdf3 it won't work, because the pdfplumber doesn't see any table in there.

Also tryed with tabula-py. Tabula can see pdf3 as a table.

But in lattice=True, shows only columns headers:

[Empty DataFrame Columns: [Nr produktuNazwaj.m.ProducentCena nettoVAT [%]Cena brutto] 

with stream=True, doesn't see headers but shows data:

[ Unnamed: 0 Polewy, syropy Unnamed: 1 ... Unnamed: 3 Unnamed: 4 Unnamed: 5 0 NaN Syropy NaN ... NaN NaN NaN 1 NaN NaN NaN ... NaN NaN NaN 2 I136:13792 Syrop żurawinowy 1L SZT ... 35,61 8.0 38,46 3 I136:12984 Syrop zielona mięta 1L SZT ... 26,72 8.0 28,86 ... 

This is the function which will work with tabula and would extract column I want.

def extract_column_from_pdf(pdf_file, column_name):
    df = tabula.read_pdf(pdf_file, pages=1, stream=True)[0]
    # df = tabula.read_pdf(pdf_file, pages=1, lattice=True)[0]

    if column_name in df.columns:
        extracted_column = df[column_name]
        return extracted_column
    else:
        return None

column_name = "PRODUKT" 

result = extract_column_from_pdf(pdf2, column_name)
if result is not None:
    print(result)
else:
    print(f"Column not found in the PDF.")

Also works with one file because of those nested columns. This is my main issue. How to parse threw tables by the column name and get all that from this column? Are there any other approaches to solve my column issue? Maybe there is a simpler way to extract column I want to save to the database?

0

There are 0 answers