PdfPlumber reads tables vertically


When I use PDFPlumber for table extraction, some tables are read vertically, letter by letter, with each letter in a different cell, instead of being read horizontally within a single cell.

The table has the following structure: [image]

As the output shows, it is not read correctly: [image]

This is the script:

import pdfplumber
import pandas as pd
from openpyxl import Workbook

pdf_file_path = "CHILQUINTA.pdf"


def read_enel_0():
    wb = Workbook()
    # Context managers close both the log file and the PDF, even on errors.
    with open('output.txt', 'w') as archivo, pdfplumber.open(pdf_file_path) as pdf:
        for index, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            nombre_hoja = 'Hoja' + str(index)
            ws = wb.create_sheet(title=nombre_hoja)
            for table in tables:
                # Each elemento is one table row: a list of cell values.
                for elemento in table:
                    print(elemento)
                    archivo.write(str(elemento) + '\n')
                # Disabled for now: writing the rows to the Excel sheet.
                #rows.append(row)
                #df_temp = pd.DataFrame(rows)
                #data_list = df_temp.values.tolist()
                #for row_data in data_list:
                #ws.append(row_data)
    #wb.save('out_enel_0.xlsx')


if __name__ == "__main__":
    read_enel_0()

Is there an argument that would help correct this problem, ideally using PDFPlumber?
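To make the question concrete, here is the kind of argument I have in mind. `page.extract_tables()` accepts a `table_settings` dictionary, and the sketch below passes one with a text-based vertical strategy. The specific values, the name `CANDIDATE_SETTINGS`, and the helper `extract_tables_with_settings` are my own assumptions for illustration. I have not confirmed that these settings fix the vertical reading for this PDF.

```python
# Candidate settings to experiment with. The values are guesses to tune,
# not a confirmed fix for the letter-by-letter problem.
CANDIDATE_SETTINGS = {
    "vertical_strategy": "text",     # infer column edges from word alignment
    "horizontal_strategy": "lines",  # keep using drawn lines for row edges
    "snap_tolerance": 3,
    "intersection_tolerance": 3,
}


def extract_tables_with_settings(pdf_path, table_settings=None):
    """Return a list of (page_index, table) pairs for every table found.

    Hypothetical helper: it only forwards table_settings to
    page.extract_tables() on each page.
    """
    # Imported here so the settings can be inspected without the library.
    import pdfplumber

    settings = CANDIDATE_SETTINGS if table_settings is None else table_settings
    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for index, page in enumerate(pdf.pages):
            for table in page.extract_tables(table_settings=settings):
                results.append((index, table))
    return results
```

Calling `extract_tables_with_settings("CHILQUINTA.pdf")` and printing each row would make it possible to compare the result against the default settings. Switching either strategy between `"lines"` and `"text"` is the first thing I would vary.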

PS: Tabula reads the table better, but I think I am missing some PDFPlumber functionality.

PS2: Example table: https://a.storyblok.com/f/82872/x/06a39e751a/suministro_chilquinta_202307.pdf

There are 0 answers