When I use PDFPlumber for table extraction, some tables are read vertically, letter by letter, with each letter placed in a different cell, instead of being read horizontally within a single cell.
The table has the following structure: As the output shows, it is not read correctly:
This is the script:
import pdfplumber
import pandas as pd
from openpyxl import Workbook

pdf_file_path = "CHILQUINTA.pdf"

def read_enel_0():
    archivo = open('output.txt', 'w')
    wb = Workbook()
    with pdfplumber.open(pdf_file_path) as pdf:
        for index, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            nombre_hoja = 'Hoja' + str(index)
            ws = wb.create_sheet(title=nombre_hoja)
            for table in tables:
                for elemento in table:
                    print(elemento)
                    archivo.write(str(elemento)+'\n')
                    #rows.append(row)
                #df_temp = pd.DataFrame(rows)
                #data_list = df_temp.values.tolist()
                #for row_data in data_list:
                    #ws.append(row_data)
    #wb.save('out_enel_0.xlsx')
    archivo.close()

if __name__ == "__main__":
    read_enel_0()
Is there an argument that could help correct this problem, ideally using PDFPlumber?
P.S.: Tabula reads the table better, but I think I am overlooking some PDFPlumber functionality.
P.S. 2: Example of the table: https://a.storyblok.com/f/82872/x/06a39e751a/suministro_chilquinta_202307.pdf
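
For reference, one thing I could try is passing `table_settings` to `extract_tables()` and comparing the default line-based strategy with the text-based one. The sketch below is only an experiment under assumptions: the tolerance values are guesses that have not been tuned for this PDF, and `single_char_ratio` and `compare_strategies` are hypothetical helper names, not part of PDFPlumber. The ratio is a rough indicator of how many cells contain a single letter, which is the symptom described above.

```python
# Assumed starting values; they are guesses, not tuned for this document.
TEXT_BASED_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,
    "join_tolerance": 3,
    "intersection_tolerance": 3,
}


def single_char_ratio(table):
    """Return the fraction of non-empty cells that hold exactly one character."""
    cells = [c.strip() for row in table for c in row if c and c.strip()]
    if not cells:
        return 0.0
    return sum(len(c) == 1 for c in cells) / len(cells)


def compare_strategies(pdf_path, page_number=0):
    """Print the single-character ratio per table for both strategies."""
    import pdfplumber  # imported here so single_char_ratio works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        for label, settings in (("default", {}), ("text", TEXT_BASED_SETTINGS)):
            tables = page.extract_tables(table_settings=settings)
            for i, table in enumerate(tables):
                print(label, i, round(single_char_ratio(table), 2))
```

Calling `compare_strategies("CHILQUINTA.pdf")` would print one line per table and strategy; a ratio close to 1 suggests the table is still being split letter by letter. Whether the text-based strategy actually helps with this particular file is something I have not confirmed.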