PDF to CSV - converted CSV has interchanged contents of the columns

119 views Asked by At

I am trying to convert a PDF file into CSV using Python and the below code. Earlier it was working; however, recently it is not working. I am getting interchanged column contents in the converted CSV file.

Guide me to fix this column issue in my code. I am only concerned about the first page of the PDF conversion as I need to remove the first rows of the table.

#!/usr/bin/env python3
import tabula
import pandas as pd
import csv

pdf_file='/pdf2xls/Input.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']

# Page 1 processing
df1 = tabula.read_pdf(pdf_file, pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
                                                                      410,450,480,520]
                     ,pandas_options={'header': None}) #(top,left,bottom,right)

df1[0]=df1[0].drop(columns=5)
df1[0].columns=column_names
#df1[0].head(2)

#df1[0].to_csv('result.csv')

result = pd.DataFrame(df1[0]) # concate both the pages and then write to CSV
result.to_csv("/pdf2xls/Input.csv")
2

There are 2 answers

1
Timeless On BEST ANSWER

Assuming your pdf have always at least two pages with a footer in the last one, you can try :

# pip install pdfplumber
import pdfplumber
import pandas as pd

pdf = pdfplumber.open("23JJ0WL139.pdf")

tables = []
for p in pdf.pages:
    ta = p.extract_tables()[0]
    if str(p) == "<Page:1>":
        header = ta[4]
        tables.append(pd.DataFrame(ta[5:]))
    else:
        tables.append(pd.DataFrame(ta))
                    
df = pd.concat(tables).iloc[:-3].set_axis(header, axis=1)

Output :

print(df)

   Product    Batch No Machin\nNo  ... Net\nWt.kg Blender Operator
0    GC950  23JJ0WL139     WB_101  ...      51.40            Anand
1    GC950  23JJ0WL139     WB_101  ...      51.60            Anand
2    GC950  23JJ0WL139     WB_101  ...      51.20            Anand
3    GC950  23JJ0WL139     WB_101  ...      51.20            Anand
4    GC950  23JJ0WL139     WB_101  ...      51.80            Anand
..     ...         ...        ...  ...        ...     ...      ...
11   GC950  23JJ0WL139     WB_101  ...      51.60            RAHUL
12   GC950  23JJ0WL139     WB_101  ...      51.60            RAHUL
13   GC950  23JJ0WL139     WB_101  ...      51.80            RAHUL
14   GC950  23JJ0WL139     WB_101  ...      51.40            RAHUL
15   GC950  23JJ0WL139     WB_101  ...      51.80            RAHUL

[140 rows x 11 columns]
2
Jorge Dri On

It appears that the issue with your code is related to column interchange in the converted CSV file. This problem could be caused by the way you are specifying columns when using tabula.read_pdf. You can modify the column coordinates to ensure that each column is correctly identified and extracted.

#!/usr/bin/env python3
import tabula
import pandas as pd

pdf_file = '/pdf2xls/Input.pdf'
column_names = ['Product', 'Batch No', 'Machin No', 'Time', 'Date', 'Drum/Bag No', 'Tare Wt.kg', 'Gross Wt.kg',
               'Net Wt.kg', 'Blender', 'Remarks', 'Operator']

# Page 1 processing
df1 = tabula.read_pdf(pdf_file, pages=1, area=(95, 20, 800, 840), columns=[93, 180, 220, 252, 310, 355, 380, 410,
                                                                     450, 480, 520], pandas_options={'header': None})

df1[0].columns = column_names
df1[0] = df1[0].drop(0)  # Remove the first row
result = pd.DataFrame(df1[0])
result.to_csv("/pdf2xls/Input.csv", index=False)  # Set index=False to avoid writing the index column

In this modified code:

  1. I've adjusted the column coordinates in the columns parameter to ensure that each column is correctly identified.

  2. I've removed the first row using df1[0] = df1[0].drop(0) to get rid of the header row.

  3. I added index=False when saving the DataFrame to CSV to prevent the index from being written as a separate column.