PDF to CSV - converted CSV has interchanged column Contents

Question

163 views Asked by linux01 At 23 October 2023 at 10:03

I am trying to convert a PDF file to CSV using Python and have written the code below. It worked earlier, but recently it stopped working: the column contents in the converted CSV file are interchanged.

Please guide me on how to fix this column issue in my code.

#!/usr/bin/env python3
import tabula
import pandas as pd
import csv

pdf_file='/pdf2xls/Input.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']

# Page 1 processing; area is (top, left, bottom, right)
df1 = tabula.read_pdf(
    pdf_file,
    pages=1,
    area=(95, 20, 800, 840),
    columns=[93, 180, 220, 252, 310, 315, 333, 367, 410, 450, 480, 520],
    pandas_options={'header': None},
)

# Drop the extra column (index 5) so the frame matches column_names
df1[0] = df1[0].drop(columns=5)
df1[0].columns = column_names
#df1[0].head(2)

#df1[0].to_csv('result.csv')

# Only page 1 is read here, so the result is just that page's table
result = pd.DataFrame(df1[0])
result.to_csv("/pdf2xls/Input.csv")

Original Q&A

There is 1 answer

**Timeless** · Accepted Answer · 2023-10-24T12:50:09+00:00

You can use pdfplumber:

# pip install pdfplumber
import pandas as pd
import pdfplumber

# pdf_file is the path defined in the question
with pdfplumber.open(pdf_file) as pdf:
    tables = pdf.pages[0].extract_tables()

(
    pd.DataFrame(
        # get the second table and skip the last three rows
        data=tables[1][:-3],
        # get the last row of the first table
        columns=tables[0][-1]
    )
    .replace("", float("nan")) # get rid of the empty strings
    # .to_csv("out.csv", index=False) # uncomment to make a fresh csv
)

Output:

   Product    Batch No Machin\nNo   Time        Date Drum/\nBag\nNo Tare\nWt.kg Gross\nWt.kg Net\nWt.kg  Blender Operator
0    L1050  23JJ0AL051     WB-102  01:07  16-10-2023              1       57.20      1398.80    1341.60      NaN     Amit
1    L1050  23JJ0AL051     WB-102  01:22  16-10-2023              2       57.40      1398.80    1341.40      NaN     Amit
2    L1050  23JJ0AL051     WB-102  01:33  16-10-2023              3       58.20      1399.60    1341.40      NaN     Amit
3    L1050  23JJ0AL051     WB-102  01:44  16-10-2023              4       58.80      1400.60    1341.80      NaN     Amit
4    L1050  23JJ0AL051     WB-102  01:55  16-10-2023              5       57.20      1399.00    1341.80      NaN     Amit
..     ...         ...        ...    ...         ...            ...         ...          ...        ...      ...      ...
20   L1050  23JJ0AL051     WB-102  05:42  16-10-2023             21       57.40      1398.60    1341.20      NaN     Amit
21   L1050  23JJ0AL051     WB-102  05:52  16-10-2023             22       57.40      1399.00    1341.60      NaN     Amit
22   L1050  23JJ0AL051     WB-102  06:00  16-10-2023             23       57.40      1398.80    1341.40      NaN     Amit
23   L1050  23JJ0AL051     WB-102  06:10  16-10-2023             24       57.80      1399.60    1341.80      NaN     Amit
24   L1050  23JJ0AL051     WB-102  06:19  16-10-2023             25       57.80      1399.40    1341.60      NaN     Amit

[25 rows x 11 columns]
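
The headers in the output above still carry the line breaks from the PDF's multi-line header cells (for example `Machin\nNo` and `Drum/\nBag\nNo`). If you want them to look like the names used in the question, a small cleanup along these lines may help. This is only a sketch: `clean_headers` is a hypothetical helper, not part of pdfplumber or pandas, and it assumes that a break right after a slash should be closed up rather than turned into a space.

```python
import re


def clean_headers(headers):
    """Flatten line breaks left inside multi-line header cells (hypothetical helper)."""
    cleaned = []
    for header in headers:
        # Treat a missing header cell (None) as an empty string
        text = "" if header is None else str(header)
        # Collapse any run of whitespace, including newlines, into one space
        text = re.sub(r"\s+", " ", text).strip()
        # Assumption: "Drum/ Bag No" should read "Drum/Bag No"
        text = text.replace("/ ", "/")
        cleaned.append(text)
    return cleaned


raw = ["Product", "Machin\nNo", "Drum/\nBag\nNo", "Tare\nWt.kg"]
print(clean_headers(raw))
# ['Product', 'Machin No', 'Drum/Bag No', 'Tare Wt.kg']
```

With the DataFrame from the answer assigned to a variable, say `df`, the same helper could be applied as `df.columns = clean_headers(df.columns)` before writing the CSV.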
