Electoral Data analysis - OCR is not working


I'm working on an analysis of voter data. The data is in PDF files, which I need to convert to text before I can analyse it. I was able to do this for text-based PDFs (using PyPDF2). However, in a few documents the voter list is a scanned copy. I'm trying to run OCR on those, but the output comes out wrong.

Below is my code:

import os
import PyPDF2
import re
import pandas as pd
import numpy as np
import fitz
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

directory = r'C:\Users\AUS\Downloads\Electoral_roll'

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        src_file_path = os.path.join(directory, filename)

        # PyPDF2 is only needed here to count the pages
        with open(src_file_path, 'rb') as src_file:
            pdfreader = PyPDF2.PdfFileReader(src_file)
            num_pg = pdfreader.getNumPages()

        # Skip the first two pages and the last page
        start_pno = 2
        end_pno = num_pg - 1

        # Open the PDF once with PyMuPDF (fitz) instead of once per page
        pdf_doc = fitz.open(src_file_path)

        # Mode 'a' creates the file if it is missing, so no try/except is needed
        with open('pdf_content.txt', 'a') as dest_file:
            for pg in range(start_pno, end_pno):
                pdf_page = pdf_doc.load_page(pg)
                pix = pdf_page.get_pixmap()  # default rendering resolution (72 dpi)
                image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                text = pytesseract.image_to_string(image, lang='eng')
                print(text)
                dest_file.write(text)

        pdf_doc.close()

My PDF contains scanned pages like the one below (if you want the PDF file itself, you can download it from here):

[Image: sample scanned page from the voter list]

My output is as below:

[Image: OCR output for the scanned page]

I want the OCR step to extract details such as Name, House No and gender so that I can use them for analysis. For reference, here is my code, which works fine for text PDFs. I'm only struggling with PDFs that contain scanned documents.
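To make the goal concrete, here is a minimal sketch of the parsing step I have in mind once OCR returns usable text. The field labels (`Name`, `House Number`, `Gender`), the `parse_voter_block` helper and the sample entry are all assumptions made up for illustration; the real roll may label its fields differently.

```python
import re

# Hypothetical field labels; adjust them to whatever the OCR text actually contains.
FIELD_PATTERNS = {
    # Anchored to the start of a line so that "Father's Name" is not picked up
    "name": re.compile(r"^\s*Name\s*[:=-]\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE),
    "house_no": re.compile(r"House\s*(?:No\.?|Number)\s*[:=-]\s*(\S+)", re.IGNORECASE),
    "gender": re.compile(r"(?:Gender|Sex)\s*[:=-]\s*(Male|Female|Other)", re.IGNORECASE),
}


def parse_voter_block(block):
    """Extract name, house number and gender from one voter entry.

    Fields that cannot be found are returned as None, so partially
    recognised entries can still be inspected later.
    """
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(block)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up sample entry, not taken from the real electoral roll
sample = """Name : Ravi Kumar
Father's Name: Suresh Kumar
House Number : 12/4
Age : 34 Gender : Male"""

print(parse_voter_block(sample))
```

The resulting dictionaries could then be collected in a list and passed to `pd.DataFrame` for the analysis.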

I'm looking for insights or workarounds that would help me achieve this. Any suggestions or updated logic would be much appreciated.
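One thing that may be worth trying, as a rough sketch rather than a confirmed fix: `get_pixmap()` renders at 72 dpi when no matrix is given, which is often too coarse for Tesseract, so the page could be rendered at a higher resolution and converted to grayscale before OCR. The 300 dpi target, the `--psm 6` page segmentation mode and the helper names `zoom_for_dpi` and `ocr_page` are assumptions, and I have not verified that they resolve the problem for this particular file.

```python
PDF_BASE_DPI = 72  # PyMuPDF renders at 72 dpi when no matrix is passed


def zoom_for_dpi(target_dpi):
    """Return the zoom factor that scales the 72 dpi default to target_dpi."""
    return target_dpi / PDF_BASE_DPI


def ocr_page(pdf_path, page_index, dpi=300, psm=6):
    """Render one PDF page at the given dpi and run Tesseract on it."""
    # Imported here so zoom_for_dpi can be used without the OCR stack installed
    import fitz
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        # A scaling matrix raises the rendering resolution above the default
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

    # Grayscale input is a common preprocessing step before OCR
    gray = image.convert("L")
    # --psm 6 asks Tesseract to treat the page as one uniform block of text (assumed setting)
    return pytesseract.image_to_string(gray, lang="eng", config=f"--psm {psm}")
```

In the loop above this would replace the pixmap and OCR lines with something like `text = ocr_page(src_file_path, pg)`, with `pytesseract.pytesseract.tesseract_cmd` still set as before.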
