Electoral Data analysis - OCR is not working

47 views Asked by Tony At 21 March 2024 at 07:51

I'm analysing voter data. The data is in scanned PDFs, and I need to convert them to text for analysis. I was able to do this when the document is a text PDF (using PyPDF2). However, in a few documents the voter list is a scanned copy. I'm trying to perform OCR on those, but the output is weird.

Below is my code:

import os
import PyPDF2
import re
import pandas as pd
import numpy as np
import fitz
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

directory = r'C:\Users\AUS\Downloads\Electoral_roll'

for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        src_file_path = os.path.join(directory, filename)
        src_file = open(src_file_path, 'rb')
        pdfreader = PyPDF2.PdfFileReader(src_file)  # pre-3.0 PyPDF2 API
        num_pg = pdfreader.getNumPages()

        # Open the PDF once with PyMuPDF instead of once per page
        pdf_doc = fitz.open(src_file_path)

        # Skip the first two pages and the last page
        start_pno = 2
        end_pno = num_pg - 1

        for pg in range(start_pno, end_pno):
            # Text layer (usually empty for scanned pages)
            pageob = pdfreader.getPage(pg)
            pdf_data = pageob.extractText()

            # Render the page to an image and run OCR on it
            pdf_page = pdf_doc.load_page(pg)
            pix = pdf_page.get_pixmap()
            image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            text = pytesseract.image_to_string(image, lang='eng')
            print(text)

            # Append mode creates the file if it does not exist
            with open('pdf_content.txt', 'a') as dest_file:
                dest_file.write(text)

        pdf_doc.close()
        src_file.close()

My PDF contains scanned pages like the one below. (If you want to download the PDF file, you can download it from here.)

[![enter image description here][2]][2]

My output is as below:

I want the OCR to extract details such as Name, House No and Gender so I can use them for analysis. Just for reference, here is my code, which works fine for text PDFs. But I'm struggling with PDFs that contain scanned documents.

I'm looking for insights or workarounds to achieve this. Any insights or updated logic would be much appreciated.
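One thing that may be worth trying, offered as a sketch rather than a confirmed fix: `get_pixmap()` renders at 72 dpi unless it is given a scaling matrix, and that is often too coarse for Tesseract on small print. The code below renders a page at a higher resolution in grayscale before OCR, then pulls fields out of the recognised text with regular expressions. The names `zoom_for_dpi`, `ocr_page`, `parse_voter_block` and `FIELD_PATTERNS` are hypothetical, and the `Name : ...` / `House Number : ...` / `Gender : ...` label layout is an assumption about the roll, not something taken from the actual PDF. On Windows, `pytesseract.pytesseract.tesseract_cmd` still has to be set as in the question.

```python
import re

PDF_BASE_DPI = 72  # PyMuPDF renders at 72 dpi unless a scaling matrix is passed


def zoom_for_dpi(dpi):
    """Return the scale factor that turns the 72 dpi default into `dpi`."""
    return dpi / PDF_BASE_DPI


def ocr_page(pdf_path, page_number, dpi=300, lang="eng"):
    """Render one page at `dpi` in grayscale and return the OCR text."""
    # Imported here so the parsing helper below works without the OCR stack.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_number)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom),
                              colorspace=fitz.csGRAY)
        # Grayscale pixmap without alpha: one byte per pixel, PIL mode "L"
        image = Image.frombytes("L", (pix.width, pix.height), pix.samples)
    # --psm 6 treats the page as one uniform block of text; another page
    # segmentation mode may suit a multi-column roll better.
    return pytesseract.image_to_string(image, lang=lang, config="--psm 6")


# ASSUMPTION: each voter entry carries labels in roughly this form.
# Adjust the patterns to whatever the OCR text actually looks like.
FIELD_PATTERNS = {
    "name": re.compile(r"^\s*Name\s*[:=]\s*(.+)$",
                       re.IGNORECASE | re.MULTILINE),
    "house_no": re.compile(r"House\s*(?:No\.?|Number)\s*[:=]\s*(\S+)",
                           re.IGNORECASE),
    "gender": re.compile(r"(?:Gender|Sex)\s*[:=]\s*(Male|Female|Other)",
                         re.IGNORECASE),
}


def parse_voter_block(block):
    """Extract name, house number and gender from one entry's OCR text.

    Fields that cannot be found are returned as None.
    """
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(block)
        record[field] = match.group(1).strip() if match else None
    return record
```

A possible way to use it: call `ocr_page(src_file_path, pg)` in place of the `get_pixmap()` / `image_to_string()` lines, split the returned text into one chunk per voter entry, and collect `parse_voter_block(chunk)` results into a list that can be passed to `pd.DataFrame`. How to split the page into entries depends on the roll's layout and is not covered here.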
