I am trying to perform OCR with Tesseract on several large PDF files (~400-600 pages each). I don't need the text from every page, only from a few pages whose numbers are known. The PDF files seem to have had some sort of OCR performed on them already, but it isn't a good job. When I run this code, which I wrote in Jupyter:
import pdf2image
from PIL import Image
import pytesseract
import cv2
import numpy as np
pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
images = pdf2image.convert_from_path("test2.pdf", first_page=3, last_page=3, poppler_path=r"C:/Program Files/poppler-0.68.0/bin")
images[0].show()
I see this output:
This is what the output should look like:
I think the OCR that was already done on the PDF is causing the problem here. I am not sure how to bypass it. Can someone please help?
I also tried OCR after manually converting the page into an image (with the Snipping Tool), and the OCR engine worked. I also tried playing with the options of pdf2image.convert_from_path(), such as omitting the poppler_path option or rendering other pages. I tried reading another PDF file, which did not have OCR performed on it, and that seemed to work.
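For the "only a few known pages" part, here is a minimal sketch of how the rendering and OCR could be limited to those pages. It only uses the first_page and last_page arguments already shown above, plus pytesseract.image_to_string. The helper names contiguous_ranges and ocr_pages and the dpi value of 300 are my own assumptions for illustration, and this does not address the rendering problem itself.

```python
def contiguous_ranges(pages):
    """Group page numbers into inclusive (first, last) runs of consecutive pages."""
    ranges = []
    for page in sorted(set(pages)):
        if ranges and page == ranges[-1][1] + 1:
            # Page directly follows the current run, so extend that run.
            ranges[-1] = (ranges[-1][0], page)
        else:
            # Gap before this page, so start a new run.
            ranges.append((page, page))
    return ranges


def ocr_pages(pdf_path, pages, poppler_path=None, dpi=300):
    """Render only the requested pages and return {page_number: text}.

    Hypothetical helper; dpi=300 is an assumed value, not a recommendation
    taken from the pdf2image or Tesseract documentation.
    """
    # Imported here so contiguous_ranges can be used without these packages.
    import pdf2image
    import pytesseract

    text_by_page = {}
    for first, last in contiguous_ranges(pages):
        # One call per run, so pages outside the runs are never rendered.
        images = pdf2image.convert_from_path(
            pdf_path,
            dpi=dpi,
            first_page=first,
            last_page=last,
            poppler_path=poppler_path,
        )
        for page_number, image in zip(range(first, last + 1), images):
            text_by_page[page_number] = pytesseract.image_to_string(image)
    return text_by_page
```

It would be called as ocr_pages("test2.pdf", [3, 4, 10], poppler_path=r"C:/Program Files/poppler-0.68.0/bin"), after setting pytesseract.pytesseract.tesseract_cmd as in the code above. Grouping consecutive pages into runs keeps the number of Poppler calls small without rendering the whole 400-600 page file.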
I had the same issue. Since I was unable to fix it, I decided to go with another library.
With the help of another Stack Overflow post and some Googling, I was able to modify Mohit Chandel's function to convert a PDF (with multiple pages) into JPGs