Why is pdf2image giving me a blank image file?

I am trying to perform OCR with Tesseract on several large PDF files (~400-600 pages each). I don't need the text from every page, only from a few pages whose numbers are known. The PDF files seem to have had some sort of OCR performed on them already, but it isn't a good job. When I run this code that I wrote in Jupyter:

import pdf2image
from PIL import Image
import pytesseract
import cv2
import numpy as np

pytesseract.pytesseract.tesseract_cmd = r"C:/Program Files/Tesseract-OCR/tesseract.exe"
images = pdf2image.convert_from_path("test2.pdf", first_page=3, last_page=3, poppler_path=r"C:/Program Files/poppler-0.68.0/bin")
images[0].show()

I see this output: [Output from images[0].show()]

This is what the output should look like: Input image

I think the OCR that was already done on the PDF is causing the problem here. I am not sure how to bypass it. Can someone please help?

I also tried OCR after manually converting the page into an image (with the Snipping Tool), and the OCR engine worked. I also tried varying the arguments to pdf2image.convert_from_path(), such as omitting the poppler_path option or requesting other pages. Finally, I tried reading another PDF file, one which did not have OCR performed on it, and that seemed to work.

There are 3 answers

EHR

I had the same issue. Since I was unable to fix it, I decided to go with another library.

With the help of another Stack Overflow post and some Googling, I was able to modify Mohit Chandel's function to convert a multi-page PDF into JPEGs, one file per page:

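import locale
import os

import ghostscript

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    """
    Convert every page of a PDF to a numbered JPEG (name-1.jpg, name-2.jpg, ...).

    Source: https://stackoverflow.com/questions/60701262/convert-pdf-to-image-using-python,
    https://www.kite.com/python/answers/how-to-remove-everything-after-a-character-in-a-string-in-python,
    https://www.ghostscript.com/doc/current/Use.htm
    """
    # Strip only the final extension; split(".", 1) would cut at the first dot
    # anywhere in the path (for example in "./out/page.jpg").
    output_stem = os.path.splitext(jpeg_output_path)[0]

    args = ["pdf2jpeg",  # program-name placeholder; the actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",  # render at 144 dpi
            "-sOutputFile=" + output_stem + "-%d.jpg",  # %d becomes the page number
            pdf_input_path]

    # Encode the arguments as bytes before handing them to the ghostscript bindings.
    encoding = locale.getpreferredencoding()
    args = [a.encode(encoding) for a in args]

    ghostscript.Ghostscript(*args)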
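
Since the question only needs a few known pages, the same idea can be narrowed to a single page. The following is a minimal sketch, not part of the original answer: it assumes the Ghostscript command-line executable is installed and relies on Ghostscript's -dFirstPage and -dLastPage options. The helper names build_gs_args and render_page, and the default executable name gswin64c (the usual Windows console build; typically gs on Linux/macOS), are assumptions for illustration.

```python
import os
import subprocess


def build_gs_args(pdf_input_path, jpeg_output_path, page, dpi=144):
    """Build Ghostscript arguments that render one page of a PDF to a JPEG."""
    # splitext strips only the final extension, so dots in directory
    # names are left alone.
    stem = os.path.splitext(jpeg_output_path)[0]
    return [
        "-dNOPAUSE",
        "-dBATCH",                # exit after processing instead of prompting
        "-sDEVICE=jpeg",
        f"-r{dpi}",
        f"-dFirstPage={page}",    # restrict rendering to a single page
        f"-dLastPage={page}",
        f"-sOutputFile={stem}-{page}.jpg",
        pdf_input_path,
    ]


def render_page(pdf_input_path, jpeg_output_path, page, gs_executable="gswin64c"):
    """Run Ghostscript for one page; gs_executable is an assumed install name."""
    cmd = [gs_executable] + build_gs_args(pdf_input_path, jpeg_output_path, page)
    subprocess.run(cmd, check=True)
```

For example, render_page("test2.pdf", "page.jpg", 3) would be expected to write page-3.jpg, which can then be opened with PIL and passed to pytesseract.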

K J

There is nothing wrong with the source OCR. In fact, it is better than most similar examples. True, there is a glitch here and there, but that is due to the quality of the source scan and is to be expected, and I suspect a second OCR pass would fare much worse.

Here is the source with the OCR text highlighted: [image]

Here is the existing OCR layer (which is readable as searchable text), rendered as an image. You suggest running OCR on it a second time, but that can only make the result worse, never better, unless you type in any characters that are missing or malformed.

[image]

And here it is as text exported to WordPad:

First Edition, 5,000 Copies, November 1972 

© The Navajivan Trust, 1972 

Principal collaborators: 
Shankar Prasada, ics (retd.) 
Special Secretary, Kashmir Affairs (1958-65) 
Chief Commissioner of Delhi (1948-54) 

B. L. Sharma 
Former Principal Information Officer, Government of India, 
Former Special Officer on Kashmir Affairs in the External Affairs 
Ministry, New Delhi, and author 

Inder Jit 
Director-Editor, India News and Feature Alliance and 
Editor, The States, New Delhi 

Trevor Drieberg 
Political Commentator and Feature Writer 
Former News Editor, The Indian Express, New Delhi 

Uggar Sain 
Former News Editor and Assistant Editor, 
The Hindustan Times, New Delhi 

Printed and Published by Shantilal Harjivan Shah 
Navajivan Press, Ahmedabad-14 
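
If the existing text layer is good enough, it can be read directly for just the pages of interest, without rendering or re-running OCR. A minimal sketch, assuming poppler's pdftotext utility is installed (it ships in the same bin folder that poppler_path points to in the question); the helper names build_pdftotext_cmd and extract_page_text are made up for illustration:

```python
import subprocess


def build_pdftotext_cmd(pdf_path, page, executable="pdftotext"):
    """Build a pdftotext command that extracts a single page to stdout."""
    return [
        executable,
        "-f", str(page),   # first page to convert
        "-l", str(page),   # last page to convert
        "-layout",         # try to preserve the physical layout of the text
        pdf_path,
        "-",               # write the text to stdout instead of a file
    ]


def extract_page_text(pdf_path, page, executable="pdftotext"):
    """Return the embedded (already OCR'd) text of one page as a string."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, page, executable),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

On Windows, executable would need to be the full path to pdftotext.exe unless the poppler bin folder is on PATH. Calling extract_page_text("test2.pdf", 3) should then return text like the export shown above.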


phil

I had the same issue and solved it by upgrading poppler from version 21.03.0 to 21.11.0.
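
To see which poppler version pdf2image is actually picking up, the version banner of pdftoppm can be inspected. This is a rough sketch: it assumes the banner contains the word "version" followed by a dotted number (as in "pdftoppm version 21.11.0"), and the function names are hypothetical.

```python
import re
import subprocess


def parse_poppler_version(banner):
    """Extract a version tuple such as (21, 11, 0) from a version banner."""
    match = re.search(r"version\s+(\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_poppler_version(pdftoppm="pdftoppm"):
    """Run `pdftoppm -v` and parse the version it reports."""
    result = subprocess.run([pdftoppm, "-v"], capture_output=True, text=True)
    # The banner is usually written to stderr; check stdout as well to be safe.
    return parse_poppler_version(result.stderr + result.stdout)
```

Tuples compare element by element, so parse_poppler_version("pdftoppm version 0.68.0") < (21, 11, 0) is True, which would flag the poppler-0.68.0 build used in the question as older than the release that fixed the issue for me.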