Time efficient way to convert PDF to image


Context:

I have PDF files I'm working with. I'm using OCR to extract the text from these documents, and to do that I have to convert my PDF files to images. I currently use the convert_from_path function of the pdf2image module, but it is very slow (9 minutes for a 9-page PDF).

Problem:

I am looking for a way to accelerate this process or another way to convert my PDF files to images.

Additional info:

I am aware that there is a thread_count parameter in the function but after several tries it doesn't seem to make any difference.

This is the whole function I am using:

import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')

    image_counter = 0

    # Iterate through all the pages stored above and save each one as a JPEG
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
        image_counter = image_counter + 1

    # Remove the intermediate .ppm files that pdf2image wrote to output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))

Link to the convert_from_path reference.


There is 1 answer

Answered by zanga (best answer)

I found an answer to that problem using another module called fitz, which is a Python binding to MuPDF.

First of all, install PyMuPDF:

The documentation can be found here, but for Windows users it's rather simple:

pip install PyMuPDF

Then import the fitz module:

import fitz
print(fitz.__doc__)

>>>PyMuPDF 1.18.13: Python bindings for the MuPDF 1.18.0 library.
>>>Version date: 2021-05-05 06:32:22.
>>>Built for Python 3.7 on win32 (64-bit).

Open your file and save every page as images:

The get_pixmap() method accepts different parameters that allow you to control the image (resolution, colorspace, transparency...), so I suggest that you read the documentation here.

def convert_pdf_to_image(fic):
    #open your file
    doc = fitz.open(fic)
    #iterate through the pages of the document and create a RGB image of the page
    for page in doc:
        pix = page.get_pixmap()
        pix.save("page-%i.png" % page.number)
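For OCR you will probably want more than the default resolution, since get_pixmap() renders at 72 dpi unless you pass a scaling matrix. Here is a rough sketch of how the parameters mentioned above could be used. The names zoom_for_dpi and convert_pdf_to_images, the output_folder, dpi and grayscale arguments, and the default of 300 dpi are my own choices for illustration, not something from the PyMuPDF docs, so adjust them to your needs.

```python
import os


def zoom_for_dpi(dpi):
    """Return the zoom factor that maps PDF points (72 per inch) to `dpi`."""
    return dpi / 72.0


def convert_pdf_to_images(fic, output_folder, dpi=300, grayscale=False):
    """Render every page of `fic` to a PNG in `output_folder`; return the paths."""
    # Imported here so zoom_for_dpi() stays usable without PyMuPDF installed.
    import fitz

    os.makedirs(output_folder, exist_ok=True)

    # A zoom of 1.0 corresponds to 72 dpi, so scale both axes by dpi / 72.
    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)

    # Grayscale pixmaps are smaller and are often enough for OCR (assumption).
    colorspace = fitz.csGRAY if grayscale else fitz.csRGB

    paths = []
    doc = fitz.open(fic)
    try:
        for page in doc:
            pix = page.get_pixmap(matrix=matrix, colorspace=colorspace, alpha=False)
            path = os.path.join(output_folder, "page-%03i.png" % page.number)
            pix.save(path)
            paths.append(path)
    finally:
        doc.close()
    return paths
```

You would call it as convert_pdf_to_images("document.pdf", "pages", dpi=300, grayscale=True), where both names are placeholders. Rendering time and file size grow with the square of the zoom factor, so it is worth trying the lowest dpi your OCR engine still reads reliably.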

Hope this helps anyone else.