Context:
I have PDF files I'm working with.
I'm using an ocr
to extract the text from these documents and to be able to do that I have to convert my pdf files to images.
I currently use the convert_from_path
function of the pdf2image
module but it is very time inefficient (9minutes for a 9page pdf).
Problem:
I am looking for a way to accelerate this process or another way to convert my PDF files to images.
Additional info:
I am aware that there is a thread_count
parameter in the function but after several tries it doesn't seem to make any difference.
This is the whole function I am using:
def pdftoimg(fic,output_folder):
# Store all the pages of the PDF in a variable
pages = convert_from_path(fic, dpi=500,output_folder=output_folder,thread_count=9, poppler_path=r'C:\Users\Vincent\Documents\PDF\poppler-21.02.0\Library\bin')
image_counter = 0
# Iterate through all the pages stored above
for page in pages:
filename = "page_"+str(image_counter)+".jpg"
page.save(output_folder+filename, 'JPEG')
image_counter = image_counter + 1
for i in os.listdir(output_folder):
if i.endswith('.ppm'):
os.remove(output_folder+i)
Link to the convert_from_path reference.
I found an answer to that problem using another module called
fitz
which is a python binding toMuPDF
.First of all install PyMuPDF:
The documentation can be found here but for windows users it's rather simple:
Then import the fitz module:
Open your file and save every page as images:
The get_pixmap() method accepts different parameters that allows you to control the image (variation,resolution,color...) so I suggest that you red the documentation here.
Hope this helps anyone else.