I have code that looks for a particular string sequence throughout a bunch of PDFs. The problem is that this process is extremely slow. (Sometimes I get PDFs with over 50,000 pages.)
Is there a way to do multithreading? Unfortunately, even though I searched, I couldn't make heads or tails of the threading examples I found.
import os

import slate3k as slate

f = 'C:/Users/akhan37/Desktop/learning profiles/unzipped/unzipped_files'
idee = "123456789"

os.chdir(f)
for file in os.listdir('.'):
    print(file)
    with open(file, 'rb') as g:
        # slate.PDF returns a list of per-page text strings
        extracted_text = slate.PDF(g)
    # search every page for the id string
    if any(idee in page for page in extracted_text):
        print(file)
The run time is very long. I don't think it's the code's fault but rather the fact that I have to go through over 700 PDFs.
I would suggest using pdfminer: you can convert the document into a list of page objects and then search those pages in parallel on different cores with multiprocessing. A sketch of that approach is below.
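Here is a minimal sketch, assuming pdfminer.six is installed. Its extract_text function accepts a page_numbers argument, so each worker process can extract and search just a slice of a document's pages, and PDFPage.get_pages is used only to count pages cheaply. The names TARGET, PDF_DIR, CHUNK_SIZE, search_chunk, and pdf_contains_target are illustrative, not part of any library; tune the chunk size and pool size to your machine.

import os
from multiprocessing import Pool

from pdfminer.high_level import extract_text
from pdfminer.pdfpage import PDFPage

TARGET = "123456789"   # string sequence to look for (from the question)
PDF_DIR = 'C:/Users/akhan37/Desktop/learning profiles/unzipped/unzipped_files'
CHUNK_SIZE = 50        # pages handled per worker task; adjust for your machine


def page_count(path):
    # Count the pages in a PDF without extracting any text.
    with open(path, 'rb') as fp:
        return sum(1 for _ in PDFPage.get_pages(fp))


def search_chunk(args):
    # Extract text from a slice of pages and report whether TARGET appears.
    path, page_numbers = args
    text = extract_text(path, page_numbers=page_numbers)
    return TARGET in text


def pdf_contains_target(path, pool):
    # Split the PDF into page chunks and search the chunks in parallel.
    pages = list(range(page_count(path)))
    chunks = [(path, pages[i:i + CHUNK_SIZE])
              for i in range(0, len(pages), CHUNK_SIZE)]
    # any() stops pulling results once a chunk reports a match
    # (already-submitted chunks may still finish in the background).
    return any(pool.imap_unordered(search_chunk, chunks))


if __name__ == '__main__':
    with Pool() as pool:   # defaults to one worker per CPU core
        for name in os.listdir(PDF_DIR):
            if not name.lower().endswith('.pdf'):
                continue
            if pdf_contains_target(os.path.join(PDF_DIR, name), pool):
                print(name)

If the bottleneck is the number of files rather than the size of each one, the same Pool can instead map a per-file search function over the 700 paths, which is an even simpler change.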