The following function
import io

import pypdfium2 as pdfium
import pytesseract
from PIL import Image


def convert_pdf_to_img_extract_text(pdf_path: str, language='deu') -> str:
    text = ''
    images = []
    pdf = pdfium.PdfDocument(pdf_path)
    n_pages = len(pdf)
    page_indices = list(range(n_pages))
    pytesseract.pytesseract.tesseract_cmd = r"C:\\Users\\myname\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"
    config = r"--oem 3 --psm 6 --tessdata-dir 'C:\\Users\\myname\\AppData\\Local\\Programs\\Tesseract-OCR\\tessdata\\'"
    # Render every page to a PIL image at 300 DPI (PDF units are 1/72 inch)
    renderer = pdf.render_to(
        pdfium.BitmapConv.pil_image,
        page_indices = page_indices,
        scale = 300/72,
    )
    for ele in renderer:
        images.append(ele)
    for image in images:
        # Round-trip each page through an in-memory PNG before OCR
        img_bytes = io.BytesIO()
        image.save(img_bytes, format="PNG")
        img_bytes.seek(0)
        img = Image.open(img_bytes)
        # Use the language parameter instead of a hard-coded "deu"
        text += pytesseract.image_to_string(img, lang=language, config=config)
    return text
runs successfully as part of a Flask app when the app is served with the built-in server via flask run. As soon as I serve the same app with waitress
from waitress import serve
import app
serve(app.app, host="0.0.0.0", port="5000")
the server goes silent after entering the function and printing the following output, and stays that way until I interrupt the execution with a KeyboardInterrupt.
2023-10-18 11:48:51 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:48:58 INFO Webseite wurde aufgerufen von ipAdress
2023-10-18 11:49:08 WARNING Cannot perform concurrent rendering with buffer input - reading the whole buffer into memory implicitly.
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000
The WARNING is displayed regardless of the way the app is served.
Update: I added some logging to pypdfium2/_helpers/document.py to get more information. The render_to() function from pypdfium2 runs until it spawns processes with the ProcessPoolExecutor. Then it outputs 2023-10-18 11:49:12 INFO Serving on http://0.0.0.0:5000 8 times, as shown above.
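One possible explanation, offered as an assumption rather than a confirmed diagnosis: on Windows, ProcessPoolExecutor starts its workers with the "spawn" method, and every spawned worker re-runs the main script under the module name "__mp_main__". If the script that calls serve() has no if __name__ == "__main__": guard, each worker would call serve() again, which would fit the repeated "Serving on" lines. The sketch below imitates that re-import with runpy instead of real processes. The file name serve_app.py, the started list, and the helper simulate_worker_import are hypothetical stand-ins, and the append call stands in for serve().

```python
import os
import runpy
import tempfile

# Hypothetical entry-point scripts; started.append("serve") stands in for serve(...)
UNGUARDED = 'started.append("serve")\n'
GUARDED = 'if __name__ == "__main__":\n    started.append("serve")\n'


def simulate_worker_import(script_source):
    """Run a script the way a spawned worker imports the parent's main script.

    Real processes are not started here. The re-import is imitated by
    executing the script under the module name "__mp_main__".
    """
    started = []  # records every simulated serve() call
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "serve_app.py")
        with open(path, "w") as f:
            f.write(script_source)
        # The list is injected as a global so the script can record side effects
        runpy.run_path(path, init_globals={"started": started}, run_name="__mp_main__")
    return started


print(simulate_worker_import(UNGUARDED))  # ['serve']: the worker would start a server too
print(simulate_worker_import(GUARDED))    # []: the guard skips serve() in the worker
```

If this assumption holds, moving the serve(app.app, ...) call under an if __name__ == "__main__": guard in the waitress script would be the thing to try first.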
After interrupting the execution I get this output:
KeyboardInterrupt
2023-10-18 11:31:21 ERROR Exception when servicing <waitress.channel.HTTPChannel connected thisIsAnIp at 0x1e4d81cf460>
concurrent.futures.process._RemoteTraceback:
Traceback (most recent call last):
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\process.py", line 198, in _process_chunk
    return [fn(*args) for args in chunk]
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\process.py", line 198, in <listcomp>
    return [fn(*args) for args in chunk]
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\pypdfium2\_helpers\document.py", line 525, in _process_page
    result = page.render_to(converter, **kwargs)
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\pypdfium2\_helpers\page.py", line 370, in render_to
    args = (self.render_base(**renderer_kws), renderer_kws)
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\pypdfium2\_helpers\page.py", line 567, in render_base
    pdfium.FPDF_RenderPageBitmap(*render_args)
KeyboardInterrupt
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\waitress\task.py", line 84, in handler_thread
    task.service()
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\waitress\channel.py", line 428, in service
    task.service()
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\waitress\task.py", line 168, in service
    self.execute()
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\waitress\task.py", line 434, in execute
    app_iter = self.channel.server.application(environ, start_response)
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\flask\app.py", line 2548, in __call__
    return self.wsgi_app(environ, start_response)
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\flask\app.py", line 2525, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\flask\app.py", line 1820, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "C:\Users\myName\Documents\Git\mbk\app\views.py", line 318, in auswertung
    pdf = ocr.convert_pdf_to_img_extract_text(massnahmebogen_raw)
  File "C:\Users\myName\Documents\Git\mbk\app\packages\helpers\pdf_ocr.py", line 54, in convert_pdf_to_img_extract_text
    for ele in renderer:
  File "C:\Users\myName\Documents\Git\mbk\venv\lib\site-packages\pypdfium2\_helpers\document.py", line 594, in render_to
    for result, index in pool.map(invoke_renderer, page_indices):
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "C:\Users\myName\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
What could cause this when serving with waitress?