I am trying to run PaddleOCR in parallel. Please see the function predictx_parallel below.
predictx_parallel(input_images, ocr_params, num_threads=1) always works.
predictx_parallel(input_images, ocr_params, num_threads=2), or any num_threads > 1, fails with varying errors.
def predictx_parallel(input_images: List[Image], ocr_params: OcrParams, num_threads: int) -> Tuple[List[Dict], List[Image]]:
    def ocr_image(image):
        image_array = np.array(image)  # type(image) : <class 'PIL.Image.Image'>
        # type(image_array) : <class 'numpy.ndarray'>
        # Example result:
        # [[[[[381.0, 285.0], [537.0, 285.0], [537.0, 333.0], [381.0, 333.0]], ('PAGE1A', 0.9997838139533997)], [[[388.0, 371.0], [530.0, 371.0], [530.0, 419.0], [388.0, 419.0]], ('PAGE1B', 0.998117983341217)]]]
        results = ocr.ocr(image_array)
        return results

    ocr = get_ocr_obj(params=ocr_params)  # <class 'paddleocr.paddleocr.PaddleOCR'>
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        results = list(executor.map(ocr_image, input_images))
Errors seen with num_threads > 1:

UnimplementedError: Currently, only can set dims from DenseTensor or SelectedRows. (at /paddle/paddle/fluid/framework/infershape_utils.cc:314)
[operator < fused_conv2d > error]

NotFoundError: Variable Id 29797 is not registered.
[Hint: Expected it != Instance().id_to_type_map_.end(), but received it == Instance().id_to_type_map_.end().] (at /paddle/paddle/fluid/framework/var_type_traits.cc:103)
[operator < fused_conv2d > error]
paddlepaddle==2.6.0
paddleocr==2.7.0.3
python==3.9.12
Any suggestions, please?
--- EDIT ---
I got it to work. As I understand it, the trick is to stop sharing one PaddleOCR object between workers and create new objects instead.
CAUSE: I created a single ocr object and used that same object across multiple threads.
FIX: I switched to multiprocessing and create a new ocr instance in each process. It worked.
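# pipeline.py
from multiprocessing import Pool
from ocr_processing import ocr_image_x

def predictx_parallel_processes(input_images, num_processes):
    with Pool(processes=num_processes) as pool:
        return pool.map(ocr_image_x, input_images)

# ocr_processing.py
import logging
import os
import numpy as np
from paddleocr import PaddleOCR

logger = logging.getLogger(__name__)  # assumed: a standard-library logger

def ocr_image_x(image):
    process_pid = os.getpid()
    logger.info(f"Process PID: {process_pid}")
    ocr = PaddleOCR()  # Create a new ocr object on every call
    image_array = np.array(image)
    results = ocr.ocr(image_array)
    logger.info(results)
    return results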