Parallel processing fails with Paddle OCR


I am trying to run Paddle OCR on several images in parallel. Please see the method predictx_parallel below.

predictx_parallel(input_images, ocr_params, num_threads=1) - with num_threads=1 it works every time.

predictx_parallel(input_images, ocr_params, num_threads=2) - with num_threads>1 it fails with varying errors.

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple
import numpy as np
from PIL.Image import Image
# OcrParams and get_ocr_obj are my own helpers (not shown here)

def predictx_parallel(input_images: List[Image], ocr_params: OcrParams, num_threads: int) -> Tuple[List[Dict], List[Image]]:
    def ocr_image(image):
        # type(image) : <class 'PIL.Image.Image'>
        image_array = np.array(image)   # type(image_array) : <class 'numpy.ndarray'>
        # Example result:
        # [[[[[381.0, 285.0], [537.0, 285.0], [537.0, 333.0], [381.0, 333.0]], ('PAGE1A', 0.9997838139533997)], [[[388.0, 371.0], [530.0, 371.0], [530.0, 419.0], [388.0, 419.0]], ('PAGE1B', 0.998117983341217)]]]
        results = ocr.ocr(image_array)
        return results

    # One ocr object, shared by every worker thread
    ocr = get_ocr_obj(params=ocr_params)  # <class 'paddleocr.paddleocr.PaddleOCR'>

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        results = list(executor.map(ocr_image, input_images))

Example errors:

UnimplementedError: Currently, only can set dims from DenseTensor or SelectedRows. (at /paddle/paddle/fluid/framework/infershape_utils.cc:314)
      [operator < fused_conv2d > error]
NotFoundError: Variable Id 29797 is not registered.
      [Hint: Expected it != Instance().id_to_type_map_.end(), but received it == Instance().id_to_type_map_.end().] (at /paddle/paddle/fluid/framework/var_type_traits.cc:103)
      [operator < fused_conv2d > error]

Versions: paddlepaddle==2.6.0, paddleocr==2.7.0.3, python==3.9.12

Any suggestions, please?

--- EDIT ---

I got it to work. As I understand it, the trick is to create separate Paddle OCR objects instead of sharing one.

CAUSE: I created a single ocr object and shared that same object across multiple threads.
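
One way to check a diagnosis like this while keeping the thread pool is to serialize the calls with a lock: if the errors disappear, concurrent access to the shared object is the likely culprit. The sketch below is only an illustration. FakeEngine and run_serialized are hypothetical names, and FakeEngine is a stand-in that counts concurrent callers, not PaddleOCR itself. Note that the lock removes the parallelism for the OCR call, so this is a diagnostic, not a speed-up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeEngine:
    """Hypothetical stand-in for a non-thread-safe OCR object (not PaddleOCR)."""

    def __init__(self):
        self.active = 0       # threads currently inside ocr()
        self.max_active = 0   # highest concurrency observed so far

    def ocr(self, image):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        result = f"text:{image}"
        self.active -= 1
        return result


def run_serialized(engine, images, num_threads):
    """Keep the thread pool, but let only one thread call engine.ocr at a time."""
    lock = threading.Lock()

    def task(image):
        with lock:  # every call to the shared engine goes through this lock
            return engine.ocr(image)

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # executor.map returns results in input order
        return list(executor.map(task, images))
```

With a real engine, the same task wrapper could be placed around ocr.ocr(image_array) in predictx_parallel.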

FIX: I switched to multiprocessing and created a new ocr instance in each process. It worked.

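# pipeline.py
from multiprocessing import Pool
from ocr_processing import ocr_image_x  # worker function, defined below

def predictx_parallel_processes(input_images, num_processes):
    with Pool(processes=num_processes) as pool:
        # Return the per-image results instead of discarding them
        return pool.map(ocr_image_x, input_images)

# ocr_processing.py
import logging
import os
import numpy as np
from paddleocr import PaddleOCR

logger = logging.getLogger(__name__)  # assumed: standard library logging

def ocr_image_x(image):
    process_pid = os.getpid()
    logger.info(f"Process PID: {process_pid}")
    ocr = PaddleOCR()  # Create a new ocr object on every call, so none is shared
    image_array = np.array(image)
    results = ocr.ocr(image_array)
    logger.info(results)
    return results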
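
The version above builds a new PaddleOCR object for every image, which reloads the model each time. A possible refinement, sketched here and not tested against PaddleOCR, is to build one object per worker process with the Pool initializer and reuse it for every image that process handles. The names make_paddle_ocr, init_worker, ocr_one and predict_with_pool are hypothetical, and image_arrays is assumed to be a list of numpy arrays like image_array above.

```python
from multiprocessing import Pool

_engine = None  # one engine per worker process, set by init_worker


def make_paddle_ocr():
    """Default factory; imports paddleocr lazily, inside the worker process."""
    from paddleocr import PaddleOCR
    return PaddleOCR()


def init_worker(factory=make_paddle_ocr):
    """Runs once in each worker process and builds that process's engine."""
    global _engine
    _engine = factory()


def ocr_one(image_array):
    """Runs in a worker; reuses the engine created by init_worker."""
    return _engine.ocr(image_array)


def predict_with_pool(image_arrays, num_processes, factory=make_paddle_ocr):
    """OCR all images using num_processes workers, one engine per worker."""
    # factory must be picklable (a top-level function or class), because it
    # is sent to each worker through initargs.
    with Pool(processes=num_processes,
              initializer=init_worker,
              initargs=(factory,)) as pool:
        return pool.map(ocr_one, image_arrays)
```

On platforms that start workers with spawn (Windows, macOS), the call to predict_with_pool should sit under an if __name__ == "__main__": guard. Passing a different factory, for example one that forwards the ocr_params from the question, keeps the same structure.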