Converting pytesseract.Output.DATAFRAME output into bytes or an OCR'ed PDF


Is it possible to write to a pdf file retroactively using pytesseract.image_to_data() output?

For my OCR pipeline, I needed granular access to my PDF's OCR'ed data. I obtained it with the following call:

ocr_dataframe = pytesseract.image_to_data(
    tesseract_image,
    output_type=pytesseract.Output.DATAFRAME,
    config=PYTESSERACT_CUSTOM_CONFIG,
)

Now, I want to extract some tabular data from the PDF using pdfplumber. However, pdfplumber only accepts one of three inputs:

  • path to your PDF file
  • file object, loaded as bytes
  • file-like object, loaded as bytes
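
The second and third options mean that PDF bytes already held in memory do not need to be written to disk first. A minimal sketch, assuming pdfplumber is installed and that `pdf_bytes` holds a complete PDF; the helper names `as_file_like` and `extract_tables_from_bytes` are made up for illustration:

```python
import io


def as_file_like(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in a seekable, file-like object."""
    # A well-formed PDF starts with the "%PDF" header; fail early otherwise.
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("data does not look like a PDF")
    return io.BytesIO(pdf_bytes)


def extract_tables_from_bytes(pdf_bytes: bytes) -> list:
    """Return the tables pdfplumber finds on each page of an in-memory PDF."""
    import pdfplumber  # third-party; imported lazily so as_file_like works without it

    with pdfplumber.open(as_file_like(pdf_bytes)) as pdf:
        return [page.extract_tables() for page in pdf.pages]
```

The result is one list of tables per page. Whether any tables are found depends on the PDF containing a real text layer.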

I am aware that I can use pytesseract to convert my original pdf to a searchable one (in bytes representation) using the following method:

# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')

However, I would like to avoid ocr'ing my pdfs twice. Is it possible to combine the output from pytesseract.image_to_data() with the original image and create some kind of bytes representation?

Any help would be much appreciated!

There is 1 answer

abrezey (best answer)

Okay, so I am now fairly sure that what I was trying to do is not possible.

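By nature, pytesseract.Output.DATAFRAME produces a pandas DataFrame. Nowhere in that data structure is the original image. The output is just rows and columns: one row per detected element, with its text, confidence, and bounding-box coordinates. No pixels, so there is nothing to build a PDF page from.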
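
To make the "rows and columns" concrete, here is a rough sketch that rebuilds lines of text from the word-level rows. It relies on the column names that image_to_data() returns (page_num, block_num, par_num, line_num, left, conf, text). The function name `words_to_lines` and the `min_conf` threshold are my own, and it assumes the text column holds strings:

```python
from collections import defaultdict


def words_to_lines(records, min_conf=0):
    """Group word-level OCR rows into lines of text, in reading order."""
    lines = defaultdict(list)
    for row in records:
        text = row["text"]
        # Non-word rows (page, block, paragraph, line) carry no text;
        # pandas represents the missing value as NaN, a float rather than a str.
        if not isinstance(text, str) or not text.strip():
            continue
        # Skip words whose confidence is below the chosen threshold.
        if float(row["conf"]) < min_conf:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines[key].append((row["left"], text))
    # Sort lines by their key, and words within a line from left to right.
    return [" ".join(t for _, t in sorted(words)) for _, words in sorted(lines.items())]
```

With the DataFrame from the question, this would be called as `words_to_lines(ocr_dataframe.to_dict("records"))`.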

Instead, I created a class that could hold the original image and the ocr output dataframe at the same time. Here is what the instance initialization looks like:

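import pathlib

import cv2
import pytesseract


class OcrPage:  # hypothetical name; the class statement was not part of the original snippet
    def __init__(self, temp_image_path):
        self.image_path = pathlib.Path(temp_image_path)
        # cv2.imread expects a str path and returns None if the file cannot be read
        self.image = cv2.imread(str(self.image_path), cv2.IMREAD_GRAYSCALE)
        if self.image is None:
            raise FileNotFoundError(f"Could not read image: {self.image_path}")
        self.ocr_dataframe = self.ocr()

    def ocr(self):
        # Preprocess the image in prep for pytesseract OCR.
        # ocr_preprocess is my own helper, defined elsewhere (not shown).
        tesseract_image = ocr_preprocess(self.image)

        # OCR the image once using pytesseract and keep the result as a DataFrame.
        # PYTESSERACT_CUSTOM_CONFIG is defined elsewhere (not shown).
        ocr_dataframe = pytesseract.image_to_data(
            tesseract_image,
            output_type=pytesseract.Output.DATAFRAME,
            config=PYTESSERACT_CUSTOM_CONFIG,
        )
        return ocr_dataframe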


This may be a little memory-intensive, since every instance keeps its image in memory, but I wanted to avoid having to write many images to disk.
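
If memory does become a concern, one option is to load the image and run OCR only on first access, and to drop the pixels once the OCR result exists. This is a sketch of that idea, not the class I actually used. `LazyOcrPage`, `load_image`, and `run_ocr` are hypothetical names, and the two callables stand in for cv2.imread and the pytesseract call:

```python
from functools import cached_property


class LazyOcrPage:
    """Hold an image path; compute the image and OCR result only on first use."""

    def __init__(self, image_path, load_image, run_ocr):
        self.image_path = image_path
        self._load_image = load_image  # e.g. a wrapper around cv2.imread
        self._run_ocr = run_ocr        # e.g. a wrapper around pytesseract.image_to_data

    @cached_property
    def image(self):
        # Runs once; the result is cached in the instance __dict__.
        return self._load_image(self.image_path)

    @cached_property
    def ocr_dataframe(self):
        # Runs once; triggers loading the image if that has not happened yet.
        return self._run_ocr(self.image)

    def release_image(self):
        """Drop the cached pixels; the cached OCR result is kept."""
        self.__dict__.pop("image", None)
```

After `release_image()`, accessing `image` again simply reloads it from `image_path`, so nothing is lost except the time to re-read the file.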