Is it possible to write to a PDF file retroactively using pytesseract.image_to_data() output?
For my OCR pipeline, I needed granular access to my PDF's OCR'ed data. I requested it using this method:
ocr_dataframe = pytesseract.image_to_data(
    tesseract_image,
    output_type=pytesseract.Output.DATAFRAME,
    config=PYTESSERACT_CUSTOM_CONFIG
)
Now, I want to extract some tabular data from the PDF using pdfplumber. However, pdfplumber only accepts one of three inputs:
- path to your PDF file
- file object, loaded as bytes
- file-like object, loaded as bytes
I am aware that I can use pytesseract to convert my original pdf to a searchable one (in bytes representation) using the following method:
# Get a searchable PDF
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')
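For reference, here is a minimal sketch of how those bytes could be handed to pdfplumber without writing a file to disk. The helper names pdf_bytes_to_file and extract_tables_from_image are made up for illustration, and the sketch assumes pdfplumber, pytesseract, and the Tesseract binary are installed. Note that this still runs OCR again; it only shows the in-memory route.

```python
import io


def pdf_bytes_to_file(pdf_bytes: bytes) -> io.BytesIO:
    """Wrap raw PDF bytes in an in-memory, seekable file-like object."""
    return io.BytesIO(pdf_bytes)


def extract_tables_from_image(image_path: str) -> list:
    """OCR an image into a searchable PDF and pull tables from every page."""
    # Third-party imports are kept local so the helper above has no dependencies.
    import pdfplumber
    import pytesseract

    # image_to_pdf_or_hocr returns the searchable PDF as bytes.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")

    # pdfplumber.open accepts a file-like object as well as a path.
    with pdfplumber.open(pdf_bytes_to_file(pdf_bytes)) as pdf:
        return [page.extract_tables() for page in pdf.pages]
```

Calling extract_tables_from_image('test.png') would return one list of tables per page.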
However, I would like to avoid OCR'ing my PDFs twice. Is it possible to combine the output from pytesseract.image_to_data() with the original image and create some kind of bytes representation?
Any help would be much appreciated!
Okay, so I am pretty sure that what I was trying to do is impossible.
By nature, pytesseract.Output.DATAFRAME produces a pandas DataFrame. Nowhere in that data structure is the original image. The output is just rows and columns of text data. No pixels, no nothing.
Instead, I created a class that holds the original image and the OCR output dataframe at the same time. Here is what the instance initialization looks like:
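The original snippet is not reproduced here, so what follows is only a minimal sketch of what such a class might look like, not the exact code. The names OcrPage, from_image, and words are hypothetical. The sketch assumes pandas is installed, and from_image additionally assumes pytesseract and Pillow are available.

```python
from dataclasses import dataclass
from typing import TYPE_CHECKING

import pandas as pd

if TYPE_CHECKING:  # Pillow is only needed for type hints here
    from PIL import Image


@dataclass
class OcrPage:
    """Hypothetical container pairing a page image with its OCR results."""

    image: "Image.Image"     # original page image, kept in memory
    ocr_data: pd.DataFrame   # output of pytesseract.image_to_data(...)

    @classmethod
    def from_image(cls, image: "Image.Image", config: str = "") -> "OcrPage":
        """Run OCR once and keep the image and the results together."""
        # Imported lazily so instances can also be built from cached OCR output.
        import pytesseract

        data = pytesseract.image_to_data(
            image,
            output_type=pytesseract.Output.DATAFRAME,
            config=config,
        )
        return cls(image=image, ocr_data=data)

    def words(self, min_conf: float = 0.0) -> pd.DataFrame:
        """Return rows with non-blank text and confidence >= min_conf."""
        df = self.ocr_data
        has_text = df["text"].notna() & (df["text"].astype(str).str.strip() != "")
        return df[has_text & (df["conf"] >= min_conf)]
```

With this layout, OCR runs once in from_image, and later steps read page.image or page.words(min_conf=60) from the same object.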
This may be a little memory intensive, but I want to avoid having to write many images.