I have been trying to extract data from scanned pdf documents. I have converted the pdf file into jpeg file (I have attached the image link below), cropped the words and numbers with different fonts, merged into a tiff file and trained the fonts using jTessBoxEditor to generate a new language and I used that language in Tesseract-OCR to extract the data from the file. But I couldn't extract the exact data. The text recognition accuracy of tesseract-ocr is very poor.
Can someone suggest a method to improve the accuracy?
Did you use
image_to_string
method?Part of Output:
Updated:
Inst No
is extracted usingpytesseract