I want to extract the text from an image in Python. To do that, I chose pytesseract. When I tried extracting the text from the image, the results weren't satisfactory. I also went through this and implemented all the techniques listed there, yet it still doesn't perform well.
Image:
Code:
import pytesseract
import cv2
import numpy as np
img = cv2.imread('D:\\wordsimg.png')
img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
txt = pytesseract.image_to_string(img, lang='eng')
txt = txt[:-1]
txt = txt.replace('\n',' ')
print(txt)
Output:
t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was
Even a single unwanted space could cost me a lot, so I need the results to be 100% accurate. Any help would be appreciated. Thanks!
I changed the resize factor from 1.2 to 2 and removed all the preprocessing. I got good results with psm 11 and psm 12.
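A minimal sketch of that change, assuming the same image and pytesseract setup as in the question. The function names `ocr_image` and `normalize_whitespace` are hypothetical, introduced only for illustration, and the OpenCV and pytesseract imports are placed inside the function so the whitespace helper can be used on its own.

```python
def normalize_whitespace(txt):
    """Collapse newlines and repeated spaces into single spaces."""
    return ' '.join(txt.split())


def ocr_image(image_path, psm, scale=2.0):
    """Upscale the image and run Tesseract with the given page segmentation mode.

    Hypothetical helper: no dilation, erosion, blur or thresholding is applied,
    only the resize described above.
    """
    # Imported here so normalize_whitespace works without OpenCV/pytesseract.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)

    # Resize by a factor of 2 instead of 1.2
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)

    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, lang='eng', config=config)
    return normalize_whitespace(txt)
```

Something like `for psm in (11, 12): print(psm, ocr_image('D:\\wordsimg.png', psm))` would then print one result per mode. On Windows, `pytesseract.pytesseract.tesseract_cmd` still has to be set as in the question. Splitting on whitespace and rejoining is used here instead of `txt[:-1]` plus `replace`, since it also removes doubled spaces left by blank lines.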
The line

config = '--oem 3 --psm %d' % psm

uses the string interpolation (%) operator to replace %d with an integer (psm). I'm not exactly sure what oem does, but I've gotten in the habit of using it. More on psm at the end of this answer.

psm is short for page segmentation mode. I'm not exactly sure what the different modes are, but you can get a feel for what the codes mean from their descriptions. You can get the list from tesseract --help-psm.
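To make the interpolation concrete, here is a small sketch that builds the config string for the two modes that worked well. The helper `make_config` and the `PSM_NOTES` table are hypothetical names; the descriptions are paraphrased from the `tesseract --help-psm` output and may be worded differently in your Tesseract version.

```python
# A few page segmentation modes, paraphrased from `tesseract --help-psm`.
# This is not the full list.
PSM_NOTES = {
    3: 'fully automatic page segmentation, no OSD (default)',
    6: 'assume a single uniform block of text',
    11: 'sparse text, find as much text as possible in no particular order',
    12: 'sparse text with OSD',
}


def make_config(psm, oem=3):
    """Build the config string with the % operator; each %d is replaced by an integer."""
    return '--oem %d --psm %d' % (oem, psm)


for psm in (11, 12):
    config = make_config(psm)
    # The f-string form produces the same text as the % form.
    assert config == f'--oem 3 --psm {psm}'
    print(config, '->', PSM_NOTES[psm])
```

Both sparse-text modes suit an image of scattered words like the one in the question, which may be why they did better than the default.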