Why doesn't reading text from an image using pytesseract work?


Here is my code:

import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'F:\Installations\tesseract'
print(pytesseract.image_to_string('images/meme1.png', lang='eng'))

And here is the image:
[image: meme1.png, not reproduced here]

And the output is as follows:

GP.
ed <a

= va
ay Roce Thee .
‘ , Pe ship
   
RCAC Tm alesy-3

Pein Reg a

years —
? >
ee bs

I see the word "years" in the output, so it does recognize some of the text, but why doesn't it recognize all of it?


There are 3 answers

docair:

OCR is still a very hard problem in cluttered scenes. You probably won't get better results without some preprocessing of the image. In this specific case it makes sense to threshold the image first, so that only the white regions (i.e., the text) are kept. You can look into OpenCV for this: https://docs.opencv.org/3.4/d7/d4d/tutorial_py_thresholding.html
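
A minimal sketch of that idea (it reuses the question's images/meme1.png path, and the 200 cutoff is a guess that would need tuning for this particular image):

import cv2
import pytesseract

img = cv2.imread('images/meme1.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Keep only near-white pixels, which should be mostly the text
_, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
# Tesseract prefers dark text on a light background, so invert the mask
print(pytesseract.image_to_string(cv2.bitwise_not(mask), lang='eng'))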

Additionally, in your image, there are only two lines of text in arbitrary positions, so it might make sense to play around with page segmentation modes: https://github.com/tesseract-ocr/tesseract/issues/434
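
For example (just one candidate mode worth trying here, not a guaranteed fix):

import pytesseract

# --psm 11 ("sparse text") tells Tesseract to find as much text as possible
# in no particular order, a reasonable first guess for text scattered over a meme
print(pytesseract.image_to_string('images/meme1.png', config='--psm 11'))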

Iyyappan N:

If you first convert the image to grayscale, then threshold it, and finally pass the thresholded result to image_to_string, you get a much better result:

import cv2
import pytesseract


image = cv2.imread('human.png')
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Convert to grayscale, then binarize: pixels brighter than 188 become black
# and everything else white, turning the light text into dark-on-light input
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
val = 188
_, thres = cv2.threshold(gray, val, 255, cv2.THRESH_BINARY_INV)

# --psm 12 treats the page as sparse text with orientation and script detection
text = pytesseract.image_to_string(thres, config='--psm 12')
print(text)

# image_to_boxes returns one line per character: "char x1 y1 x2 y2 page",
# with coordinates measured from the bottom-left corner of the image
boxes = pytesseract.image_to_boxes(thres, config='--psm 12')

for box in boxes.splitlines():
    _, x1, y1, x2, y2, _ = box.split()
    x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2)
    # Flip the y axis, since OpenCV measures from the top-left corner
    cv2.rectangle(image, (x1, image.shape[0] - y1), (x2, image.shape[0] - y2), (0, 255, 0), 2)

cv2.imshow('Image', image)
cv2.waitKey(0)
Esraa Abdelmaksoud:

Let me explain the reason.

Tesseract OCR is a model that was trained on black-and-white images. By black and white, I mean that the pixels of the training images were either 0 or 255; the images weren't even grayscale.

So, when you pass an image to Tesseract, it applies something called Otsu binarization. This technique converts the image to black and white, i.e., 0s and 255s, so that the input resembles what the model was trained to recognize.
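
You can preview roughly what that internal step produces with OpenCV's own Otsu thresholding (a sketch only; Tesseract's actual preprocessing pipeline differs in its details):

import cv2

gray = cv2.imread('images/meme1.png', cv2.IMREAD_GRAYSCALE)
# Otsu picks the threshold automatically, so the 0 passed here is ignored
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imshow('approximate Tesseract input', otsu)
cv2.waitKey(0)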

When you use a scene image like yours, the binarization causes loss of text and leaves random black areas. When those areas are "recognized", you get messy results with low confidence scores.

Thus, you should use an OCR engine that processes images in RGB or BGR (all three color channels), such as PaddleOCR. However, keep in mind that such engines need powerful GPUs when you process video.
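
A minimal sketch of what switching to PaddleOCR might look like (this assumes pip install paddleocr paddlepaddle and follows the 2.x API; the result layout can differ between PaddleOCR versions, so verify against the one you install):

from paddleocr import PaddleOCR

# Detection and recognition models are downloaded automatically on first run
ocr = PaddleOCR(lang='en')
result = ocr.ocr('images/meme1.png')

# In the 2.x API, result holds one entry per image, each a list of
# [bounding_box, (text, confidence)] pairs
for box, (text, confidence) in result[0]:
    print(text, confidence)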

I hope this explanation helps.