How can I make pytesseract read a slashed 0 correctly?


I am trying to read the phone number in an image. Because the image is very clear, I didn't apply any preprocessing, yet pytesseract sometimes fails to recognize the slashed 0 correctly. I tried training on a similar font, but it gives the same result. An example is this image:

My code is pretty straightforward:

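from PIL import Image
import pytesseract

image = Image.open('Fotolar/0.png')
custom_config = r'--oem 3 --psm 6'  # default OCR engine, treat image as one uniform block of text
print(pytesseract.image_to_string(image, config=custom_config))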

I get this result: '9543 684 9993'

I tried fine-tuning with my own images, but I couldn't get it to work because all the tutorials were Ubuntu-based and I am not familiar with it. Do you have any suggestions?


There are 2 answers

Yusuf UYANIK On BEST ANSWER

I followed this tutorial https://www.youtube.com/watch?v=JPDeiGc2an8&t=444s and used the files and instructions in this repo https://github.com/kevinbicycle/ocrd-train.

The tutorial was pretty clear. If you want to fine-tune like I did, then at the end of the tutorial, instead of typing just "make training", add variables such as "START_MODEL" to the command.
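As a rough sketch of what "adding variables" can look like, the helper below assembles a `make training` command with `START_MODEL` set. Only `START_MODEL` comes from the answer above; `MODEL_NAME` and `TESSDATA` are assumed here from the upstream tesstrain Makefile and may be named differently in this fork, and the model name, paths, and function names are hypothetical.

```python
import subprocess


def build_training_command(model_name, start_model, tessdata_dir):
    """Build a `make training` command that fine-tunes from an existing model.

    The variable names MODEL_NAME and TESSDATA are assumptions based on the
    upstream tesstrain Makefile; check the repo's README before relying on them.
    """
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",    # name of the model to produce
        f"START_MODEL={start_model}",  # existing model to continue training from
        f"TESSDATA={tessdata_dir}",    # directory holding <start_model>.traineddata
    ]


def run_training(cmd, repo_dir):
    """Run the command inside the cloned training repo (hypothetical location)."""
    subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    cmd = build_training_command("slashedzero", "eng", "/usr/share/tessdata")
    print(" ".join(cmd))
    # run_training(cmd, "ocrd-train")  # uncomment once the paths match your setup
```

Building the argument list separately from running it makes it easy to print and check the command before starting a long training run.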

You can also use my slashedzero.traineddata if your problem is identical to mine: https://github.com/yusufuyanik1/SlashedZeroOCR.
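A minimal sketch of how a downloaded slashedzero.traineddata might be used from pytesseract without copying it into the system tessdata folder. It assumes the file sits in a local directory of your choosing and is passed with Tesseract's `--tessdata-dir` option; the function names and directory are hypothetical, and the `--oem 3 --psm 6` values are simply the ones from the question.

```python
def build_tesseract_config(tessdata_dir, psm=6, oem=3):
    """Return a config string that makes Tesseract look for models in tessdata_dir."""
    return f'--tessdata-dir "{tessdata_dir}" --oem {oem} --psm {psm}'


def read_text(image_path, tessdata_dir, lang="slashedzero"):
    """OCR an image with a custom model; expects <lang>.traineddata in tessdata_dir."""
    # Imported lazily so the config helper can be used without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    config = build_tesseract_config(tessdata_dir)
    return pytesseract.image_to_string(image, lang=lang, config=config).strip()


if __name__ == "__main__":
    # Hypothetical folder containing slashedzero.traineddata.
    print(build_tesseract_config("models"))
```

Calling `read_text('Fotolar/0.png', 'models')` would then run the question's image through the custom model, assuming Tesseract itself is installed and on the PATH.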

angus north On

Likely the questioner has moved on in frustration after 2 years, but I've just been banging my head against the same problem for the past day: pytesseract was misinterpreting slashed zeros in a new font I was trying to train on, even in the image I trained it on. This solution worked for me, though I've no idea what it means or why it works: https://groups.google.com/g/tesseract-ocr/c/FE2YDm67-gU

If you don't want to immerse yourself in learning all about OCR training, I recommend the JTessBoxEditor GUI, available on Windows, to make box files for your images, edit the boxes, and then train for the new font. You can put the resulting NewFont.traineddata file into C:\Program Files\Tesseract-OCR\tessdata and then use it in Python with:

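import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (default Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
path_to_image = 'image.png'  # placeholder: put image file path here
image = cv2.imread(path_to_image)
# image_to_data returns a TSV string with word boxes and confidences;
# use image_to_string instead if only the recognized text is needed
string = pytesseract.image_to_data(image, config="-l NewFont")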

To implement the slashed zero fix discussed in https://groups.google.com/g/tesseract-ocr/c/FE2YDm67-gU for JTessBoxEditor, add this line to the end of the config file ..\jTessBoxEditor\tesseract-ocr\tessdata\configs\box.train

edges_use_new_outline_complexity T
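If you would rather not edit the file by hand, a small script can append the line only when it is missing. This is just a sketch: the path in the usage example is a placeholder for wherever jTessBoxEditor is unpacked on your machine, and the function name is hypothetical.

```python
from pathlib import Path

SETTING = "edges_use_new_outline_complexity T"


def ensure_setting(config_path, setting=SETTING):
    """Append `setting` to the config file unless an identical line already exists.

    Returns True if the file was modified, False if the line was already present.
    """
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if any(line.strip() == setting for line in text.splitlines()):
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # make sure the new setting starts on its own line
    path.write_text(text + setting + "\n", encoding="utf-8")
    return True


if __name__ == "__main__":
    # Placeholder path: adjust to your jTessBoxEditor install location.
    box_train = Path("jTessBoxEditor") / "tesseract-ocr" / "tessdata" / "configs" / "box.train"
    if box_train.exists():
        print("updated" if ensure_setting(box_train) else "already set")
```

Running it twice is harmless, since the second call finds the line and leaves the file untouched.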