How can I make pytesseract read a slashed 0 correctly?

Question

Asked by Yusuf UYANIK on 22 July 2020 at 12:11

I am trying to read the phone number in an image. Because the image is very clear, I didn't apply any preprocessing, yet pytesseract sometimes fails to recognize the slashed 0 correctly. I tried training on a similar font, but the result was the same. An example is this image.

My code is pretty straightforward:

from PIL import Image
import pytesseract

image = Image.open('Fotolar/0.png')
custom_config = r'--oem 3 --psm 6'  # default OCR engine, single uniform block of text
pytesseract.image_to_string(image, config=custom_config)

I get this result: '9543 684 9993'

I tried fine-tuning with my own images, but I couldn't get it working because all the tutorials were Ubuntu-based and I am not familiar with Ubuntu. Do you have any suggestions?

There are 2 answers

angus north On 06 June 2023 at 11:20

Likely the questioner has moved on in frustration after 2 years, but I've just been banging my head against the same problem for the past day: pytesseract was misinterpreting slashed zeros in a new font I was trying to train on, even in the image I trained it on. This solution worked for me, though I've no idea what it means or why it works: https://groups.google.com/g/tesseract-ocr/c/FE2YDm67-gU

If you don't want to immerse yourself in learning all about OCR training, I recommend the JTessBoxEditor GUI, which is available on Windows. Use it to make box files for your images, edit the boxes, and then train for the new font. Put the resulting NewFont.traineddata file into C:\Program Files\Tesseract-OCR\tessdata, then use it in Python with:

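import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (default Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
path_to_image = r'image.png'  # placeholder: put image file path here
image = cv2.imread(path_to_image)
# lang="NewFont" selects NewFont.traineddata from the tessdata folder
string = pytesseract.image_to_data(image, lang="NewFont")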
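Note that image_to_data returns per-word boxes and confidences rather than just the text. A minimal sketch of pulling the recognised words out of that result, assuming the NewFont model above is installed. The helper names words_from_data and read_words are hypothetical and not part of pytesseract; only image_to_data and Output.DICT come from the library:

```python
def words_from_data(data, min_conf=0.0):
    """Join the recognised words from a pytesseract image_to_data dict.

    `data` is expected to hold parallel 'text' and 'conf' lists, as returned
    with output_type=Output.DICT. Entries with empty text or a confidence
    below `min_conf` (structural rows report -1) are skipped.
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        if text.strip() and float(conf) >= min_conf:
            words.append(text.strip())
    return " ".join(words)


def read_words(image_path, lang="NewFont"):
    """Run Tesseract on an image file and return the recognised words."""
    # Imported here so words_from_data can be used without Tesseract installed.
    import cv2
    import pytesseract
    from pytesseract import Output

    image = cv2.imread(image_path)
    data = pytesseract.image_to_data(image, lang=lang, output_type=Output.DICT)
    return words_from_data(data)
```

Raising min_conf is one way to spot digits the model is unsure about, which may help when checking whether the slashed zeros are now read reliably.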

To implement the slashed zero fix discussed in https://groups.google.com/g/tesseract-ocr/c/FE2YDm67-gU for JTessBoxEditor, add this line to the end of the config file ..\jTessBoxEditor\tesseract-ocr\tessdata\configs\box.train

edges_use_new_outline_complexity T
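If you prefer not to edit the file by hand, the same change can be scripted. This is only a sketch: ensure_config_line is a hypothetical helper, and the path you pass in is whatever box.train location your JTessBoxEditor install actually uses.

```python
from pathlib import Path


def ensure_config_line(config_path, line="edges_use_new_outline_complexity T"):
    """Append `line` to the config file unless it is already present.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    content = path.read_text(encoding="utf-8")
    if line in (existing.strip() for existing in content.splitlines()):
        return False
    # Make sure the new setting starts on its own line.
    if content and not content.endswith("\n"):
        content += "\n"
    path.write_text(content + line + "\n", encoding="utf-8")
    return True
```

Because the function checks for the line first, running it more than once leaves the file unchanged after the first call.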

Yusuf UYANIK On 14 August 2020 at 06:20 (Accepted Answer)

I followed this tutorial, https://www.youtube.com/watch?v=JPDeiGc2an8&t=444s, and used the files and instructions in this repo: https://github.com/kevinbicycle/ocrd-train.

The tutorial was pretty clear. If you want to fine-tune like I did, then at the end of the tutorial, instead of typing just "make training", add some of the variables, such as "START_MODEL".
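As a rough illustration of what adding variables to "make training" can look like, here is a sketch that assembles the command from Python. Only START_MODEL is named in this answer; MODEL_NAME and TESSDATA are assumed from the ocrd-train/tesstrain Makefile and should be checked against the Makefile in the repo you cloned. The function names and all values passed in are hypothetical.

```python
import subprocess


def build_training_command(model_name, start_model, tessdata_dir, extra_vars=None):
    """Return the `make training` argument list for a fine-tuning run."""
    variables = {
        "MODEL_NAME": model_name,    # assumed Makefile variable: name of the new model
        "START_MODEL": start_model,  # existing model to fine-tune from
        "TESSDATA": tessdata_dir,    # assumed Makefile variable: folder holding the start model
    }
    variables.update(extra_vars or {})
    return ["make", "training"] + [f"{key}={value}" for key, value in variables.items()]


def run_training(repo_dir, **kwargs):
    """Run the assembled command from inside the cloned training repo."""
    subprocess.run(build_training_command(**kwargs), cwd=repo_dir, check=True)
```

Printing the result of build_training_command before calling run_training is a cheap way to confirm the command matches what the tutorial asks you to type.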

You can also use my slashedzero.traineddata if your problem is identical to mine: https://github.com/yusufuyanik1/SlashedZeroOCR.
