Python Tesseract OCR training to a specific list of words


I am quite new to OCR and to Tesseract.

So far I have a working script that extracts fairly good text from images.

My question: is it possible to train Tesseract to return only words/characters that are present in some kind of dictionary file?

For example, I have a .txt file with a large list of person names, and I want to teach Tesseract that "SONIA" is not "50NlA", that "YANNICK" is not "VANNlD", and so on.

If Tesseract has a list of all possible names, will it give better accuracy? Also, if the original image contains a lot of person names along with other information about those people, and I only want to retrieve the names from the OCR output and ignore the "noisy" information, what can I do? Sorry if this is a stupid question.
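To make the goal concrete, here is a minimal sketch of the kind of filtering I have in mind, done as post-processing on the OCR text rather than inside Tesseract. It only uses Python's standard `difflib`. The names `CONFUSIONS`, `normalize`, `match_name`, `extract_names` and the sample text are my own hypothetical choices, and the confusion map and the 0.6 cutoff are guesses that would need tuning. It also assumes the names are printed in upper case.

```python
import difflib

# Hypothetical map of characters that OCR tends to confuse with letters.
# Assumes upper-case names, so a lower-case "l" is treated as a misread "I".
CONFUSIONS = str.maketrans({"5": "S", "0": "O", "1": "I", "l": "I", "|": "I"})


def normalize(token):
    """Undo the assumed digit/letter confusions and upper-case the token."""
    return token.translate(CONFUSIONS).upper()


def match_name(token, names, cutoff=0.6):
    """Return the closest known name for an OCR token, or None if nothing is close."""
    candidates = difflib.get_close_matches(normalize(token), names, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


def extract_names(ocr_text, names, cutoff=0.6):
    """Keep only the whitespace-separated tokens that resemble a known name."""
    matches = (match_name(token, names, cutoff) for token in ocr_text.split())
    return [m for m in matches if m is not None]


known_names = ["SONIA", "YANNICK"]  # in practice, loaded from the .txt file
print(extract_names("Name: 50NlA Age: 34 Name: VANNlD", known_names))
```

This would not make Tesseract itself more accurate. It only snaps each token to the nearest entry in the list and drops everything else, so a low cutoff could turn unrelated words into names.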

I have read this thread https://groups.google.com/forum/#!topic/tesseract-ocr/r5qkHxQOT98 and the manual http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html, and I created the eng.user-words and bazaar files. What should the next step be? So far it gives me the same output as before.

Thanks so much for your time and patience.
