How to build OCR data for Android app with known layout and font

After trying several available eng.traineddata files with an Android app that employs Tesseract, I am getting less than stellar accuracy. Since my application will be using just a few fonts (in a few sizes, bold and regular), I thought I could get much better accuracy by building my own data. An example of the kind of thing (an 8.5x11 inch sheet of paper) that users will be taking a picture of is here: Sample menu that users will be scanning

I have looked at jTessBoxEditor, but wondered whether that is an appropriate path to investigate, and if so, whether I should start from an existing eng.traineddata or train from scratch. The font (which looks like Times New Roman) is very common, and I didn't want to re-invent the wheel. I also wondered how to handle the same font appearing on two different background colors.

Also, I wondered if I could just print out ABC... abc... 123... in Times New Roman and turn that into a custom eng.traineddata file. If I understand correctly, you want the 'cleanest' data (i.e., no 'bad' examples of letters) in the source material used to train the system. It would also seem as if there should be a tutorial or defined procedure for building trained data for a specific font. If there is, it has been eluding me.
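For what it's worth, the closest thing I have found is the legacy (Tesseract 3.x) single-font training flow. The sketch below is my rough understanding of it, wrapped in Python only so the steps sit in one place; the training text, font name, fonts directory, and output names are my own placeholders, and I am not sure this is the right starting point.

```python
# Rough sketch of the legacy Tesseract 3.x single-font training steps.
# Assumes the Tesseract training tools (text2image, unicharset_extractor,
# mftraining, cntraining, combine_tessdata) are installed; all file names,
# the font name, and the fonts directory are placeholders.
import shutil
import subprocess

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

base = "eng.TimesNewRoman.exp0"  # convention: <lang>.<fontname>.exp<num>

# 1. Render the training text in the target font to a .tif/.box pair
run("text2image",
    "--text=training_text.txt",
    f"--outputbase={base}",
    "--font=Times New Roman",
    "--fonts_dir=/usr/share/fonts")

# 2. Generate the .tr feature file from the image and box file
run("tesseract", f"{base}.tif", base, "box.train")

# 3. Extract the character set seen in the box file (writes 'unicharset')
run("unicharset_extractor", f"{base}.box")

# 4. font_properties line format: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
#    The font name must match the <fontname> part of the .tr filename
#    (details vary slightly between Tesseract versions).
with open("font_properties", "w") as f:
    f.write("TimesNewRoman 0 0 0 1 0\n")

# 5. Cluster features and build the classifier data files
run("mftraining", "-F", "font_properties", "-U", "unicharset",
    "-O", "eng.unicharset", f"{base}.tr")
run("cntraining", f"{base}.tr")

# 6. Rename outputs with the language prefix and pack eng.traineddata
for name in ("inttemp", "pffmtable", "shapetable", "normproto"):
    shutil.move(name, f"eng.{name}")
run("combine_tessdata", "eng.")
```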

1 Answer

James Black

I would consider using machine learning, but so that you don't have to build everything yourself, look at TensorFlow Mobile. It is the version of TensorFlow intended for mobile devices, and for the character-recognition part you can look at this article.

To train any neural net, you must provide a set of training data along with the correct outputs. In this case that will be a set of 128x64 images, each paired with the expected character.
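A minimal sketch of such a classifier, assuming a Keras/TensorFlow setup; the class count, layer sizes, and placeholder data are my own assumptions, not from the linked article:

```python
# Minimal sketch: a small CNN that classifies 128x64 grayscale character
# images.  Replace the placeholder arrays with your real training data.
import numpy as np
from tensorflow import keras

NUM_CLASSES = 62          # e.g. A-Z, a-z, 0-9 -- adjust to your character set
IMG_H, IMG_W = 64, 128    # height x width of each character image

model = keras.Sequential([
    keras.layers.Conv2D(32, 3, activation="relu",
                        input_shape=(IMG_H, IMG_W, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# x_train: float32 array of shape (N, 64, 128, 1), values in [0, 1]
# y_train: integer labels in [0, NUM_CLASSES)
x_train = np.random.rand(16, IMG_H, IMG_W, 1).astype("float32")  # placeholder data
y_train = np.random.randint(0, NUM_CLASSES, size=16)             # placeholder labels

model.fit(x_train, y_train, epochs=1, batch_size=8)
model.save("char_classifier.keras")  # placeholder output path
```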

This will let you implement character recognition fairly easily, and with this approach you can extend to more fonts later simply by doing more training.
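If you ship the model with TensorFlow Lite (the current successor to TensorFlow Mobile), the export step for Android looks roughly like the sketch below; the model file names are placeholders carried over from the previous snippet:

```python
# Rough sketch of converting the trained Keras model to a .tflite file
# that can be bundled with the Android app.
import tensorflow as tf
from tensorflow import keras

model = keras.models.load_model("char_classifier.keras")  # placeholder path

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open("char_classifier.tflite", "wb") as f:
    f.write(tflite_model)
```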