After trying several available eng.traineddata files with an Android app that employs Tesseract, I have less than stellar accuracy. Since my application will be using just a few fonts (a few font sizes, bold and regular), I thought I could get much better accuracy by building my own data. An example of the kind of thing (an 8.5x11 inch sheet of paper) that users will be taking a picture of is here:
I have looked at jTessBoxEditor, but wondered whether that is an appropriate path to investigate. If it is, I am unsure whether to start from an existing traineddata file or to train from scratch. The font (which looks like Times New Roman) is very common, and I don't want to reinvent the wheel. I also wonder how to treat the font on the two different background colors.
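For the two background colors, one option is to binarize the photo before OCR so that both regions end up as black text on white. The following is only a rough sketch in plain Python, assuming the image is already a grayscale grid of 0-255 values and that the text is darker than the background in both regions. The function name `binarize_by_tiles` and the `tile` and `min_contrast` values are hypothetical choices for illustration, not anything Tesseract provides.

```python
def binarize_by_tiles(gray, tile=32, min_contrast=40):
    """Threshold each tile of a grayscale image separately.

    gray is a list of rows, each a list of ints in 0..255.
    Returns a same-sized grid with 0 for ink and 255 for background.
    Assumption: text is darker than its local background.
    """
    height, width = len(gray), len(gray[0])
    out = [[255] * width for _ in range(height)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            values = [gray[y][x] for y in rows for x in cols]
            lo, hi = min(values), max(values)
            if hi - lo < min_contrast:
                continue  # nearly flat tile: treat it all as background
            threshold = (lo + hi) / 2  # midpoint between ink and paper
            for y in rows:
                for x in cols:
                    if gray[y][x] < threshold:
                        out[y][x] = 0
    return out


# Tiny demo: light background on the left, darker background on the right,
# with one "ink" pixel in each half.
page = [[200] * 4 + [120] * 4 for _ in range(4)]
page[1][1] = 50
page[2][6] = 20
result = binarize_by_tiles(page, tile=4)
```

A single global threshold of 128 would turn the whole darker half black here, whereas the per-tile threshold keeps only the two ink pixels. On real photos a library routine for adaptive thresholding would likely be a better choice than this hand-rolled version.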
Also, I wondered if I could just print out ABC... abc... 123... in Times New Roman and get that into a custom eng.traineddata file. If I understand correctly, you want the 'cleanest' data (i.e. no 'bad examples' of letters) in the source material used to train your system. It seems there should be a tutorial or defined procedure for building trained data for a specific font. If there is, it has been eluding me.
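A minimal sketch of how the text for such a training page could be generated, in Python. It assumes, as I understand the training guidance for the older Tesseract 3.x engine, that each character should appear several times (around ten) and inside word-like groups rather than as a single alphabet row. The function name `build_training_text`, the character set, and the parameter values are illustrative choices, not part of any Tesseract tool.

```python
import random
import string


def build_training_text(chars, samples_per_char=10, line_width=60, seed=0):
    """Return text lines in which every character of `chars` appears
    exactly `samples_per_char` times, shuffled into pseudo-words."""
    rng = random.Random(seed)  # fixed seed so the page is reproducible
    pool = [c for c in chars for _ in range(samples_per_char)]
    rng.shuffle(pool)

    # Cut the shuffled characters into "words" of 3 to 8 characters.
    words = []
    i = 0
    while i < len(pool):
        n = rng.randint(3, 8)
        words.append("".join(pool[i:i + n]))
        i += n

    # Wrap the words into lines no longer than line_width.
    lines, current = [], ""
    for word in words:
        if current and len(current) + 1 + len(word) > line_width:
            lines.append(current)
            current = word
        else:
            current = f"{current} {word}" if current else word
    if current:
        lines.append(current)
    return lines


CHARSET = string.ascii_uppercase + string.ascii_lowercase + string.digits
lines = build_training_text(CHARSET)
```

The resulting lines would then be set in the target font (regular and bold, at the sizes the app will see) and turned into images, with the box files checked in a tool such as jTessBoxEditor. Punctuation and any other symbols that occur on the real forms would need to be added to `CHARSET`.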
I would consider using machine learning. So that you don't have to build everything yourself, look at TensorFlow Mobile, a version of TensorFlow for mobile devices. For help with character recognition, you can look at this article.
This will help you implement a character recognition solution fairly easily, and with this approach you can extend to more fonts later simply by doing more training.