I'm trying to build a sample application in java for Japaneses language that will read an image file and just output the text extracted from the image. I found one sample application on net which is running perfect for English Language but not for Japanees it is giving unidentified text, following is my code:
BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
// Initialize tesseract-ocr with japanees, without specifying tessdata path
if (api.Init(".", "jpn") != 0) {
System.err.println("Could not initialize tesseract.");
System.exit(1);
}
// Open input image with leptonica library
PIX image = pixRead("test.png");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
String string = outText.getString();
assertTrue(!string.isEmpty());
System.out.println("OCR output:\n" + string);
// Destroy used object and release memory
api.End();
outText.deallocate();
pixDestroy(image);
my output is: OCR output: ETCカー-ード申 込書 �申�込�日 09/02/2017 ETC FeatureID ETCFFL ー申込枚輩交 画 枚
i has used jpn.tessdata and my application is reading tessdata file also. is any more configration needed? i'm using Tessaract 3.02 version with very clean image.
Yes! i got the solution, what we need to do is to set the locale in our java code as follows: olocale = new Locale.Builder().setLanguage("ja").setRegion("JP").build(); we can set locale for English language also in order to extract both Japanese as well as English text from Image.
now it is working like charm for me!!