Tesseract is giving junk data as output for the Japanese language


I'm trying to build a sample application in Java for the Japanese language that reads an image file and outputs the text extracted from the image. I found a sample application on the net that runs perfectly for English, but for Japanese it produces unidentifiable text. Following is my code:

    BytePointer outText;

    TessBaseAPI api = new TessBaseAPI();
    // Initialize tesseract-ocr with Japanese, without specifying tessdata path
    if (api.Init(".", "jpn") != 0) {
        System.err.println("Could not initialize tesseract.");
        System.exit(1);
    }

    // Open input image with leptonica library
    PIX image = pixRead("test.png");
    api.SetImage(image);
    // Get OCR result
    outText = api.GetUTF8Text();
    String string = outText.getString();
    assertTrue(!string.isEmpty());
    System.out.println("OCR output:\n" + string);

    // Destroy used object and release memory
    api.End();
    outText.deallocate();
    pixDestroy(image);

My output is: OCR output: ETCカー-ード申 込書 �申�込�日 09/02/2017 ETC FeatureID ETCFFL ー申込枚輩交 画 枚

I have used jpn.tessdata, and my application is reading the tessdata file as well. Is any further configuration needed? I'm using Tesseract version 3.02 with a very clean image.


There is 1 answer

Aditya

Yes! I found the solution. What we need to do is set the locale in our Java code as follows: olocale = new Locale.Builder().setLanguage("ja").setRegion("JP").build(); We can also set a locale for English in order to extract both Japanese and English text from the image.
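For reference, here is a minimal sketch of how that locale could be built and applied. The answer does not say how olocale is used afterwards, so the call to Locale.setDefault is an assumption, and the class and method names (LocaleSketch, japaneseLocale) are made up for illustration only:

```java
import java.util.Locale;

public class LocaleSketch {
    // Builds the ja-JP locale with the same builder calls as in the answer.
    static Locale japaneseLocale() {
        return new Locale.Builder().setLanguage("ja").setRegion("JP").build();
    }

    public static void main(String[] args) {
        Locale olocale = japaneseLocale();
        // Assumption: the locale is applied JVM-wide, before Tesseract is initialized.
        Locale.setDefault(olocale);
        System.out.println("Default locale: " + Locale.getDefault().toLanguageTag());
    }
}
```

If this is done at all, it would presumably go before the api.Init(".", "jpn") call in the question's code, so that the locale is already in place when the OCR result is read and printed.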

Now it is working like a charm for me!