OCR text recognition displays wrong text


I am new to the tess-two library. I was able to add the library, load an image from drawable, and run the conversion, but the recognized text is wrong.

Here is my complete code:

Bitmap image;
private TessBaseAPI mTess;
String datapath = "";

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.activity_main);
    //init image
    image = BitmapFactory.decodeResource(getResources(), R.drawable.test_image);

    //initialize Tesseract API
    String language = "eng";
    datapath = getFilesDir()+ "/tesseract/";
    mTess = new TessBaseAPI();

    checkFile(new File(datapath + "tessdata/"));

    mTess.init(datapath, language);
}

private void checkFile(File file) {
    if (!file.exists() && file.mkdirs()) {
        copyFiles();
    }
    if (file.exists()) {
        // datapath already ends with "/", so no extra separator is needed
        String datafilepath = datapath + "tessdata/eng.traineddata";
        File datafile = new File(datafilepath);

        if (!datafile.exists()) {
            copyFiles();
        }
    }
}


public void processImage(View view){
    String OCRresult = null;
    mTess.setImage(image);     
    OCRresult = mTess.getUTF8Text();
    TextView OCRTextView = (TextView) findViewById(R.id.OCRTextView);
    OCRTextView.setText(OCRresult);
}

private void copyFiles() {
    try {
        // datapath already ends with "/", so no extra separator is needed
        String filepath = datapath + "tessdata/eng.traineddata";
        AssetManager assetManager = getAssets();

        InputStream instream = assetManager.open("tessdata/eng.traineddata");
        OutputStream outstream = new FileOutputStream(filepath);

        byte[] buffer = new byte[1024];
        int read;
        while ((read = instream.read(buffer)) != -1) {
            outstream.write(buffer, 0, read);
        }


        outstream.flush();
        outstream.close();
        instream.close();

        File file = new File(filepath);
        if (!file.exists()) {
            throw new FileNotFoundException();
        }
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

I am getting text like:

mmmm.and,mmm,1111 etc

Any help is appreciated.


There are 2 answers

Saranjith On

Possible issues:

  1. The text is being recognized incorrectly by the OCR engine.
  2. Add the keywords to your training data. Follow the tutorial on the Tesseract Tutorial Page.
Genarito On

I had the same issue and fixed it two minutes ago: you have to resize your image to a bigger size. I used the Thumbnailator library to do the job:

BufferedImage bigger = Thumbnails.of(oldImage).size(700, 500).asBufferedImage();
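Note that Thumbnailator produces a java.awt BufferedImage, and java.awt is not part of the Android SDK, so the line above applies to desktop Java. As a rough sketch of the same idea without the extra dependency, the helper below upscales an image using only standard java.awt classes. The class name UpscaleForOcr, the method upscaleToWidth, and the 700-pixel target are illustrative assumptions, not part of tess-two or Thumbnailator.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only.
class UpscaleForOcr {

    /**
     * Returns a copy of src scaled up so that its width equals minWidth,
     * keeping the aspect ratio. Images that are already at least minWidth
     * wide are returned unchanged.
     */
    static BufferedImage upscaleToWidth(BufferedImage src, int minWidth) {
        if (src.getWidth() >= minWidth) {
            return src;
        }
        double scale = (double) minWidth / src.getWidth();
        int newWidth = minWidth;
        int newHeight = (int) Math.round(src.getHeight() * scale);

        BufferedImage dst = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation keeps glyph edges smoother than the default
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, newWidth, newHeight, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // A blank 100x50 image stands in for a small scan (assumed size)
        BufferedImage small = new BufferedImage(100, 50, BufferedImage.TYPE_INT_RGB);
        BufferedImage bigger = upscaleToWidth(small, 700);
        System.out.println(bigger.getWidth() + "x" + bigger.getHeight());
    }
}
```

In an Android app like the one in the question, the corresponding step would probably be Bitmap.createScaledBitmap(image, newWidth, newHeight, true) on the Bitmap before it is passed to mTess.setImage(...). The right target size depends on the image and is something to find by experiment.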

I hope it helps, and I apologise for my awful English.

Note: more information about resizing here