Reading in characters from word search image file with Tesseract

I am trying to use OCR to read the characters of a word search from an image file (JPEG, PNG, etc.) using Tesseract. The wide spacing between characters in a typical word search seems to throw off the recognition, and the output does not match the input.

Input: one row

Output: RSSODIBACYLONCRYECTBOYDOB

I have extended the API example provided by Tesseract a little in an attempt to improve the accuracy, but I am not very familiar with OCR, so it has been difficult. I set the page segmentation mode to tesseract::PSM_SINGLE_CHAR and created a character whitelist, but I am still not getting the correct output.

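#include <cstdio>
#include <cstdlib>

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI api;

    // Init returns non-zero on failure (for example, missing "eng" language data).
    if (api.Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Treat the image as a single character.
    api.SetPageSegMode(tesseract::PSM_SINGLE_CHAR);

    // Restrict recognition to letters only.
    api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");

    // pixRead returns NULL if the file cannot be opened or decoded.
    Pix *image = pixRead("image.jpg");
    if (image == NULL) {
        fprintf(stderr, "Could not read image.jpg.\n");
        api.End();
        exit(1);
    }
    api.SetImage(image);

    // GetUTF8Text returns a buffer that the caller must release with delete[].
    char* outText = api.GetUTF8Text();
    printf("OCR output\n%s", outText ? outText : "");

    api.End();
    delete[] outText;
    pixDestroy(&image);

    return 0;
}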

How can I properly read in the characters from the word search image file? Any help would be appreciated.
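
One direction I am considering, since PSM_SINGLE_CHAR treats whatever it is given as a single character, is to cut the grid into cells and recognize each cell on its own. Below is a rough sketch of the cell arithmetic only. It assumes the puzzle is an evenly spaced grid that fills the whole image and that the number of rows and columns is known in advance. CellRect, gridCells and the margin parameter are names I made up for illustration; they are not part of Tesseract.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One grid cell in pixel coordinates, laid out as (left, top, width, height).
// Hypothetical helper type, not part of Tesseract or Leptonica.
struct CellRect {
    int left;
    int top;
    int width;
    int height;
};

// Split an imageWidth x imageHeight image into rows x cols cells, returned in
// row-major order. Assumes an evenly spaced grid covering the whole image.
// `margin` pixels are trimmed from every side of each cell, which may help
// keep neighbouring letters or grid lines out of the crop.
std::vector<CellRect> gridCells(int imageWidth, int imageHeight,
                                int rows, int cols, int margin = 0)
{
    assert(rows > 0 && cols > 0 && margin >= 0);

    std::vector<CellRect> cells;
    cells.reserve(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));

    for (int r = 0; r < rows; ++r) {
        // Boundaries are derived from the index, so rounding does not accumulate.
        const int top = r * imageHeight / rows;
        const int bottom = (r + 1) * imageHeight / rows;

        for (int c = 0; c < cols; ++c) {
            const int left = c * imageWidth / cols;
            const int right = (c + 1) * imageWidth / cols;

            const int width = right - left - 2 * margin;
            const int height = bottom - top - 2 * margin;
            assert(width > 0 && height > 0);  // margin must be smaller than half a cell

            cells.push_back({left + margin, top + margin, width, height});
        }
    }
    return cells;
}
```

Each rectangle could then be passed to api.SetRectangle(cell.left, cell.top, cell.width, cell.height) after api.SetImage(image), followed by api.GetUTF8Text() for that cell, keeping PSM_SINGLE_CHAR. I have not verified whether this actually improves the accuracy on my image.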
