How do I use the Tesseract API to iterate over words?


I'm trying to learn Python in parallel with the Tesseract API. My end goal is to use the Tesseract API to read a document and do some basic error checking. I've found a few examples that seem to be good places to start, but I'm having trouble understanding the difference between two pieces of code that behave differently even though it seems to me they should be equivalent. Both were modified slightly from https://pypi.python.org/pypi/tesserocr.

The first example produces this output:

$ time ./GetComponentImagesExample2.py|tail -2
symbol MISSISSIPPI,conf: 88.3686599731


real    0m14.227s
user    0m13.534s
sys 0m0.397s

This is accurate and completes in 14 seconds. Reviewing the rest of the output, it is pretty good -- I'm probably a few SetVariable commands away from 99+% accuracy.

$ ./GetComponentImagesExample2.py|wc -l
    1289

Manually reviewing the results, it appears to get all the text.

#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    ri=api.GetIterator()
    level=RIL.WORD
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} word image components.'.format(len(boxes))
    for r in iterate_level(ri, level):
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol,conf).encode('utf-8')

The second example produces this output:

$ time ./GetComponentImagesExample4.py|tail -4
symbol MISSISS IPPI
,conf: 85


real    0m17.524s
user    0m16.600s
sys 0m0.427s

This is less accurate (extra space detected in a word) and slower (takes 17.5 seconds).

$ ./GetComponentImagesExample4.py|wc -l
     223

This is missing a large amount of text, and I don't understand why it skips some of it.

#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS=1000000000
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} textword image components.'.format(len(boxes))
    for i, (im, box, _, _) in enumerate(boxes):
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        if ocrResult:
            print u'symbol {},conf: {}\n'.format(ocrResult,conf).encode('utf-8')
#        print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
#               "confidence: {1}, text: {2}").format(i, conf, ocrResult, **box).encode('utf-8')

My end goal relies on understanding where text is found in the document, so I need the bounding boxes like in the second example. As near as I can tell, iterate_level doesn't expose the coordinates of the found text, so I need GetComponentImages... but the output is not equivalent.

Why do these pieces of code behave differently in speed and accuracy? Can I get GetComponentImages to match GetIterator?

1 Answer

Answer by durjoy:

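import numpy as np
import tesserocr
from PIL import Image

# Same scan as in the question; img is assumed to be the page as a NumPy array.
image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
img = np.array(image)

with tesserocr.PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    api.SetVariable("save_blob_choices","T")
    ri=api.GetIterator()
    level=tesserocr.RIL.WORD
    boxes = api.GetComponentImages(tesserocr.RIL.TEXTLINE, True)
    text_list = []
    print('Found {} textline image components.'.format(len(boxes)))
    i = 0
    for r in tesserocr.iterate_level(ri, level):
        symbol = r.GetUTF8Text(level)
        conf = r.Confidence(level)
        # (left, top, right, bottom) of the current word
        bbox = r.BoundingBoxInternal(level)
        # Crop the word out of the page and save it (../out/ must already exist)
        im = Image.fromarray(img[bbox[1]:bbox[3], bbox[0]:bbox[2]])
        im.save("../out/" + str(i) + ".tif")
        text_list.append(symbol + " " + str(conf) + "\n")
        i += 1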

I think r.BoundingBoxInternal(level) will give you the bounding box of the detected word as a (left, top, right, bottom) tuple, so iterate_level can provide the coordinates as well.
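
To get from that tuple to the x/y/w/h boxes used in the question, something like the following might work. It is only a sketch. `Word`, `ltrb_to_xywh` and `words_with_boxes` are names made up for this example and are not part of tesserocr. It uses `BoundingBox` rather than `BoundingBoxInternal`, on the assumption that you want coordinates in the original image rather than in Tesseract's working image. It is written for Python 3 and has not been run against your scan.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; not part of tesserocr.
Word = namedtuple("Word", ["text", "conf", "box"])


def ltrb_to_xywh(bbox):
    """Convert (left, top, right, bottom) into the x/y/w/h dict layout
    that the question reads from GetComponentImages."""
    left, top, right, bottom = bbox
    return {"x": left, "y": top, "w": right - left, "h": bottom - top}


def words_with_boxes(image_path):
    """Yield a Word for each non-empty word, using one Recognize() pass."""
    # Imported lazily so ltrb_to_xywh can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        ri = api.GetIterator()
        if ri is None:  # nothing was recognized on the page
            return
        for r in iterate_level(ri, RIL.WORD):
            text = r.GetUTF8Text(RIL.WORD)
            bbox = r.BoundingBox(RIL.WORD)
            if text and bbox:
                yield Word(text, r.Confidence(RIL.WORD), ltrb_to_xywh(bbox))
```

Calling `list(words_with_boxes('scan_2_new.tif'))` should then give one record per word with its text, confidence and box, all from a single recognition pass. That avoids calling SetRectangle and GetUTF8Text once per box, which, as far as I understand, re-runs recognition on each rectangle and may be why your second script is slower and gives different text.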