I'm trying to learn Python in parallel with the Tesseract API. My end goal is to use the Tesseract API to read a document and do some basic error checking. I've found a few examples that seem like good places to start, but I'm having trouble understanding the difference between two pieces of code that, while different in behavior, seem to me like they should be equivalent. Both were modified slightly from https://pypi.python.org/pypi/tesserocr.
The first example produces this output:
$ time ./GetComponentImagesExample2.py|tail -2
symbol MISSISSIPPI,conf: 88.3686599731
real 0m14.227s
user 0m13.534s
sys 0m0.397s
This is accurate and completes in 14 seconds. Reviewing the rest of the output, it is quite good; I'm probably a few SetVariable calls away from 99+% accuracy (see the confidence-check sketch after the script below).
$ ./GetComponentImagesExample2.py|wc -l
1289
Manually reviewing the results, it appears to get all the text.
#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000  # lift PIL's decompression-bomb limit for the large scan
from tesserocr import PyTessBaseAPI, RIL, iterate_level

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.SetVariable("save_blob_choices", "T")  # variables must be set before Recognize() to affect the pass
    api.Recognize()
    ri = api.GetIterator()
    level = RIL.WORD
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} word image components.'.format(len(boxes))
    for r in iterate_level(ri, level):
        symbol = r.GetUTF8Text(level)  # text of the current word
        conf = r.Confidence(level)     # word confidence, 0-100
        if symbol:
            print u'symbol {},conf: {}\n'.format(symbol, conf).encode('utf-8')
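For a quick sanity check on overall accuracy before tuning variables, tesserocr also exposes AllWordConfidences(); a minimal sketch, assuming the same image file:

#!/usr/bin/python
from PIL import Image
from tesserocr import PyTessBaseAPI

Image.MAX_IMAGE_PIXELS = 1000000000
image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    confs = api.AllWordConfidences()  # one 0-100 confidence per recognized word
    if confs:
        print 'words: {}, mean confidence: {}'.format(len(confs), sum(confs) / float(len(confs)))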
The second example produces this output:
$ time ./GetComponentImagesExample4.py|tail -4
symbol MISSISS IPPI
,conf: 85
real 0m17.524s
user 0m16.600s
sys 0m0.427s
This is less accurate (extra space detected in a word) and slower (takes 17.5 seconds).
$ ./GetComponentImagesExample4.py|wc -l
223
This output is missing a large amount of text, and I don't understand why so much is dropped (see the count-comparison sketch after the script below).
#!/usr/bin/python
from PIL import Image
Image.MAX_IMAGE_PIXELS = 1000000000  # lift PIL's decompression-bomb limit for the large scan
from tesserocr import PyTessBaseAPI, RIL

image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.SetVariable("save_blob_choices", "T")  # variables must be set before Recognize() to affect the pass
    api.Recognize()
    boxes = api.GetComponentImages(RIL.WORD, True)
    print 'Found {} word image components.'.format(len(boxes))
    for i, (im, box, _, _) in enumerate(boxes):
        # box is a dict of pixel coordinates: x, y, w, h
        api.SetRectangle(box['x'], box['y'], box['w'], box['h'])
        ocrResult = api.GetUTF8Text()
        conf = api.MeanTextConf()
        if ocrResult:
            print u'symbol {},conf: {}\n'.format(ocrResult, conf).encode('utf-8')
        # print (u"Box[{0}]: x={x}, y={y}, w={w}, h={h}, "
        #        u"confidence: {1}, text: {2}").format(i, conf, ocrResult, **box).encode('utf-8')
My end goal relies on knowing where text is found in the document, so I need the bounding boxes, as in the second example. As near as I can tell, iterate_level doesn't expose the coordinates of the found text, so I need GetComponentImages -- but the output is not equivalent.
Why do these pieces of code behave differently in speed and accuracy? Can I get GetComponentImages to match GetIterator?
I think the function r.BoundingBoxInternal(level) will give the bounding box of the detected word.
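If that's right, the coordinates can come straight from the iterator and GetComponentImages wouldn't be needed at all; a minimal sketch, assuming BoundingBox (or BoundingBoxInternal) returns the word's pixel coordinates at RIL.WORD:

#!/usr/bin/python
from PIL import Image
from tesserocr import PyTessBaseAPI, RIL, iterate_level

Image.MAX_IMAGE_PIXELS = 1000000000
image = Image.open('/Users/chrysrobyn/tess-install/tesseract/scan_2_new.tif')
with PyTessBaseAPI() as api:
    api.SetImage(image)
    api.Recognize()
    for r in iterate_level(api.GetIterator(), RIL.WORD):
        symbol = r.GetUTF8Text(RIL.WORD)
        conf = r.Confidence(RIL.WORD)
        box = r.BoundingBox(RIL.WORD)  # (x1, y1, x2, y2), or None if no component
        if symbol and box:
            print u'symbol {}, conf: {}, box: {}'.format(symbol, conf, box).encode('utf-8')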