python tesseract get number of lines without OCR


I am trying to determine the number of lines of text without doing OCR. I want to skip OCR and give the user an error if they have supplied too many lines of text to process: it would take too long, and it is not the kind of input that should be given. Ideally, I would like help doing this in Python, but if there are any C++ examples that do this, I may be able to adapt them.

Here are the API functions I can work with: http://zdenop.github.io/tesseract-doc/group___advanced_a_p_i.html

I can call these functions, but I don't know how to handle BLOCK_LIST, ETEXT_DESC, or Boxa objects in Python other than passing them from one API call to another.

Any help would be greatly appreciated!

Answer by whunterknight (best answer):

This may not be the best way, but it works in just a few seconds. Layout analysis alone gives me the number of symbols found, which tells me when I should skip or cancel OCR because it would run longer than expected. Cancelling assumes the OCR operation runs in its own thread that can be killed. You can also count lines by iterating with RIL_TEXTLINE instead of RIL_SYMBOL, but if the page has multiple columns, you'll get a lot more lines as a result.
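
The counting idea works the same way at any iterator level, so it can be wrapped in a small helper that stops as soon as the limit is passed. This is only a sketch under stated assumptions: `exceeds_limit`, `FakeIterator`, and `LEVEL` are hypothetical names, the stand-in iterator exists only so the code runs without Tesseract installed, and it assumes (as the answer's own loop does) that the iterator starts on the first element and that `Next(level)` returns a true value while it can advance. It also assumes an empty page yields `None` instead of an iterator.

```python
def exceeds_limit(iterator, level, limit):
    """Return True as soon as more than `limit` elements have been seen.

    `iterator` is any object with a Tesseract-style Next(level) method.
    Assumption: it is already positioned on the first element.
    """
    if iterator is None:
        # Assumption: no iterator means an empty page, so nothing to count.
        return False
    count = 1  # the element the iterator starts on
    while count <= limit:
        if not iterator.Next(level):
            return False  # ran out of elements before passing the limit
        count += 1
    return True


class FakeIterator(object):
    """Hypothetical stand-in for a layout iterator over `n` elements (n >= 1)."""

    def __init__(self, n):
        self.remaining = n - 1  # elements after the starting one

    def Next(self, level):
        # `level` is ignored here; a real iterator advances at that level.
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


# Placeholder for a level constant such as tesseract.RIL_TEXTLINE.
LEVEL = "textline"
```

With the bindings used in this answer, the call would presumably look like `exceeds_limit(api.AnalyseLayout(), tesseract.RIL_TEXTLINE, 50)`, where 50 is an arbitrary placeholder limit. Stopping early means a very dense page is rejected without walking the whole layout.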

import tesseract
import cv2
import cv2.cv as cv

api = tesseract.TessBaseAPI()
api.Init('.','eng',tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO_OSD)

# Load image
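img_data = cv2.imread('file.jpg')
# imread returns a (height, width, channels) array for a color image
height1, width1, channel1 = img_data.shape
image = cv.CreateImageHeader((width1, height1), cv.IPL_DEPTH_8U, channel1)
cv.SetData(image, img_data.tostring(), img_data.dtype.itemsize * channel1 * width1)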
tesseract.SetCvImage(image,api)

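# Check number of chars using layout analysis only (no recognition)
chars_iterator = api.AnalyseLayout()
# The iterator starts on the first symbol, so the count starts at 1
num_chars = 1
while chars_iterator.Next(tesseract.RIL_SYMBOL):
    num_chars += 1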

# Stop if there are too many chars
if num_chars > 1000:
    # 'break' is only valid inside a loop, so exit the script instead
    raise SystemExit("Too many chars!")

# Reset api to delete previous layout iterator
api.Clear()
tesseract.SetCvImage(image,api)

# Do real OCR, and put this in its own thread if you want to kill it when it takes too long
result_xml = api.GetHOCRText(1)
print api.GetUTF8Text()
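
Python threads cannot be forcibly killed from outside, so one way to get a killable OCR step is to swap the thread for a child process. The sketch below is not the method used in this answer: it relies on `subprocess.run` from the Python 3.5+ standard library (the script above is Python 2), and `run_with_timeout`, `FAST`, and `SLOW` are hypothetical names, with stand-in commands so it runs without Tesseract installed.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run `command` and return its stdout as text.

    Returns None if the command is still running after `seconds`;
    subprocess.run kills the child process in that case.
    A non-zero exit status raises subprocess.CalledProcessError.
    """
    try:
        completed = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            timeout=seconds,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return completed.stdout.decode("utf-8")


# Stand-in commands: one finishes at once, one sleeps past any short timeout.
FAST = [sys.executable, "-c", "print('done')"]
SLOW = [sys.executable, "-c", "import time; time.sleep(30)"]
```

For real OCR the command might be the `tesseract` command-line tool, for example `['tesseract', 'file.jpg', 'stdout']`; that invocation is an assumption about the installed version and is not part of the original answer.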