How to train on and read particular text from an image/PDF?

The image or PDF may contain:

  1. Printed text,
  2. Handwritten text,
  3. Paragraphs,
  4. Key-value pairs,
  5. Complex tables.

During training, we assign tags/keywords to the document. During testing, we look for each tag and read the value associated with it.
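To make the tag lookup concrete, here is a minimal Python sketch. It assumes that some OCR step has already turned the page into plain text lines, and that a value follows its tag on the same line after a colon or dash. The function name `read_tagged_values` and the sample lines are hypothetical, chosen only for illustration.

```python
import re

def read_tagged_values(lines, tags):
    """Map each known tag to the text that follows it on the same line.

    Assumption: OCR output is a list of text lines, and a value is separated
    from its tag by ':' or '-'. Real documents may need looser matching.
    """
    found = {}
    for line in lines:
        for tag in tags:
            pattern = rf"\s*{re.escape(tag)}\s*[:\-]\s*(.+)"
            match = re.match(pattern, line, flags=re.IGNORECASE)
            if match and tag not in found:
                found[tag] = match.group(1).strip()
    return found

# Hypothetical OCR output for a small invoice.
ocr_lines = [
    "Invoice No: 12345",
    "Date - 2020-01-15",
    "Thank you for your business",
]
result = read_tagged_values(ocr_lines, ["Invoice No", "Date"])
print(result)
```

Tags that do not occur in the text are simply absent from the returned dictionary, so the caller can tell a missing field from an empty one.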

There is 1 answer

Egor Richman

You need three steps:

  1. First, write a basic object-detection algorithm for the image. It must crop the image into ROIs (regions of interest) and then classify each ROI as one of the element types in your content list. For this part you can use heuristic rules to extract ROI features (a table, for example, often has a rectangular boundary) and then feed those features to a lightweight classifier such as a decision tree.

  2. Next, write an algorithm that reads the data structure implied by each ROI type. For a table, for example, you need to find all of its cells in the image. Then locate each word or number within that structure and split it into individual symbols.

  3. Once that is done, classify each symbol with a text-image classifier. For this step you can use, for example, a multilayer perceptron, a naive Bayes classifier, or another type of classifier commonly used for image recognition.
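As a rough illustration of steps 1 and 2, here is a dependency-free Python sketch. It finds ROIs as connected foreground regions of a tiny binary image and labels each one with a single hand-written rule: a region whose bounding box has a complete rectangular border is called a table. That rule only stands in for a trained decision tree, and the names `find_rois`, `has_rect_border`, `classify_roi`, and `PAGE` are hypothetical. On real scans you would more likely use a library routine for this.

```python
from collections import deque

def find_rois(img):
    """Return bounding boxes (top, left, bottom, right) of 4-connected regions."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def has_rect_border(img, box):
    """Heuristic feature: every pixel on the bounding-box outline is foreground."""
    top, left, bottom, right = box
    if bottom - top < 2 or right - left < 2:
        return False
    rows_ok = all(img[top][x] and img[bottom][x] for x in range(left, right + 1))
    cols_ok = all(img[y][left] and img[y][right] for y in range(top, bottom + 1))
    return rows_ok and cols_ok

def classify_roi(img, box):
    """One-rule stand-in for a decision tree: bordered regions are tables."""
    return "table" if has_rect_border(img, box) else "text"

# Toy page: a 3x5 bordered grid (a "table") above a short stroke of "text".
PAGE = [
    "..........",
    ".#####....",
    ".#.#.#....",
    ".#####....",
    "..........",
    ".###......",
    "..........",
]
img = [[ch == "#" for ch in row] for row in PAGE]
rois = find_rois(img)
labels = [classify_roi(img, box) for box in rois]
print(list(zip(rois, labels)))
```

The same flood fill can be reused inside a table ROI on the inverted image to locate its cells, and inside a text ROI to split a word into symbols, provided the symbols do not touch each other.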

In practice, you could try the OpenCV library, which already provides almost all of the algorithms you need.

For a better understanding of the third step, you could look at my captcha-recognition project, which is based on OpenCV's artificial neural network functionality.
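The third step can also be sketched without any library. The following is a minimal Bernoulli naive Bayes classifier over flattened binary glyphs. It is not the captcha project mentioned above, and the 3x3 glyphs and the names `train_naive_bayes` and `classify` are made up for illustration. Real symbols would be larger and would need many training samples per class.

```python
import math

def train_naive_bayes(samples):
    """samples: list of (label, pixels), where pixels is a flat list of 0/1.

    Returns {label: (log_prior, p_on)}, with p_on[i] the Laplace-smoothed
    probability that pixel i is set for that label.
    """
    by_label = {}
    for label, pixels in samples:
        by_label.setdefault(label, []).append(pixels)
    total = len(samples)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        p_on = [(sum(col) + 1) / (n + 2) for col in zip(*rows)]
        model[label] = (math.log(n / total), p_on)
    return model

def classify(model, pixels):
    """Return the label with the highest log posterior for the given pixels."""
    def score(label):
        log_prior, p_on = model[label]
        return log_prior + sum(
            math.log(p if px else 1.0 - p) for px, p in zip(pixels, p_on)
        )
    return max(model, key=score)

# Hypothetical 3x3 glyphs, flattened row by row.
GLYPHS = {
    "I": [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "L": [1, 0, 0, 1, 0, 0, 1, 1, 1],
    "T": [1, 1, 1, 0, 1, 0, 0, 1, 0],
}
model = train_naive_bayes(list(GLYPHS.items()))

# An "L" with its bottom-right pixel missing, as segmentation noise might cause.
noisy_l = [1, 0, 0, 1, 0, 0, 1, 1, 0]
predicted = classify(model, noisy_l)
print(predicted)
```

With one training sample per class, the classifier effectively picks the glyph with the fewest differing pixels. Adding more samples per class lets the per-pixel probabilities reflect real variation in how each symbol is drawn.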