Is there public data for OCR-based character distance?


I am looking for "character visual similarity" weighted data (not an algorithm) to plug into a weighted-Damerau-Levenshtein algorithm.

The Problem

Currently, I am using Google's Vision AI (a paid OCR service) to convert an image into text. I then want to look for the presence of a phrase, for example, The Old Man and the Sea. If the OCR results instead contain The Old Man and the Sca (a mis-read by the OCR), I can use a basic Damerau-Levenshtein algorithm to find that there is a substring match with distance 1 and length 23. Success!
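
For reference, here is a minimal Python sketch of this kind of fuzzy substring search (plain unit costs; the Damerau transposition case is omitted for brevity, and the function name is my own):

def fuzzy_substring_distance(needle: str, haystack: str) -> int:
    """Minimum edit distance between needle and any substring of haystack."""
    # Standard Levenshtein DP, except row 0 is all zeros so a match may
    # begin anywhere in the haystack, and we take the minimum over the
    # final row so it may end anywhere.
    prev = [0] * (len(haystack) + 1)
    for i, n_ch in enumerate(needle, 1):
        curr = [i] + [0] * len(haystack)
        for j, h_ch in enumerate(haystack, 1):
            curr[j] = min(prev[j] + 1,                   # delete from needle
                          curr[j - 1] + 1,               # insert into needle
                          prev[j - 1] + (n_ch != h_ch))  # substitute
        prev = curr
    return min(prev)

print(fuzzy_substring_distance("The Old Man and the Sea",
                               "It contains The Old Man and the Sca."))  # 1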

The problem I run into is when I search for a (contrived) example like Disney's Tangled, but the image contains the phrase Walt Disney's mangled vision. This is a false positive, since it is not an OCR mis-classification, yet it still returns a very convincing substring match with distance 1 and length 16. By my own judgment, c and e are visually similar, but T and m are not.

What I've tried

I originally attempted to solve some contrived examples with a basic Damerau-Levenshtein distance, then with regular expressions such as /The Old Man and the S[ce]a/. I quickly realized this would devolve into patterns like /[5S][eo]cti[oe]n [1lI\|][1lI\|]3[B8]/ to match Section 113B. I have no machine learning experience, but my research led me to the accepted answers for these questions: How to determine character similarity? and OCR and character similarity. Though they were insufficient for my needs, they inspired me to start working on a naive chart of generic character attributes to crunch through to find similarities:

Char | Left Profile | Right Profile | Top Profile  | Bottom Profile | Height
a    | low indent   | low flat      | curve        | curve          | half
b    | flat         | low curve     | curve, point | curve          | full
c    | low curve    | low indent    | curve        | curve          | half
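
To make the idea concrete, here is a sketch of how such a chart could be crunched; the encoding mirrors the rows above, but the scoring rule (the fraction of disagreeing attributes) is just one plausible choice of my own:

# Each tuple: (left, right, top, bottom, height), taken from the chart above.
PROFILES = {
    "a": ("low indent", "low flat",   "curve",        "curve", "half"),
    "b": ("flat",       "low curve",  "curve, point", "curve", "full"),
    "c": ("low curve",  "low indent", "curve",        "curve", "half"),
}

def substitution_weight(ch1: str, ch2: str) -> float:
    # Characters whose attributes mostly agree get a cheaper substitution.
    p1, p2 = PROFILES[ch1], PROFILES[ch2]
    disagreements = sum(a != b for a, b in zip(p1, p2))
    return disagreements / len(p1)  # 0.0 = identical profiles, 1.0 = nothing shared

print(substitution_weight("a", "c"))  # 0.4: only the side profiles differ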

Before going further down this rabbit hole, I wanted to ask whether the data I am after already exists publicly (as a paid service or free).

The Goal

My goal is to obtain a somewhat comprehensive weights dictionary. For example: c can be substituted with e for an arbitrary weight of 0.3, instead of the standard substitution cost of 1.0, because c and e are visually similar enough that an OCR engine may mistake one for the other. Likewise, X can be substituted with K for an arbitrary weight of 0.4. This might result in a JSON dictionary like so:

{
  "A": {
    "4": 0.3,
    "R": 0.6
    // ...
  },
  "B": {
    "8": 0.4,
    "3": 0.8,
    "R": 0.7
    // ...
  }
  // ...
}
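
As a sketch of how such a dictionary could plug into the algorithm, here is a weighted Levenshtein in Python whose substitution step consults the weights (transpositions again omitted; the WEIGHTS excerpt and the 1.0 fallback are illustrative):

# Excerpt of the hypothetical weights dictionary above.
WEIGHTS = {"A": {"4": 0.3, "R": 0.6}, "B": {"8": 0.4, "3": 0.8, "R": 0.7}}

def weighted_distance(s1: str, s2: str, weights=WEIGHTS) -> float:
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, 1):
        curr = [float(i)]
        for j, c2 in enumerate(s2, 1):
            if c1 == c2:
                sub = 0.0
            else:
                # Look the pair up in both directions, falling back to 1.0.
                sub = min(weights.get(c1, {}).get(c2, 1.0),
                          weights.get(c2, {}).get(c1, 1.0))
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + sub))  # weighted substitution
        prev = curr
    return prev[-1]

print(weighted_distance("AB", "4B"))  # 0.3: A -> 4 is a cheap visual confusion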

Accepted Answers

Will include one or more of:

  1. Links to publicly available "visual similarity data" where this has already been calculated.
  2. Links to pre-trained models whose data can be crunched into something like the above JSON object (along with general information of how to approach this).
  3. Your example of how you solved this or a similar problem, and the output you came up with.
  4. Recommendations for additional character attributes to look for.
  5. Recommendations for paid services which provide something like the above JSON object.

1 Answer

Answered by joan capell:

I've recently created exactly that: https://github.com/zas97/ocr_weighted_levenshtein. You can find the distances in params_weighted_leven.json in the GitHub repo.

What I did was generate tons of synthetic text data and then run three different OCRs on it. I then counted how often each confusion occurs; for example, the confusion between 0 and O happens much more often than the confusion between O and P. The more often a confusion happens, the smaller the Levenshtein distance between those two characters.
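
As a rough sketch of that count-to-weight conversion (the exact formula used in the linked repo may differ; the tallies and the normalization below are my own illustration):

from collections import Counter

# Hypothetical tallies from aligning OCR output against the synthetic
# ground truth: ("O", "0") means the true character "O" was read as "0".
confusions = Counter({("O", "0"): 950, ("c", "e"): 640, ("O", "P"): 12})

def counts_to_weights(confusions):
    # More frequent confusions get substitution costs closer to 0;
    # rare ones stay near the standard cost of 1.0.
    most_common = max(confusions.values())
    weights = {}
    for (truth, misread), count in confusions.items():
        weights.setdefault(truth, {})[misread] = 1.0 - count / most_common
    return weights

print(counts_to_weights(confusions)["O"]["0"])  # 0.0: the most common confusion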