I am looking for "character visual similarity" weighted data (not an algorithm) to plug into a weighted-Damerau-Levenshtein algorithm.
The Problem
Currently, I am using Google's Vision AI (a paid OCR service) to perform the OCR conversion of an image into text. I then want to look for the presence of a phrase. For example, `The Old Man and the Sea`. If the OCR results instead contain `The Old Man and the Sca` (mis-read by the OCR), I can then use a basic Damerau-Levenshtein algorithm to find that there is a substring with distance 1 and length 23. Success!
But the problem I run into is when I search for a (contrived) example like `Disney's Tangled`, but the image contains the phrase `Walt Disney's mangled vision`. This is a false positive, as it is not an OCR mis-classification, yet it still returns a very convincing substring with distance 1 and length 16. By my own judgment, I reason that `c` and `e` are visually similar, but `T` and `m` are not.
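For concreteness, here is a naive sketch (illustrative only; the function names are mine, and this is not my actual code) of the matching described above, showing that both the genuine mis-read and the false positive come back with distance 1:

```python
def dl_distance(a: str, b: str) -> int:
    """Optimal-string-alignment Damerau-Levenshtein distance between two strings."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def best_substring_distance(needle: str, haystack: str) -> int:
    """Naive fixed-width sliding window: best distance of any needle-length window."""
    width = len(needle)
    return min(dl_distance(needle, haystack[i:i + width])
               for i in range(len(haystack) - width + 1))


# True positive: the OCR mis-read "Sea" as "Sca".
print(best_substring_distance("The Old Man and the Sea",
                              "...The Old Man and the Sca..."))  # 1

# False positive: a completely different word, but the same distance.
print(best_substring_distance("Disney's Tangled",
                              "Walt Disney's mangled vision"))   # 1
```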
What I've tried
I had originally attempted to solve some contrived examples with a basic Damerau-Levenshtein distance, and then attempted some regular expressions. For example, `/The Old Man and the S[ce]a/`. I quickly realized this would devolve into patterns like `/[5S][eo]cti[oe]n [1lI\|][1lI\|]3[B8]/` to match `Section 113B`.
. I have no Machine Learning experience, but my research led me to the accepted answers for these questions:
How to determine character similarity? and
OCR and character similarity. Though they were insufficient for my needs, they inspired me to start working on a naive chart of generic character attributes to crunch through to find similarities:
Char | Left Profile | Right Profile | Top Profile | Bottom Profile | Height |
---|---|---|---|---|---|
a | low indent | low flat | curve | curve | half |
b | flat | low curve | curve, point | curve | full |
c | low curve | low indent | curve | curve | half |
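For illustration, this is roughly how I pictured crunching that chart; the attribute labels are my own naive guesses, not measured data from any OCR engine:

```python
# Naive attribute profiles, transcribed from the chart above.
PROFILES = {
    "a": {"left": "low indent", "right": "low flat",   "top": "curve",        "bottom": "curve", "height": "half"},
    "b": {"left": "flat",       "right": "low curve",  "top": "curve, point", "bottom": "curve", "height": "full"},
    "c": {"left": "low curve",  "right": "low indent", "top": "curve",        "bottom": "curve", "height": "half"},
}


def attribute_similarity(x: str, y: str) -> float:
    """Fraction of profile attributes two characters share (0.0 to 1.0)."""
    px, py = PROFILES[x], PROFILES[y]
    return sum(1 for key in px if px[key] == py[key]) / len(px)


print(attribute_similarity("a", "c"))  # 0.6 -- shares top, bottom and height
print(attribute_similarity("a", "b"))  # 0.2 -- shares only the bottom profile
```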
Before I went further down this rabbit hole, I wanted to ask if my desired goal already exists publicly (paid service or free).
The Goal
My goal is to obtain a somewhat comprehensive weights dictionary. For example: `c` can be substituted with `e` for an arbitrary weight of 0.3, instead of the standard substitution cost of 1.0. This is because `c` and `e` are visually similar enough that an OCR engine may mistake one for the other. Likewise, `X` can be substituted with `K` for an arbitrary weight of 0.4. This might result in a JSON dictionary like so:
```json
{
  "A": {
    "4": 0.3,
    "R": 0.6
    // ...
  },
  "B": {
    "8": 0.4,
    "3": 0.8,
    "R": 0.7
    // ...
  }
  // ...
}
```
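To show how I intend to consume such a dictionary, here is a sketch of the substitution-cost lookup I have in mind. The `A`/`B` entries mirror the sample JSON above, and the `c`/`e` and `X`/`K` entries use the arbitrary weights mentioned earlier; the lookup function itself is only illustrative:

```python
# Sketch of the lookup I would plug into the substitution step of a weighted
# Damerau-Levenshtein. A real dictionary would be far more comprehensive.
WEIGHTS = {
    "A": {"4": 0.3, "R": 0.6},
    "B": {"8": 0.4, "3": 0.8, "R": 0.7},
    "c": {"e": 0.3},
    "X": {"K": 0.4},
}


def substitution_cost(x: str, y: str) -> float:
    """Reduced cost for visually similar pairs, full cost (1.0) otherwise."""
    if x == y:
        return 0.0
    # Check both directions so the table doesn't have to store each pair twice.
    return min(WEIGHTS.get(x, {}).get(y, 1.0),
               WEIGHTS.get(y, {}).get(x, 1.0))


print(substitution_cost("c", "e"))  # 0.3 -- plausible OCR confusion
print(substitution_cost("T", "m"))  # 1.0 -- not visually similar
```

Dropping that lookup into the substitution step of the earlier sketch would score the `Sea`/`Sca` mis-read at 0.3, while the `Tangled`/`mangled` false positive stays at a full 1.0.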
Accepted Answers
Will include one or more of:
- Links to publicly available "visual similarity data" where this has already been calculated.
- Links to pre-trained models whose data can be crunched into something like the above JSON object (along with general information on how to approach this).
- Your example of how you solved this or a similar problem, and the output you came up with.
- Recommendations for additional character attributes to look for.
- Recommendations for paid services which provide something like the above JSON object.
I've recently created exactly that: https://github.com/zas97/ocr_weighted_levenshtein. You can find the distances inside params_weighted_leven.json in the GitHub repo.
What I've done is generate tons of synthetic text data and then run 3 different OCRs on that data. Then I counted how often each confusion occurs; for example, the confusion between `0` and `O` happens much more often than the confusion between `O` and `P`. The more often a confusion occurs, the smaller the Levenshtein distance between those two characters.
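The repo contains the actual code, but the counting step looks roughly like this (a simplified sketch; the alignment and normalisation I really use may differ):

```python
from collections import Counter


def confusion_weights(aligned_pairs, floor=0.1):
    """aligned_pairs: iterable of (truth_char, ocr_char) from already-aligned OCR output."""
    confusions = Counter((t, o) for t, o in aligned_pairs if t != o)
    if not confusions:
        return {}
    most_common = confusions.most_common(1)[0][1]
    weights = {}
    for (t, o), count in confusions.items():
        # Scale linearly: the most frequent confusion gets the smallest cost.
        cost = max(floor, 1.0 - count / most_common * (1.0 - floor))
        weights.setdefault(t, {})[o] = round(cost, 2)
    return weights


# Toy data: "0"/"O" is confused far more often than "O"/"P".
pairs = [("0", "O")] * 50 + [("O", "0")] * 40 + [("O", "P")] * 3 + [("a", "a")] * 500
print(confusion_weights(pairs))
# {'0': {'O': 0.1}, 'O': {'0': 0.28, 'P': 0.95}}
```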