How to detect a "typo" for a certain phrase or regex?

5.1k views Asked by At

How can I detect a typo, but only for the specific phrase. Another way to think about it would be how can I detect a typo for a certain regex.

For example, I do not want a generic typo finder, I found multiple resources on that. I do not want a generic spell checker, again I found multiple resources on that.

How would I write a typo checker for a relatively constant value...say:

Super Secret 13-12345

It should always say "Super Secret NN-NNNNN" (N means any 0-9 number).

It would flag the following as "typos":

  1. Ssuper Secret 13-12345
  2. Super Secret 1312345
  3. Sper Scret 13-123456
  4. Spuer Secret 13-12345
  5. Super Secret
  6. 13-12345

It would NOT flag the following as "typos":

  1. Super Secret 13-12345
  2. Any other random words
  3. Superman flies over the jungle

I am most worried about extra characters leaking in, transposing characters, or numbers not following the NN-NNNNN format.

I feel like this is an answerable question, but I may just not be asking Google or SO using the correct words.

I am writing it in .NET, but could obviously port anything.

2

There are 2 answers

0
RoadieRich On

This isn't a good place for a regex: you would need a regex that detects every possible type of typo. Instead, you should be looking at the Levenshtein distance. It would work something like:

  1. replace all invalid characters with a placeholder, e.g. "!".
  2. replace all numbers with a different placeholder, e.g. "#".
  3. Calculate Levenshtein distance from "Super Secret ##-#####".
  4. If distance is below a certain value, and isn't 0, return true. Otherwise, return false.

Once you have it implemented, play with the threshold in step 4 to match the desired behaviour.

Edit: "Invalid character" can either mean any character other than those in "Superct0123456789-", or it can mean any non-alphanumeric other than "-". The end result should be the same.

0
Malik A. Rumi On

Why not search for your number pattern, and have a lookaround for Super Secret? If it isn't there, you can capture whatever is there and see if it is the spelling error you are looking for? Then you can add a simple replace - or re.sub() - to put the correct spelling in? Now you have to be careful, and build the regex up slowly. There's this thing about lookarounds that aren't fixed length, but I forget now if it's look ahead or look behind that have that issue. There are workarounds if you run into this issue. Make separate capture groups for your number - strict and specific - and another for your phrase - much more flexible, like with ? as a quantifier and character sets [sS] for known possible misspellings.