Using Python to extract the specific edit when Damerau-Levenshtein distance equals 1

440 views Asked by At

I have a large Pandas dataframe containing data entered at a keyboard. One of the columns in the dataframe represents UK postcode data. Inevitably, with large datasets, there are a number of typing errors. I'm using the pyxDamerauLevenshtein library to calculate the edit distance between unrecognised postcodes and an array containing all possible postcodes and then presenting postcodes that are only a single edit away from entered data (DL distance = 1) to the user as possible alternatives. This works pretty well and I'm reasonably happy with the speed. However, a single edit in postcode terms means there may be 50-60 alternatives. I'd like to be able to order the alternatives based on the type of edit identified. So, for an example, G substituting for F (adjacent on a QWERTY keyboard) will probably be more likely than L substituting for F. Also, an insertion of the same letter twice would be more likely than the insertion of an adjacent letter which, in turn, would be more likely than the insertion of a completely different letter from the opposite end of the keyboard. The order that alternative postcodes are presented should reflect these probabilities.

An answer by marmeladze at Edit distance such as Levenshtein taking into account proximity on keyboard suggested using the Euclidean distance between keyboard keys; this seems like a reasonable idea. However, my question is, how can I efficiently extract the specific edit involved between 2 strings when the Damerau-Levenshtein distance equals one?

As an example, if I have a postcode ZE2 9YM (which does not exist), the code should identify all other postcodes that are just one edit away but should also indicate the nature of the edit, maybe something like:

Entered code    Possible alternative    DL dist       Edit type    Edit
     ZE2 9YM                 ZE2 9YA          1    Substitution     A-M
     ZE2 9YM                 ZE2 9YN          1    Substitution     N-M
         ...

And, in the above case, it would be more likely that the M was substituted for the N (adjacent keys) rather than the M being substituted for the A.

Is anyone aware of a Python library that will calculate the Damerau-Levenshtein distance AND will output the matrix (together with summary of edits)?

0

There are 0 answers