The Damerau-Levenshtein distance tells you the number of additions, deletions, substitutions and transpositions between two words (the latter is what differentiates DL from Levenshtein distance).
The algo is on wikipedia and relatively straightforward. However I want more than just the distance; I want the actual operations.
Eg a function that takes AABBCC
, compares it to ABZ
, and returns:
- Remove A at index 0 ->
ABBCC
- Remove B at index 2 ->
ABCC
- Remove C at index 4 ->
ABC
- Substitute C at index 5 for Z ->
ABZ
(ignore how the indices are affected by removals for now)
It seems you can do something with the matrix produced by the DL calculation. This site produces the output above. The text below says you should walk from the bottom right of the matrix, following each lowest cost operation in each cell (follow the bold cells):
- If Delete is lowest cost, go up one cell
- For Insert, go left one cell
- Otherwise for Substitute, Transpose or Equality go up and left
It seems to prioritise equality or substitution over anything else if there's a tie, so in the example I provided, when the bottom-right cell is 4 for both substitution and removal, it picks substitution.
However once it reaches the top left cell, equality is the lowest scoring operation, with 0. But it has picked deletion, with score 2.
This seems to be the right answer, because if you strictly pick the lowest score, you end up with too many As at the start of the string.
But what's the real steps for picking an operation, if not lowest score? Are there other ways to pick operations out of a DL matrix, and if so do you have a reference?
I missed a vital part of fuzzy-string's explanation of how to reconstruct the operations:
...which explains why the equality operation in cell [1,1] is ignored and the delete is used instead!