Designing an algorithm to calculate the edit distance between two strings

1k views Asked by At

Please consider the following question:

The edit distance of two strings s and t is the minimum number of single character operations (insert, delete, substitution) needed to convert s into t. Let m and n be the length of strings s and t.

Design an O(nm) time and O(nm) space algorithm to calculate the edit distance between s and t.

My thoughts:

Isn't it easier to just compare two strings one character at a time:

L = maximum(length(s), length(t))
for i in L:
   if i > length(s):
      distance += length(t) - i
      break
   if i > length(t):
      distance += length(s) - i
      break
   if s[i] != t[i]:
      distance += 1

If I am wrong, then am I supposed to use the edit distance algorithm table? Is so, how do I design an O(nm) time and O(nm) space algorithm?

2

There are 2 answers

0
James the Great On BEST ANSWER

Thank you @Roberto Attias for his answer, but the following is the complete algorithm I am looking for:

L1 = length(string1)
L2 = length(string2)
for i in L1:
    table[i][0] = i
for i in L2:
    table[0][i] = i
for i in L1:
    for j in L2:
        m = minimum(table[i-1][j],table[i][j-1])+1
        if s[i] == t[j]: subvalue = 1
        else: subvalue = 0
        table[i][j] = minimum(m, table[i-1][j-1] + subvalue)
return table[L1][L2]

The above algorithm follows the strategy of an edit distance algorithm table

3
Roberto Attias On

Consider the strings abcd and bcd. They differ for one deletion, but your approach would count them as distance 4.

What you want to do is find the Longest Common Subsequence. This is a well known problem and you can google up a lot of code examples about it, with one solution being in fact O (NM).

For example, for strings abcdqef and xybcdzzzef the LCS is bcdqef. consider the subsequence in the two strings:

a-bcd-q-ef
xy-bcd-zzz-ef

You can transform a into xy with one modification and one insertion, and q into zzz with one modification and two insertion. If you think about it, the number of operations required (i.e. distance) is the number of characters in the longest string not belonging to the LCS.