Explanation of normalized edit distance formula

2.7k views Asked by At

Based on this paper: IEEE TRANSACTIONS ON PAITERN ANALYSIS : Computation of Normalized Edit Distance and Applications In this paper Normalized Edit Distance as followed:

Given two strings X and Y over a finite alphabet, the normalized edit distance between X and Y, d( X , Y ) is defined as the minimum of W( P ) / L ( P )w, here P is an editing path between X and Y , W ( P ) is the sum of the weights of the elementary edit operations of P, and L(P) is the number of these operations (length of P).

enter image description here

Can i safely translate the normalized edit distance algorithm explained above as this:

normalized edit distance = 
levenshtein(query 1, query 2)/max(length(query 1), length(query 2))
2

There are 2 answers

5
zw324 On

You are probably misunderstanding the metric. There are two issues:

  1. The normalization step is to divide W(P) which is the weight of the edit procedure over L(P), which is the length of the edit procedure, not over the max length of the strings as you did;

  2. Also, the paper showed that (Example 3.1) normalized edit distance cannot be simply computed with levenshtein distance. You probably need to implement their algorithm.

An explanation of Example 3.1 (c):

From aaab to abbb, the paper used the following transformations:

  1. match a with a;
  2. skip a in the first string;
  3. skip a in the first string;
  4. skip b in the second string;
  5. skip b in the second string;
  6. match the final bs.

These are 6 operations which is why L(P) is 6; from the matrix in (a), matching has cost 0, skipping has cost 2, thus we have total cost of 0 + 2 + 2 + 2 + 2 + 0 = 8, which is exactly W(P), and W(P) / L(P) = 1.33. Similar results can be obtained for (b), which I'll left to you as exercise :-)

0
Rudi Cilibrasi On

The 3 in figure 2(a) refers to the cost of changing "a" to "b" or the cost of changing "b" to "a". The columns with lambdas in figure 2(a) mean that it costs 2 in order to insert or delete either an "a" or a "b".

In figure 2(b), W(P) = 6 because the algorithm does the following steps:

  1. keep first a (cost 0)
  2. convert first b to a (cost 3)
  3. convert second b to a (cost 3)
  4. keep last b (cost 0)

The sum of the costs of the steps is W(P). The number of steps is 4 which is L(P).

In figure 2(c), the steps are different:

  1. keep first a (cost 0)
  2. delete first b (cost 2)
  3. delete second b (cost 2)
  4. insert a (cost 2)
  5. insert a (cost 2)
  6. keep last b (cost 0)

In this path there are six steps so the L(P) is 6. The sum of the costs of the steps is 8 so W(P) is 8. Therefore the normalized edit distance is 8/6 = 4/3 which is about 1.33.