Asymmetric Levenshtein distance

2.3k views Asked by At

Given two bit strings, x and y, with x longer than y, I'd like to compute a kind of asymmetric variant of the Levensthein distance between them. Starting with x, I'd like to know the minimum number of deletions and substitutions it takes to turn x into y.

Can I just use the usual Levensthein distance for this, or do I need I need to modify the algorithm somehow? In other words, with the usual set of edits of deletion, substitution, and addition, is it ever beneficial to delete more than the difference in lengths between the two strings and then add some bits back? I suspect the answer is no, but I'm not sure. If I'm wrong, and I do need to modify the definition of Levenshtein distance to disallow deletions, how do I do so?

Finally, I would expect intuitively that I'd get the same distance if I started with y (the shorter string) and only allowed additions and substitutions. Is this right? I've got a sense for what these answers are, I just can't prove them.

1

There are 1 answers

3
hatchet - done with SOverflow On BEST ANSWER

If i understand you correctly, I think the answer is yes, the Levenshtein edit distance could be different than an algorithm that only allows deletions and substitutions to the larger string. Because of this, you would need to modify, or create a different algorithm to get your limited version.

Consider the two strings "ABCD" and "ACDEF". The Levenshtein distance is 3 (ABCD->ACD->ACDE->ACDEF). If we start with the longer string, and limit ourselves to deletions and substitutions we must use 4 edits (1 deletion and 3 substitutions. The reason is that strings where deletions are applied to the smaller string to efficiently get to the larger string can't be achieved when starting with the longer string, because it does not have the complimentary insertion operation (since you're disallowing that).

Your last paragraph is true. If the path from shorter to longer uses only insertions and substitutions, then any allowed path can simply be reversed from the longer to the shorter. Substitutions are the same regardless of direction, but the inserts when going from small to large become deletions when reversed.

I haven't tested this thoroughly, but this modification shows the direction I would take, and appears to work with the values I've tested with it. It's written in c#, and follows the psuedo code in the wikipedia entry for Levenshtein distance. There are obvious optimizations that can be made, but I refrained from doing that so it was more obvious what changes I've made from the standard algorithm. An important observation is that (using your constraints) if the strings are the same length, then substitution is the only operation allowed.

    static int LevenshteinDistance(string s, string t) {
        int i, j;
        int m = s.Length;
        int n = t.Length;

        // for all i and j, d[i,j] will hold the Levenshtein distance between
        // the first i characters of s and the first j characters of t;
        // note that d has (m+1)*(n+1) values
        var d = new int[m + 1, n + 1];

        // set each element to zero
        // c# creates array already initialized to zero

        // source prefixes can be transformed into empty string by
        // dropping all characters
        for (i = 0; i <= m; i++) d[i, 0] = i;

        // target prefixes can be reached from empty source prefix
        // by inserting every character
        for (j = 0; j <= n; j++) d[0, j] = j;

        for (j = 1; j <= n; j++) {
            for (i = 1; i <= m; i++) {
                if (s[i - 1] == t[j - 1])
                    d[i, j] = d[i - 1, j - 1];       // no operation required
                else {
                    int del = d[i - 1, j] + 1;   // a deletion
                    int ins = d[i, j - 1] + 1;   // an insertion
                    int sub = d[i - 1, j - 1] + 1; // a substitution
                    // the next two lines are the modification I've made
                    //int insDel = (i < j) ? ins : del;
                    //d[i, j] = (i == j) ? sub : Math.Min(insDel, sub);
                    // the following 8 lines are a clearer version of the above 2 lines 
                    if (i == j) {
                        d[i, j] = sub;
                    } else {
                        int insDel;
                        if (i < j) insDel = ins; else insDel = del;
                        // assign the smaller of insDel or sub
                        d[i, j] = Math.Min(insDel, sub);
                    }
                }
            }
        }
        return d[m, n];
    }