How to choose the proper maximum value for Damerau-Levenshtein distance?

1.7k views Asked by At

I am using the Damerau-Levenshtein code available from here in my similarity measurements. The problem is that when I apply the Damerau-Levenshtein on two strings such as cat sat on a mat and dog sat mat, I am getting edit distance as 8. This similarity results can get any number regarding insertion, deletion or substitution like any range from 0, 1, 2, ... . Now I am wondering if there is any way that we can assume or find a maximum of this distance (similarity) and converted between 0 and 1 or how can we set the max value that at least I can say: distance =1 - similarity.
The reason for this post is that I am setting a threshold for a few distance metrics like cosine, Levenstein and damerau levenstein and outputs of all should be betweeb zero and 1.

2

There are 2 answers

0
DSBLR On

Levenshtein Distance score = number of insertion + number of deletion + number of substitution.

So the maximum value is 3 X(multiplied) the maximum length string in your data set.

0
pjmaracs On

The difficult thing is that the upper bound of Damerau-Levenshtein is infinite (given infinitely long words), but we can't practically make infinite strings.

If you wanted to be safe, you can use something that maps the range 0-> max length of a string onto the range 0->1. The max length of a string depends on the amount of memory you have (assuming 64 bit), so I'd recommend doing...not this. Source

Practically, you can also just check all of the strings you are about to compare and choose the length of the longest string in that list as the max value. Another solution is to compute all of the scores beforehand and apply the conversion factor after you know the max score. Some code that could do that:

def adjustScore(lists, maxNum):
    scaleFactor = 1/maxNum
    return [x * scaleFactor for x in lists]

testWords = ["test1", "testing2", "you", "must", "construct", "additional", "plyometrics"]
testScores = []
for i in range(len(testWords)-1):
    testScores.append(damerau_levenshtein_distance(testWords[i], testWords[i+1]))

#method 1: just check the biggest score you got to obtain the max
max1 = max(testScores)
result = adjustScore(testScores, max1)

#method 2: if you need the adjusted score first, pick the longest string's length as max
lens = map(len, testWords)
max2 = max(lens)
result2 = adjustScore(testScores, max2)

These happen to give identical answers because most of the words are very different from each other but either of these approaches should work for most cases.
Long story short, the max distance between two strings is the length of the longer string.

Notes: if this maps in the wrong direction (i.e. high scores are showing low and vice versa, just add "1-" between the open bracket and x in adjustscore)
Also, if you want it to map do a different range, replace the 1 with a different max value.