I have a for loop that takes a user's input and one of the keys in my dictionary and passes them to a Damerau-Levenshtein function and based on the distance, overwrites the user's input with the dictionary key (The for loop is to cycle through each dictionary key). This works fine enough for strings larger than three characters, but if the string is three or fewer characters the algorithm returns with the wrong key. Here's the for loop:
1950 For j = 0 To dict.Count - 1
1960 distance = DamerauLevenshtein(SplitStr(i), dict.Keys(j))
1970 'MsgBox dict.Keys(j) & vbCrLf & distance ' used for debugging
1980 If distance < 4 Then
1990 If distance < leastDist Then
2000 leastDist = distance
2010 SplitStr(i) = dict.Keys(j)
2020 End If
2030 End If
2040 Next
2050 MsgBox "The distance is: " & leastDist & vbCrLf & "The entered text was " & tempStr & vbCrLf & "The replaced word is " & SplitStr(i)
SplitStr(i) holds the user's input, which comes from a split function. I arbitrarily picked 4 for a good distance
I stole the algorithm from a bytes.com forum post. Algorithm below:
Function DamerauLevenshtein(str1, str2, Optional intSize = 256)
Dim intTotalLen, arrDistance, intLen1, intLen2, i, j, arrStr1, arrStr2, arrDA, intMini
Dim intDB, intI1, intJ1, intD
str1 = UCase(str1)
str2 = UCase(str2)
intLen1 = Len(str1)
intLen2 = Len(str2)
intTotalLen = intLen1 + intLen2
ReDim arrStr1(intLen1)
ReDim arrStr2(intLen2)
ReDim arrDA(intSize)
ReDim arrDistance(intLen1 + 2, intLen2 + 2)
arrDistance(0, 0) = intTotalLen
For i = 0 To intSize - 1
arrDA(i) = 0
Next
For i = 0 To intLen1
arrDistance(i + 1, 1) = i
arrDistance(i + 1, 0) = intTotalLen
Next
For i = 1 To intLen1
arrStr1(i - 1) = Asc(Mid(str1, i, 1))
Next
For j = 0 To intLen2
arrDistance(1, j + 1) = j
arrDistance(0, j + 1) = intTotalLen
Next
For j = 1 To intLen2
arrStr2(j - 1) = Asc(Mid(str2, j, 1))
Next
For i = 1 To intLen1
intDB = 0
For j = 1 To intLen2
intI1 = arrDA(arrStr2(j - 1))
intJ1 = intDB
If arrStr1(i - 1) = arrStr2(j - 1) Then
intD = 0
Else
intD = 1
End If
If intD = 0 Then intDB = j
intMini = arrDistance(i, j) + intD
If intMini > arrDistance(i + 1, j) + 1 Then intMini = arrDistance(i + 1, j) + 1
If intMini > arrDistance(i, j + 1) + 1 Then intMini = arrDistance(i, j + 1) + 1
If intMini > arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1 Then intMini = arrDistance(intI1, intJ1) + i - intI1 + j - intJ1 - 1
arrDistance(i + 1, j + 1) = intMini
Next
arrDA(arrStr1(i - 1)) = i
Next
DamerauLevenshtein = arrDistance(intLen1 + 1, intLen2 + 1)
End Function
If I type in "Cire" the algorithm correctly returns "CORE".
"Raman" returns "REMAN" "Cosnigned" returns "CONSIGNED
However, "Now" should return "New" but returns "OCM".
"New" also returns "OCM" (so distance should be 0, but is 2.)
"FP" should be "FP" but returns "OCM", distance is 2
"DPF" Should be "DPF" but returns "OCM", distance is 2
I just learned about the algorithm, so I'm sure I'm missing something important, but I just can't see it. Thoughts?
I figured it out. After much searching I found a post saying that an edit distance is commonly 2. (They didn't specify any merits on why 2 is common)
I switched my if statement to 2 from 4 and now all of the problem terms are being corrected as they should be.