I have implemented a Levenshtein distance algorithm using a trie tree, as described here by Steve Hanov. However, I'm having difficulty handling special characters. For instance, if I calculate the distance between Großmann and Grossmann, I need the distance to be zero, since ß and ss should be considered equal.
What would be the best solution (if any) to support these special cases?
My initial thought was to normalize all strings before calculating the distance, e.g. Großmann -> Grossmann, österreich -> oesterreich, and so on. However, there seems to be no such functionality in .NET?
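To make the normalization idea concrete, here is a minimal sketch in Python (chosen for brevity; the same logic ports to C# with a Dictionary<char, string>). The mapping table GERMAN_MAP and the helper names normalize and levenshtein are hypothetical, and the table only covers a few German characters as an assumption about what needs mapping. The distance function is a plain dynamic-programming version, not the trie-based one.

```python
# Hypothetical German-oriented mapping table; its contents are an assumption.
GERMAN_MAP = str.maketrans({
    "ß": "ss",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def normalize(word: str) -> str:
    """Replace each mapped character with its ASCII spelling."""
    return word.translate(GERMAN_MAP)


def levenshtein(a: str, b: str) -> int:
    """Plain row-by-row Levenshtein distance (not the trie variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


raw = levenshtein("Großmann", "Grossmann")
mapped = levenshtein(normalize("Großmann"), normalize("Grossmann"))
print(raw, mapped)
```

Without the mapping the two spellings differ by one substitution plus one insertion; after mapping they are identical. A single fixed table like this only works if all words follow the same language's conventions.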
The challenge is that the current culture does not identify the language of the individual words.
Assume you are willing to err on the side of a match.
Identify a set of characters that never need to be mapped.
Identify a set of mappings that apply to all cultures.
Identify mappings for specific cultures.
First do an unmapped Levenshtein distance.
If the unmapped distance is zero then stop.
If the unmapped distance is greater than x (e.g. 4) then stop as it is not a match.
If the word only has characters that never need to be mapped (e.g. a-z) then stop.
Apply the all-culture mapping to both words and if the distance is zero stop.
Apply the default culture's mapping and if the distance is zero stop.
Apply each of the other cultures' mappings in turn and if the distance is zero stop.
I also added a straight string.Compare check to the Levenshtein routine so that identical strings report 0 right away.
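The steps above could be sketched roughly as follows, in Python for brevity (it ports to C# directly). Everything named here is hypothetical: the tables ALL_CULTURES and CULTURE_MAPS, the threshold MAX_UNMAPPED, and the function mapped_distance. The table contents are illustrative assumptions rather than real culture data, the "never mapped" test is assumed to apply to both words, and returning the smallest distance seen when nothing matches is also an assumption. The distance function is a plain dynamic-programming version rather than the trie-based one.

```python
import re

# Illustrative tables only; real mappings per culture are an assumption.
NEVER_MAPPED = re.compile(r"[A-Za-z]*")        # characters never mapped
ALL_CULTURES = str.maketrans({"ß": "ss"})      # mapping used everywhere
CULTURE_MAPS = {
    "de": str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"}),
    "sv": str.maketrans({"ä": "a", "ö": "o", "å": "a"}),
}
MAX_UNMAPPED = 4                               # the "x" cutoff


def levenshtein(a: str, b: str) -> int:
    """Plain row-by-row Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def mapped_distance(a: str, b: str, default_culture: str = "de") -> int:
    if a == b:                                 # straight string compare
        return 0
    best = levenshtein(a, b)                   # unmapped distance
    if best == 0 or best > MAX_UNMAPPED:
        return best                            # exact match or clearly no match
    if NEVER_MAPPED.fullmatch(a) and NEVER_MAPPED.fullmatch(b):
        return best                            # nothing to map
    a, b = a.translate(ALL_CULTURES), b.translate(ALL_CULTURES)
    d = levenshtein(a, b)
    if d == 0:
        return 0
    best = min(best, d)
    # Default culture first, then every other culture.
    order = [default_culture] + [c for c in CULTURE_MAPS if c != default_culture]
    for culture in order:
        table = CULTURE_MAPS[culture]
        d = levenshtein(a.translate(table), b.translate(table))
        if d == 0:
            return 0
        best = min(best, d)
    return best


print(mapped_distance("Großmann", "Grossmann"))
print(mapped_distance("österreich", "osterreich"))
```

The second call only reaches zero through the non-default table, which is what erring on the side of a match looks like in practice: a word is accepted if any culture's mapping makes the pair identical.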