I have some 40000 words and want to find all similar pairs. For similarity, I use a sort of Damerau–Levenshtein distance scaled by the word lengths. For simplicity, I don't consider overlapping edits (just like the linked algorithm). All words (most of them German, French or English) were converted to lowercase, as case carries no information in our data. I made two modifications to the distance computation:
- the distance between two characters is
  - 0, when they're the same
  - 0.2, when they differ just in the accent (like `a` vs `ä` or `à`)
  - 0.2, when they're `s` and `ß` (the German sharp s)
  - 1, otherwise
- additionally, the distance of the strings `ß` and `ss` is set to 0.2

Our data shows that this complication is necessary.
For finding all similar pairs, my idea was to only consider pairs found via a common n-gram, but this fails for short words (which is acceptable) and, more fundamentally, because the above modifications mean that similar words need not share any exact n-gram at all.
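For illustration, a minimal sketch of such an n-gram index (the class `NGramIndex` and its methods are hypothetical names of mine): every word is registered under each of its n-grams, and only words sharing at least one n-gram become candidate pairs, which are then verified with the real distance.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NGramIndex {
    private final Map<String, List<String>> index = new HashMap<>();
    private final int n;

    NGramIndex(int n) { this.n = n; }

    void add(String word) {
        for (int i = 0; i + n <= word.length(); ++i) {
            index.computeIfAbsent(word.substring(i, i + n), k -> new ArrayList<>()).add(word);
        }
    }

    // Words sharing at least one n-gram with the query; still to be verified with the real distance.
    Set<String> candidates(String word) {
        final Set<String> result = new HashSet<>();
        for (int i = 0; i + n <= word.length(); ++i) {
            result.addAll(index.getOrDefault(word.substring(i, i + n), List.of()));
        }
        return result;
    }
}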
My next idea was an early abort of the distance computation if it's known that the result is over a threshold (say 4). However, as many words have common prefixes, this abort comes too late to be a nice speed-up. Reversing the words fails (not as badly) on common suffixes.
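One mitigation I can think of (a sketch only, under the assumption that trimming exactly-equal characters never changes the result of this metric, which should be verified): strip the common prefix, and analogously the common suffix, before running the DP, backing off so that an `s` or `ß` at the boundary stays visible to the `ß` <-> `ss` rule.

// Hypothetical helper: length of a common prefix that the DP may skip.
// Backs off so an 's' or 'ß' right before the cut stays available to the
// "ß" <-> "ss" rule of the distance computation.
static int safeCommonPrefix(String s0, String s1) {
    final int max = Math.min(s0.length(), s1.length());
    int i = 0;
    while (i < max && s0.charAt(i) == s1.charAt(i)) ++i;
    while (i > 0 && (s0.charAt(i - 1) == 's' || s0.charAt(i - 1) == 'ß')) --i;
    return i;
}

With p = safeCommonPrefix(s0, s1), the DP would then run on s0.substring(p) and s1.substring(p), so the early abort fires after a few rows instead of after the whole shared prefix.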
The whole computation for all 20e6 pairs takes some five minutes, so for now, it's possible to do it once and store all the results (only distances below a threshold are needed). But I'm looking for something more future-proof.
Is it possible to quickly compute a good lower bound on the Damerau–Levenshtein distance (ideally allowing an early exit)?
Is it possible with the above modifications? Note that, e.g., the usual bound "at least the difference of the lengths of the two strings" doesn't hold, because of the second modification.
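That said, I believe a length-based bound that does survive the second modification can be built like this (a sketch; it reuses the SCALE and DIACRITICS constants from the code below): each unit of length difference either costs a full insertion/deletion (SCALE), or is absorbed by a `ß` <-> `ss` substitution (DIACRITICS), and each such substitution consumes one `ß` of the shorter string.

// Sketch of a cheap lower bound on the modified distance. A length difference
// of diff needs diff insertions/deletions (SCALE each), except that each 'ß'
// in the shorter string can absorb one unit via "ß" <-> "ss" (DIACRITICS each).
static int lengthLowerBound(String s0, String s1) {
    final String shorter = s0.length() <= s1.length() ? s0 : s1;
    final int diff = Math.abs(s0.length() - s1.length());
    int eszetts = 0;
    for (int i = 0; i < shorter.length(); ++i) {
        if (shorter.charAt(i) == 'ß') ++eszetts;
    }
    final int cheap = Math.min(diff, eszetts);
    return cheap * DIACRITICS + (diff - cheap) * SCALE;
}

Pairs with lengthLowerBound(s0, s1) >= limit could then be skipped without touching the DP at all.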
The code
import java.text.Normalizer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class Levenshtein {
    private static final class CharDistance {
        private static int distance(char c0, char c1) {
            if (c0 == c1) return 0;
            // Fast path: two distinct ASCII characters always have the full distance.
            if ((c0|c1) < 0x80) return SCALE;
            return c0<=c1 ? distanceInternal(c0, c1) : distanceInternal(c1, c0);
        }

        private static int distanceInternal(char c0, char c1) {
            assert c0 <= c1;
            final String key = c0 + " " + c1;
            {
                final Integer result = CACHE.get(key);
                if (result != null) return result.intValue();
            }
            final int result = distanceUncached(c0, c1);
            CACHE.put(key, Integer.valueOf(result));
            return result;
        }

        private static int distanceUncached(char c0, char c1) {
            // Strip diacritics: decompose (NFD) and drop the combining marks.
            final String norm0 = Normalizer.normalize("" + c0, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
            final String norm1 = Normalizer.normalize("" + c1, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
            if (norm0.equals(norm1)) return DIACRITICS;
            assert c0 <= c1;
            if (c0=='s' && c1=='ß') return DIACRITICS;
            return SCALE;
        }

        private static final Map<String, Integer> CACHE = new ConcurrentHashMap<>();
    }

    /**
     * Return the scaled distance between {@code s0} and {@code s1}, if it's below {@code limit}.
     * Otherwise, return some lower bound ({@code >= limit}).
     */
    int distance(String s0, String s1, int limit) {
        final int len0 = s0.length();
        final int len1 = s1.length();
        int result = SCALE * (len0 + len1);
        final int[] array = new int[len0 * len1];
        for (int i0=0; i0<len0; ++i0) {
            final char c0 = s0.charAt(i0);
            for (int i1=0; i1<len1; ++i1) {
                final char c1 = s1.charAt(i1);
                final int d = CharDistance.distance(c0, c1);
                // Append c0 and c1 respectively.
                result = get(len0, array, i0-1, i1-1) + d;
                // Append c0.
                result = Math.min(result, get(len0, array, i0-1, i1) + SCALE);
                // Append c1.
                result = Math.min(result, get(len0, array, i0, i1-1) + SCALE);
                // Handle the "ß" <-> "ss" substitution.
                if (c0=='ß' && c1=='s' && i1>0 && s1.charAt(i1-1)=='s') result = Math.min(result, get(len0, array, i0-1, i1-2) + DIACRITICS);
                if (c1=='ß' && c0=='s' && i0>0 && s0.charAt(i0-1)=='s') result = Math.min(result, get(len0, array, i0-2, i1-1) + DIACRITICS);
                // Handle a transposition.
                if (i0>0 && i1>0 && s0.charAt(i0-1)==c1 && s1.charAt(i1-1)==c0) result = Math.min(result, get(len0, array, i0-2, i1-2) + SCALE);
                set(len0, array, i0, i1, result);
            }
            // Early exit.
            {
                final int j = i0 - len0 + len1;
                final int lowerBound = get(len0, array, i0, j);
                if (lowerBound >= limit) return lowerBound;
            }
        }
        return result;
    }

    // Simulate reading from a 2D array at indexes i0 and i1.
    private int get(int stride, int[] array, int i0, int i1) {
        if (i0<0 || i1<0) return SCALE * (i0+i1+2);
        return array[i1*stride + i0];
    }

    // Simulate writing to a 2D array at indexes i0 and i1.
    private void set(int stride, int[] array, int i0, int i1, int value) {
        array[i1*stride + i0] = value;
    }

    private static final int SCALE = 10;
    private static final int DIACRITICS = 2;
}
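A few illustrative calls (my examples; the words are mostly taken from the list below, and the expected values follow from SCALE = 10 and DIACRITICS = 2):

final Levenshtein lev = new Levenshtein();
lev.distance("rotweincuvee", "rotweincuvée", 100); // 2: "e" vs "é" differ only in the accent
lev.distance("roßberg", "rossberg", 100);          // 2: the "ß" <-> "ss" rule
lev.distance("roug", "rouge", 100);                // 10: one plain insertion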
Example words
rotwein
rotweincuv
rotweincuvee
rotweincuveé
rotweincuvée
rotweincuvúe
rotweindekanter
rotweinessig
rotweinfass
rotweinglas
rotweinkelch
rotweißkomposition
rotwild
roug
rouge
rougeaoc
rougeaop
rougeots
rougers
rouges
rougeáaop
rough
roughstock
rouladen
roulette
roumier
roumieu
round
rounded
roundhouse
rounds
rouret
rouss
roussanne
rousseau
roussi
roussillion
roussillon
route
rouvinez
rove
roveglia
rovere
roveri
rovertondo
rovo
rowan
rowein
roxburgh
roxx
roy
roya
royal
royalbl
royalblau
royaldanishnavy
royale
royales
royaline
royals
royer
royere
roze
rozenberg
rozes
rozier
rozès
rozés
roßberg
roßdorfer
roßerer
rpa
rr
rrvb
rry
rs
rsaftschorle
rsbaron
rscastillo
rsgut
rsl
rstenapfel
rstenberg
rstenbrõu
rt
rtd
rtebecker
rtebeker
ru
ruadh
ruanda
rub
rubaiyat
ruban
rubata
rubblez
rubenkov
rubeno
rubentino
ruber
I would suggest putting all of the words into a trie, and then recursively searching the trie against itself.
Generating the trie should be fast, and matching of common prefixes against each other is then calculated only once, no matter how many words share them.
You do have to keep track of a lot of state as you wander the trie, because your state is, "All of the intermediate stuff we could be in the middle of calculating."
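To make that concrete, here is a sketch of the simpler, one-sided variant (plain Levenshtein only, without the question's transpositions and accent/ß rules; the class Trie and its methods are my names): each word is searched against the trie, and the DP rows are reused along trie paths, so a shared prefix is processed once for all words containing it. The trie-against-trie version applies the same row sharing on both sides.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class Trie {
    private final Map<Character, Trie> children = new TreeMap<>();
    private String word; // non-null iff a word ends at this node

    void add(String w) {
        Trie node = this;
        for (int i = 0; i < w.length(); ++i) {
            node = node.children.computeIfAbsent(w.charAt(i), c -> new Trie());
        }
        node.word = w;
    }

    // All stored words within limit plain Levenshtein edits of query.
    List<String> search(String query, int limit) {
        final List<String> results = new ArrayList<>();
        final int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j < firstRow.length; ++j) firstRow[j] = j;
        for (Map.Entry<Character, Trie> e : children.entrySet()) {
            e.getValue().search(query, e.getKey(), firstRow, limit, results);
        }
        return results;
    }

    private void search(String query, char c, int[] prevRow, int limit, List<String> results) {
        // One DP row per trie node; prevRow belongs to the parent node, so all
        // words sharing this prefix reuse the rows computed so far.
        final int[] row = new int[query.length() + 1];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < row.length; ++j) {
            final int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (word != null && row[row.length - 1] <= limit) results.add(word);
        if (rowMin <= limit) { // prune: a row's minimum never decreases further down
            for (Map.Entry<Character, Trie> e : children.entrySet()) {
                e.getValue().search(query, e.getKey(), row, limit, results);
            }
        }
    }
}

The pruning on rowMin is what carries the speed-up: once every cell of a row exceeds the limit, no word below that node can come back under it, and the whole subtree is skipped.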