Damerau-Levenshtein distance between two vectors

805 views Asked by At

The Damerau-Levenshtein distance between the two strings "abc" and "acb" would be 1, because it involves one transposition between "b" and "c".

> stringdist("abc", "acb", method = "dl")
[1] 1

Now suppose that I have the following two character vectors:

A = c("apple", "banana", "citrus")
B = c("apple", "citrus", "banana")

How can I calculate the Damerau-Levenshtein distance between A and B so that the result is identical to the distance between "abc" and "acb", since there is one transposition between "citrus" and "banana"? In other words, how can I calculate the Damerau-Levenshtein distance between A and B so that each item is counted as one character within a string?

2

There are 2 answers

0
Oliver On

What about

vecdist <- function(x, y){
  matches <- match(x, y, nomatch = 0)
  nomatch <- matches == 0
  # No match = we need 1 permutation
  # Other matches: Compare index, for each "not inverted" index, (not 3 vs -3) we need 1 permutation
  perm_match <- (matches - seq_along(matches))[!nomatch]
  perm_n <- sum(perm_match != 0) - sum(duplicated(abs(perm_match)))
  sum(nomatch) + perm_n + sum(!y %in% x)
}

The basic idea here is:

  1. Check for missing matches in x vs y and visa-versa. Each of these are 1 permutation
  2. For the remainders we need to check the match indices. Here I use a small trick, by checking whether any fields has to be switched "with each other" using duplicated(abs(...)). Eg abcd, badc is 2 permutations while abcd, bdca is 3.

That is very similar to how stringdist works for single strings.

A = c("apple", "banana", "citrus")
B = c("apple", "citrus", "banana")
vecdist(A, B)
[1] 1
A <- c(A, 'pear')
vecdist(A, B)
[1] 2
vecdist(B, A)
[1] 2
A <- c('apple', 'banana', 'citrus', 'pear')
B <- c('pear', 'citrus', 'banana', 'apple')
vecdist(A, B)
[1] 2
vecdist(B, A)
[1] 2
A <- c('apple', 'banana', 'citrus', 'pear')
B <- c('pear', 'citrus', 'apple', 'banana')
vecdist(A, B)
[1] 3
vecdist(B, A)
[1] 3
0
Ian On
library(stringdist)
library(tidyr)

A = c("apple", "banana", "citrus")
B = c("apple", "citrus", "banana")

a <- factor(A, levels = union(A,B)) %>% 
  as.numeric() %>% 
  sapply(function(i) letters[i]
         %>% paste0(collapse = "")
  ) %>%
  paste0(collapse = "")

b <- factor(B, levels = union(A,B)) %>% 
  as.numeric() %>% 
  sapply(function(i) letters[i]
             %>% paste0(collapse = "")
         ) %>%
  paste0(collapse = "")

stringdist(a, b, method = "dl")