I have a list of character vectors representing words split into phonemes:

> head(words)
[[1]]
[1] "UU"

[[2]]
[1] "EY" "Z" 

[[3]]
[1] "T"  "R"  "IH" "P"  "UU" "L"  "EY"

[[4]]
[1] "AA" "B"  "ER" "G" 

[[5]]
[1] "AA" "K"  "UU" "N" 

[[6]]
[1] "AA" "K"  "ER"

For each word in the list, I would like to count the words that differ from it by exactly one phoneme (one phoneme added, deleted, or substituted) while keeping the remaining phonemes in the same positions. For example, for the word "EY" "Z" acceptable cases would be:

[1] "M"  "EY" "Z" 

[1] "AY" "Z"

[1] "EY" "D" 

[1] "EY" "Z" "AH"

But the following cases should be rejected:

[1] "EY" "D"  "Z"

[1] "Z" "EY" "D"

[1] "HH" "EY"
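
In code terms, the rule I am after could be written as a pairwise check like this (just to illustrate the criterion; the helper is only for clarity, and applying it to all pairs is exactly what turns out to be too slow):

# Illustration of the criterion: TRUE if w2 is w1 with one phoneme substituted
# in place, or with one phoneme added/removed at the very beginning or end.
is_one_phoneme_neighbour <- function(w1, w2) {
  n1 <- length(w1); n2 <- length(w2)
  if (n1 == n2) return(sum(w1 != w2) == 1)            # one substitution
  if (abs(n1 - n2) != 1) return(FALSE)
  s <- if (n1 < n2) w1 else w2                        # shorter word
  l <- if (n1 < n2) w2 else w1                        # longer word
  identical(s, l[-1]) || identical(s, l[-length(l)])  # edge addition/deletion
}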

Basically, I would like to find differences of exactly one element, respecting the positions of the phonemes in the vectors. At the moment, the best solution I have found is:

diffs <- c()
for (i in seq_along(words)) {
  diffs <- c(diffs, sum(sapply(words, function(y) {
    count <- 0
    # pair the current word with the candidate and record both lengths
    elements <- list(words[[i]], y)
    len <- c(length(words[[i]]), length(y))
    if (identical(elements[which(len == max(len))][[1]][-1], elements[which(len == min(len))][[1]])) {
      # the shorter word equals the longer one without its first phoneme:
      # one phoneme added/removed at the beginning
      count + 1
    } else {
      # pad the shorter word with "$$" up to the longer length and
      # count position-wise mismatches (substitutions or an edge difference)
      length(elements[which(len == min(len))][[1]]) <- length(elements[which(len == max(len))][[1]])
      elements <- rapply(elements, f = function(x) ifelse(is.na(x), "$$", x), how = "replace")
      count + sum(elements[[1]] != elements[[2]])
    }
  }) == 1))
}

However, this solution is taking ages because my list words has 120,000 elements (words/vectors), so I would like to ask whether you know of other solutions to speed up the process.

Thank you very much in advance for your answers.

There are 3 answers

Christoph Wolk (accepted answer):

And a different answer, using regular Levenshtein distance (i.e. allowing insertions at any point), but this time FAST: 1000 words in 15 seconds.

The trick is using one of the fast Levenshtein implementations available in R packages; in this case I'm using stringdist, but any should work. The issue is that they operate on strings of single characters, not on multi-character phoneme representations. There is a trivial solution for that, though: since there are more available characters than distinct phonemes, we can simply translate each phoneme into a single character. The resulting strings are unreadable as phonemic transcriptions, but they work perfectly well as input to the neighborhood density algorithm.

library(stringdist)

phonemes <- unique(unlist(words))

# add a few buffer characters
targets <- c(letters, LETTERS, 0:9, "!", "§", "%", "&", "/", "=", 
             "#")[1:length(phonemes)]

ptmap <- targets
names(ptmap) <- phonemes

# encode each word as a single string, one character per phoneme
wordsT <- sapply(words, function(i) paste0(ptmap[i], collapse=""))

wordlengths <- nchar(wordsT)

onediffs.M <- function(x) {
  lengthdiff <- abs(wordlengths - nchar(x))
  # same length: Hamming distance of 1 means exactly one substitution;
  # length differs by 1: Levenshtein distance of 1 means one insertion/deletion
  sum(stringdist(x, wordsT[lengthdiff == 0], method="hamming") == 1) +
    sum(stringdist(x, wordsT[lengthdiff == 1], method="lv") == 1)
}
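
For completeness, one way to apply the function to the whole encoded list could be (a sketch, assuming the objects defined above):

# Sketch: neighborhood count for every word; each word has distance 0 to its
# own encoding, so it is never picked up by the "== 1" sums above.
neighbor_counts <- sapply(wordsT, onediffs.M)
head(neighbor_counts)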

Christoph Wolk:

Here's a version using Levenshtein Distance with the Wagner-Fischer algorithm.

vecLeven <- function(s, t) {
  d <- matrix(0, nrow = length(s) + 1, ncol=length(t) + 1)
  d[, 1] <- (1:nrow(d)) - 1
  d[1,] <- (1:ncol(d))-1
  for (i in 1:length(s))  {
    for (j in 1:length(t)) {
      d[i+1, j+1] <- min(
        d[i, j+1] + 1, # deletion
        d[i+1, j] + 1, # insertion
        d[i, j] + if (s[i] == t[j]) 0 else 1 # substitution
      )
    }
  }

  d[nrow(d), ncol(d)]
}
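
As a quick sanity check, the function can be called directly on two phoneme vectors from the question:

vecLeven(c("EY", "Z"), c("M", "EY", "Z"))  # 1: one phoneme added at the start
vecLeven(c("EY", "Z"), c("Z", "EY", "D"))  # 2: more than one edit, not a neighbor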


onediff <- sapply(words[1:10], function(x) {
  lengthdiff <- sapply(words, function(word) abs(length(word) - length(x)))
  # same length: exactly one position differs (substitution);
  # length differs by one: Levenshtein distance of 1 (one insertion/deletion)
  sum(sapply(words[lengthdiff == 0], function(word) sum(word != x) == 1)) +
    sum(mapply(vecLeven, list(x), words[lengthdiff == 1]) == 1)
})

I tested both versions on the CMU dictionary, which has a similar size. It's a bit faster than your version (about 30 seconds instead of 50 for 10 words), and should parallelize well. Still, running it on the complete data set would take several days.

One large performance factor is that all pairs are computed twice, once for the first word and once for the second; doing a lookup instead would halve that. However, there are more than 7 billion pairs, so you would need a database to store them.
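
Just to illustrate the "each pair only once" idea (not practical in pure R at this scale, since it still visits every pair, but it avoids the separate storage by crediting both words immediately):

# Illustration only: visit each unordered pair once and credit both words.
neighbor_counts <- integer(length(words))
for (i in seq_len(length(words) - 1)) {
  for (j in (i + 1):length(words)) {
    if (abs(length(words[[i]]) - length(words[[j]])) > 1) next
    if (vecLeven(words[[i]], words[[j]]) == 1) {
      neighbor_counts[i] <- neighbor_counts[i] + 1
      neighbor_counts[j] <- neighbor_counts[j] + 1
    }
  }
}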

F. Privé:

So, the key here is to separate the words by length, so that each case (substitution/addition/deletion) is tested only against the relevant subset.

get_one_diff <- function(words) {

  # le[i] is the length of word i; group the words by length
  K <- max(le <- lengths(words))
  i_chr <- as.character(seq_len(K))
  words.spl <- split(words, le)

  # substitution: compare with words of the same length, exactly one position differs
  test_substitution <- function(i) {
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le[i]]]], function(word2) {
      sum(word1 != word2) == 1
    }))
  }

  # addition: compare with words that are one phoneme longer
  test_addition <- function(i) {
    if ((le <- le[i]) == K) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le + 1]]], function(word2) {
      isOneDiff(word1, word2)
    }))
  }

  # deletion: compare with words that are one phoneme shorter
  test_deletion <- function(i) {
    if ((le <- le[i]) == 1) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le - 1]]], function(word2) {
      isOneDiff(word2, word1)
    }))
  }

  sapply(seq_along(words), function(i) {
    test_substitution(i) + test_addition(i) + test_deletion(i)
  })
}

where isOneDiff is an Rcpp function:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool isOneDiff(const StringVector& w1,
               const StringVector& w2) {

  // w1 has n phonemes, w2 has n + 1: check whether w2 is w1 with one
  // phoneme inserted somewhere
  int i, n = w1.size();

  // skip the common prefix
  for (i = 0; i < n; i++) if (w1[i] != w2[i]) break;
  // after the insertion point, w2 must match w1 shifted by one
  for (     ; i < n; i++) if (w1[i] != w2[i+1]) return false;

  return true;
}
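
Assuming the C++ code is saved in its own file (the file name below is just an example), it can be compiled with Rcpp and the whole thing run like this:

library(Rcpp)
sourceCpp("isOneDiff.cpp")  # compiles the file and exports isOneDiff() into R

diffs <- get_one_diff(words)
head(diffs)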

This is 20 times as fast as your version, and since it is merely an sapply, it could easily be parallelized.
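
For example, the final sapply inside get_one_diff could be swapped for a parallel apply; a minimal sketch (assuming a Unix-alike system and 4 cores):

library(parallel)

# Inside get_one_diff, replace the closing sapply(seq_along(words), ...) with
# something like this (mclapply forks, so use parLapply on Windows instead):
unlist(mclapply(seq_along(words), function(i) {
  test_substitution(i) + test_addition(i) + test_deletion(i)
}, mc.cores = 4))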