I have a list of character vectors representing words split into phonemes:

> head(words)
[[1]]
[1] "UU"

[[2]]
[1] "EY" "Z" 

[[3]]
[1] "T"  "R"  "IH" "P"  "UU" "L"  "EY"

[[4]]
[1] "AA" "B"  "ER" "G" 

[[5]]
[1] "AA" "K"  "UU" "N" 

[[6]]
[1] "AA" "K"  "ER"

For each word in the list, I would like to count the words that differ from it by exactly one phoneme (one phoneme added, deleted, or substituted) while keeping the remaining phonemes in the same positions. For example, for the word "EY" "Z" acceptable cases would be:

[1] "M"  "EY" "Z" 

[1] "AY" "Z"

[1] "EY" "D" 

[1] "EY" "Z" "AH"

But the following cases should be rejected:

[1] "EY" "D"  "Z"

[1] "Z" "EY" "D"

[1] "HH" "EY"
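
In code terms, the rule I am after could be written as a pairwise check like this (just to illustrate the criterion; the helper is only for clarity, and applying it to all pairs is exactly what turns out to be too slow):

# Illustration of the criterion: TRUE if w2 is w1 with one phoneme substituted
# in place, or with one phoneme added/removed at the very beginning or end.
is_one_phoneme_neighbour <- function(w1, w2) {
  n1 <- length(w1); n2 <- length(w2)
  if (n1 == n2) return(sum(w1 != w2) == 1)            # one substitution
  if (abs(n1 - n2) != 1) return(FALSE)
  s <- if (n1 < n2) w1 else w2                        # shorter word
  l <- if (n1 < n2) w2 else w1                        # longer word
  identical(s, l[-1]) || identical(s, l[-length(l)])  # edge addition/deletion
}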

Basically, I would like to find differences of exactly one element, respecting the positions of the phonemes in the vectors. At the moment, the best solution I have found is:

diffs <- c()
for (i in seq_along(words)) {
  diffs <- c(diffs, sum(sapply(words, function(y) {
    count <- 0
    # pair the current word with the candidate and record both lengths
    elements <- list(words[[i]], y)
    len <- c(length(words[[i]]), length(y))
    if (identical(elements[which(len == max(len))][[1]][-1], elements[which(len == min(len))][[1]])) {
      # the shorter word equals the longer one without its first phoneme:
      # one phoneme added/removed at the beginning
      count + 1
    } else {
      # pad the shorter word with "$$" up to the longer length and
      # count position-wise mismatches (substitutions or an edge difference)
      length(elements[which(len == min(len))][[1]]) <- length(elements[which(len == max(len))][[1]])
      elements <- rapply(elements, f = function(x) ifelse(is.na(x), "$$", x), how = "replace")
      count + sum(elements[[1]] != elements[[2]])
    }
  }) == 1))
}

However, this solution is taking ages because my list words has 120,000 elements (words/vectors), so I would like to ask whether you know of other solutions to speed up the process.

Thank you very much in advance for your answers.

There are 3 answers

Christoph Wolk (accepted answer):

And a different answer, using regular Levenshtein distance (i.e. allowing insertions at any point), but this time FAST: 1000 words in 15 seconds.

The trick is using one of the fast Levenshtein implementations available in R packages; in this case I'm using stringdist, but any should work. The issue is that they operate on strings of single characters, not on multi-character phoneme representations. There is a trivial solution for that, though: since there are more available characters than distinct phonemes, we can simply translate each phoneme into a single character. The resulting strings are unreadable as phonemic transcriptions, but they work perfectly well as input to the neighborhood density algorithm.

library(stringdist)

phonemes <- unique(unlist(words))

# add a few buffer characters
targets <- c(letters, LETTERS, 0:9, "!", "§", "%", "&", "/", "=", 
             "#")[1:length(phonemes)]

ptmap <- targets
names(ptmap) <- phonemes

# encode each word as a single string, one character per phoneme
wordsT <- sapply(words, function(i) paste0(ptmap[i], collapse=""))

wordlengths <- nchar(wordsT)

onediffs.M <- function(x) {
  lengthdiff <- abs(wordlengths - nchar(x))
  # same length: Hamming distance of 1 means exactly one substitution;
  # length differs by 1: Levenshtein distance of 1 means one insertion/deletion
  sum(stringdist(x, wordsT[lengthdiff == 0], method="hamming") == 1) +
    sum(stringdist(x, wordsT[lengthdiff == 1], method="lv") == 1)
}
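
For completeness, one way to apply the function to the whole encoded list could be (a sketch, assuming the objects defined above):

# Sketch: neighborhood count for every word; each word has distance 0 to its
# own encoding, so it is never picked up by the "== 1" sums above.
neighbor_counts <- sapply(wordsT, onediffs.M)
head(neighbor_counts)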

Christoph Wolk:

Here's a version using Levenshtein Distance with the Wagner-Fischer algorithm.

vecLeven <- function(s, t) {
  d <- matrix(0, nrow = length(s) + 1, ncol=length(t) + 1)
  d[, 1] <- (1:nrow(d)) - 1
  d[1,] <- (1:ncol(d))-1
  for (i in 1:length(s))  {
    for (j in 1:length(t)) {
      d[i+1, j+1] <- min(
        d[i, j+1] + 1, # deletion
        d[i+1, j] + 1, # insertion
        d[i, j] + if (s[i] == t[j]) 0 else 1 # substitution
      )
    }
  }

  d[nrow(d), ncol(d)]
}
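
As a quick sanity check, the function can be called directly on two phoneme vectors from the question:

vecLeven(c("EY", "Z"), c("M", "EY", "Z"))  # 1: one phoneme added at the start
vecLeven(c("EY", "Z"), c("Z", "EY", "D"))  # 2: more than one edit, not a neighbor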


onediff <- sapply(words[1:10], function(x) {
  lengthdiff <- sapply(words, function(word) abs(length(word) - length(x)))
  # same length: exactly one position differs (substitution);
  # length differs by one: Levenshtein distance of 1 (one insertion/deletion)
  sum(sapply(words[lengthdiff == 0], function(word) sum(word != x) == 1)) +
    sum(mapply(vecLeven, list(x), words[lengthdiff == 1]) == 1)
})

I tested both versions on the CMU dictionary, which has a similar size. It's a bit faster than your version (about 30 seconds instead of 50 for 10 words), and should parallelize well. Still, running it on the complete data set would take several days.

One large performance factor is that all pairs are computed twice, once for the first word and once for the second; doing a lookup instead would halve that. However, there are more than 7 billion pairs, so you would need a database to store them.
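
Just to illustrate the "each pair only once" idea (not practical in pure R at this scale, since it still visits every pair, but it avoids the separate storage by crediting both words immediately):

# Illustration only: visit each unordered pair once and credit both words.
neighbor_counts <- integer(length(words))
for (i in seq_len(length(words) - 1)) {
  for (j in (i + 1):length(words)) {
    if (abs(length(words[[i]]) - length(words[[j]])) > 1) next
    if (vecLeven(words[[i]], words[[j]]) == 1) {
      neighbor_counts[i] <- neighbor_counts[i] + 1
      neighbor_counts[j] <- neighbor_counts[j] + 1
    }
  }
}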

F. Privé:

So, the key here is to separate the words by length, so that each case (substitution/addition/deletion) is tested only against the relevant subset.

get_one_diff <- function(words) {

  # le[i] is the length of word i; group the words by length
  K <- max(le <- lengths(words))
  i_chr <- as.character(seq_len(K))
  words.spl <- split(words, le)

  # substitution: compare with words of the same length, exactly one position differs
  test_substitution <- function(i) {
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le[i]]]], function(word2) {
      sum(word1 != word2) == 1
    }))
  }

  # addition: compare with words that are one phoneme longer
  test_addition <- function(i) {
    if ((le <- le[i]) == K) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le + 1]]], function(word2) {
      isOneDiff(word1, word2)
    }))
  }

  # deletion: compare with words that are one phoneme shorter
  test_deletion <- function(i) {
    if ((le <- le[i]) == 1) return(0)
    word1 <- words[[i]]
    do.call(sum, lapply(words.spl[[i_chr[le - 1]]], function(word2) {
      isOneDiff(word2, word1)
    }))
  }

  sapply(seq_along(words), function(i) {
    test_substitution(i) + test_addition(i) + test_deletion(i)
  })
}

where isOneDiff is an Rcpp function:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool isOneDiff(const StringVector& w1,
               const StringVector& w2) {

  // w1 has n phonemes, w2 has n + 1: check whether w2 is w1 with one
  // phoneme inserted somewhere
  int i, n = w1.size();

  // skip the common prefix
  for (i = 0; i < n; i++) if (w1[i] != w2[i]) break;
  // after the insertion point, w2 must match w1 shifted by one
  for (     ; i < n; i++) if (w1[i] != w2[i+1]) return false;

  return true;
}
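
Assuming the C++ code is saved in its own file (the file name below is just an example), it can be compiled with Rcpp and the whole thing run like this:

library(Rcpp)
sourceCpp("isOneDiff.cpp")  # compiles the file and exports isOneDiff() into R

diffs <- get_one_diff(words)
head(diffs)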

This is 20 times as fast as your version, and since it is merely an sapply, it could easily be parallelized.
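
For example, the final sapply inside get_one_diff could be swapped for a parallel apply; a minimal sketch (assuming a Unix-alike system and 4 cores):

library(parallel)

# Inside get_one_diff, replace the closing sapply(seq_along(words), ...) with
# something like this (mclapply forks, so use parLapply on Windows instead):
unlist(mclapply(seq_along(words), function(i) {
  test_substitution(i) + test_addition(i) + test_deletion(i)
}, mc.cores = 4))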