How to maximize R fuzzyjoin/stringdist speed and memory efficiency


I have two data frames containing short (length == 20) sequences that I want to compare with string distance analysis techniques, returning highly similar sequences with a Hamming distance of no greater than 3 (i.e. no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() accomplishes this task well, but it can't handle the quantity of sequences I want to compare (tens to hundreds of thousands of sequences in each data frame) unless I chunk the query sequences. When my data frames are on the larger side, this tactic takes full days to execute with the code below.

Is there some way I can use fuzzyjoin or stringdist packages with data.table to speed things up and preserve memory? I keep trying various things but they result in even slower execution speed.

library(tidyverse)
library(fuzzyjoin)

### simulate data ###

chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq),
                    seq = replicate(nq, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
                      seq = replicate(ns, sample(chars, 20, replace = TRUE) %>% paste0(collapse = "")))

### return seqs with 3 or fewer mismatches ###

# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
#                 by = "seq",
#                 method = "hamming",
#                 mode = "left",
#                 max_dist = 3,
#                 distance_col = "mismatches")

# chunk query values to preserve memory
query <- query %>%
  mutate(grp = (plyr::round_any(row_number(), 100)/100)+1)

# get a vector of all chunk ids
var.grps <- unique(query$grp)

# map over the chunks and row-bind the joins into one data frame
df_out <- purrr::map_df(var.grps, function(i) {
  q <- filter(query, grp == i)
  dat <- stringdist_join(q, subject,
                         by = "seq",
                         max_dist = 3,
                         method = "hamming",
                         mode = "left",
                         ignore_case = TRUE,
                         distance_col = "mismatch")
  return(dat)
})

1 Answer

Bryan (Best Answer)

I figured it out: stringdist_join() uses stringdistmatrix() behind the scenes. It's much faster to use stringdistmatrix() directly and collect the desired information from it. To overcome memory issues, I chunked the query sequences and filled a preallocated matrix.

library(stringdist)

# build the full query x subject Hamming distance matrix in chunks
chunk_size <- 1000
num_rows <- nrow(query)

# Preallocate the full distance matrix (filled chunk by chunk below)
sdm <- matrix(0, nrow = num_rows, ncol = nrow(subject))

# Loop through the rows in chunks
for (start_row in seq(1, num_rows, by = chunk_size)) {
  end_row <- min(start_row + chunk_size - 1, num_rows)
  
  # Subset the rows for the current chunk
  chunk_query <- query$seq[start_row:end_row]
  
  # Compute stringdist matrix for the current chunk
  chunk_sdm <- stringdistmatrix(chunk_query, subject$seq, method = "hamming")
  
  # Assign the chunk_sdm to the corresponding rows in the main sdm matrix
  sdm[start_row:end_row, ] <- chunk_sdm
}
rownames(sdm) <- query$name
colnames(sdm) <- subject$name

# find the indices where the values are 3 or fewer
indices <- which(sdm <= 3, arr.ind = TRUE)

# extract row names, col names, and values based on the indices
result <- data.frame(query = rownames(sdm)[indices[, 1]],
                     subject = colnames(sdm)[indices[, 2]],
                     mismatch = sdm[indices])