I have 2 data frames containing short (length == 20) sequences that I want to compare with string distance analysis techniques, returning highly similar sequences with a hamming distance of no greater than 3 (i.e. no more than 3 substitutions between the query and subject sequences). fuzzyjoin::stringdist_join() accomplishes this task well, but it can't handle the quantity of sequences I want to compare (tens to hundreds of thousands of sequences in each data frame) unless I chunk in the query sequences. When my data frames are on the larger side of things, this tactic starts taking full days to execute with the code below.
Is there some way I can use fuzzyjoin or stringdist packages with data.table to speed things up and preserve memory? I keep trying various things but they result in even slower execution speed.
library(tidyverse)
library(fuzzyjoin)
### simulate data ###
chars <- c("A", "C", "G", "T")
nq <- 50051
ns <- 54277
query <- data.frame(name = str_c("q", 1:nq),
seq = replicate(nq, sample(chars, 20, replace = T) %>% paste0(collapse = "")))
subject <- data.frame(name = str_c("s", 1:ns),
seq = replicate(ns, sample(chars, 20, replace = T) %>% paste0(collapse = "")))
### return seqs with 3 or less mismatches ###
# # NOT ENOUGH MEMORY
# stringdist_join(query, subject,
# by = "seq",
# method = "hamming",
# mode = "left",
# max_dist = 3,
# distance_col = "mismatches")
# chunk query values to preserve memory
query <- query %>%
mutate(grp = (plyr::round_any(row_number(), 100)/100)+1)
# get a variable of all groups
var.grps <- unique(query$grp)
# create an output list
df_out <- purrr::map_df(var.grps, function(i) {
q <- filter(query, grp == i)
dat <- stringdist_join(q, subject,
by = "seq",
max_dist = 3,
method = "hamming",
mode = "left",
ignore_case = TRUE,
distance_col = "mismatch")
return(dat)
})
I figured it out: stringdist_join() uses stringdistmatrix() behind the scenes. It's much faster to just use stringdistmatrix() and collect your desired information from it. To overcome memory issues, I chunked in the query sequences with an initial empty matrix.