Is it possible to speed this function using dplyr and furrr up?

Question

Is it possible to speed this function using dplyr and furrr up?

183 views Asked by avgoustisw At 18 April 2022 at 12:56

I have a data frame where rows correspond to documents and columns capture individual words in those documents.

library(tidyverse)
library(furrr)
#> Loading required package: future

doc_by_word_df <- structure(list(
 doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
 kwpe_1 = c("apple", "fish", "apple"),
 kwpe_2 = c("bananna", "grain", "insects"),
 kwpe_3 = c("carrot", "insects", "grain")),
 class = c("tbl_df", "tbl", "data.frame"),
 row.names = c(NA,-3L))

doc_by_word_df
#> # A tibble: 3 × 4
#>   doc_id   kwpe_1 kwpe_2  kwpe_3 
#>   <chr>    <chr>  <chr>   <chr>  
#> 1 doc1.txt apple  bananna carrot 
#> 2 doc2.txt fish   grain   insects
#> 3 doc3.txt apple  insects grain

I would like to identify all the documents containing any of the possible pair combinations of words in those documents.

To do that I have created a vector of all the words in the data set and extracted all the unique word-pair combinations.

all_words <- c("apple", "fish", "apple", "bananna", "grain", "insects", "carrot", "insects", "grain")

unique_keyword_pair <- combn(unique(all_words), 2)

unique_keyword_pair
#>      [,1]    [,2]      [,3]    [,4]      [,5]     [,6]      [,7]    [,8]     
#> [1,] "apple" "apple"   "apple" "apple"   "apple"  "fish"    "fish"  "fish"   
#> [2,] "fish"  "bananna" "grain" "insects" "carrot" "bananna" "grain" "insects"
#>      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]    [,15]    
#> [1,] "fish"   "bananna" "bananna" "bananna" "grain"   "grain"  "insects"
#> [2,] "carrot" "grain"   "insects" "carrot"  "insects" "carrot" "carrot"

I've made a function that uses the unique word-pairs to filter out all the documents containing those word-pairs and mapped that function over the data frame.

It works in the way I'd like it to, but it takes a very long time to run. I have used the furrr package to try to speed this up, but I'm still left with a very long run time. Originally I did this with a for loop; I was under the impression that using the a map function would shorten things--but I don't think it made much difference.

I do not know enough about this sort of thing to sort out what I can do to reduce the length of time it takes to run this function. I suspect it has to do with the massive number of word-pair combinations being run through the filter function, but beyond that I'm unsure.

Any suggestions would be appreciated.

docs_word_pairs <- function(x) {
 doc_by_word_df %>% 
  filter(if_any(-doc_id, ~ . %in% unique_keyword_pair[,x][1]) & 
          if_any(-doc_id, ~ . %in% unique_keyword_pair[,x][2])) %>% 
  mutate(keyword_pair = paste(c(unique_keyword_pair[,x][1],
                                unique_keyword_pair[,x][2]), 
                              collapse = "-"),
         keyword_1 = unique_keyword_pair[,x][1],
         keyword_2 = unique_keyword_pair[,x][2]) %>% 
  relocate(keyword_pair:keyword_2, .before = doc_id) %>% 
  group_by(keyword_pair) %>%
  summarize(n = n())
}

num_unique_keyword_pair <- length(unique_keyword_pair)/2

seq_num_unique_keyword_pair <- rep(c(1:num_unique_keyword_pair))

future::plan(multisession)

seq_num_unique_keyword_pair %>% 
 future_map_dfr(docs_word_pairs)
#> # A tibble: 8 × 2
#>   keyword_pair       n
#>   <chr>          <int>
#> 1 apple-bananna      1 # one document contains this key word pair
#> 2 apple-grain        1
#> 3 apple-insects      1
#> 4 apple-carrot       1
#> 5 fish-grain         1
#> 6 fish-insects       1
#> 7 bananna-carrot     1
#> 8 grain-insects      2 # two documents contain this key word pair

^{Created on 2022-04-18 by the reprex package (v2.0.1)}

Original Q&A

There are 1 answers

**Onyambu** · Accepted Answer · 2022-04-18T13:41:39+00:00

This can quicly be done as shown below:

as.dist(crossprod(table(cbind(doc_by_word_df[,1],unlist(doc_by_word_df[-1])))))

        apple bananna carrot fish grain
bananna     1                          
carrot      1       1                  
fish        0       0      0           
grain       1       0      0    1      
insects     1       0      0    1     2

or even

doc_by_word_df %>%
  pivot_longer(-doc_id) %>%
  select(-name) %>%
  table() %>%
  crossprod() %>%
  as.dist()

        apple bananna carrot fish grain
bananna     1                          
carrot      1       1                  
fish        0       0      0           
grain       1       0      0    1      
insects     1       0      0    1     2

If you want it as a dataframe do:

df2 <- crossprod(table(cbind(doc_by_word_df[,1],unlist(doc_by_word_df[-1]))))
subset(data.frame(as.table(as.matrix(as.dist(df2)))), Freq > 0)
      Var1    Var2 Freq
2  bananna   apple    1
3   carrot   apple    1
5    grain   apple    1
6  insects   apple    1
7    apple bananna    1
9   carrot bananna    1
13   apple  carrot    1
14 bananna  carrot    1
23   grain    fish    1
24 insects    fish    1
25   apple   grain    1
28    fish   grain    1
30 insects   grain    2
31   apple insects    1
34    fish insects    1
35   grain insects    2

TechQA.

Is it possible to speed this function using dplyr and furrr up?

There are 1 answers

Related Questions in R

Related Questions in DPLYR

Related Questions in PURRR

Related Questions in FURRR

Popular Questions

Popular Tags

Trending Questions