I have a data frame where rows correspond to documents and columns capture individual words in those documents.
library(tidyverse)
library(furrr)
#> Loading required package: future
doc_by_word_df <- structure(list(
doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
kwpe_1 = c("apple", "fish", "apple"),
kwpe_2 = c("bananna", "grain", "insects"),
kwpe_3 = c("carrot", "insects", "grain")),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-3L))
doc_by_word_df
#> # A tibble: 3 × 4
#> doc_id kwpe_1 kwpe_2 kwpe_3
#> <chr> <chr> <chr> <chr>
#> 1 doc1.txt apple bananna carrot
#> 2 doc2.txt fish grain insects
#> 3 doc3.txt apple insects grain
I would like to identify all the documents containing any of the possible pair combinations of words in those documents.
To do that I have created a vector of all the words in the data set and extracted all the unique word-pair combinations.
all_words <- c("apple", "fish", "apple", "bananna", "grain", "insects", "carrot", "insects", "grain")
unique_keyword_pair <- combn(unique(all_words), 2)
unique_keyword_pair
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
#> [1,] "apple" "apple" "apple" "apple" "apple" "fish" "fish" "fish"
#> [2,] "fish" "bananna" "grain" "insects" "carrot" "bananna" "grain" "insects"
#> [,9] [,10] [,11] [,12] [,13] [,14] [,15]
#> [1,] "fish" "bananna" "bananna" "bananna" "grain" "grain" "insects"
#> [2,] "carrot" "grain" "insects" "carrot" "insects" "carrot" "carrot"
I've made a function that uses the unique word-pairs to filter out all the documents containing those word-pairs and mapped that function over the data frame.
It works the way I'd like it to, but it takes a very long time to run. I have used the furrr
package to try to speed this up, but I'm still left with a very long run time. Originally I did this with a for loop; I was under the impression that using a map function would shorten things, but I don't think it made much difference.
I don't know enough about this sort of thing to work out how to reduce the run time. I suspect the problem is the massive number of word-pair combinations being run through the filter function, but beyond that I'm unsure.
Any suggestions would be appreciated.
docs_word_pairs <- function(x) {
  doc_by_word_df %>%
    filter(if_any(-doc_id, ~ . %in% unique_keyword_pair[1, x]) &
           if_any(-doc_id, ~ . %in% unique_keyword_pair[2, x])) %>%
    mutate(keyword_pair = paste(unique_keyword_pair[, x], collapse = "-"),
           keyword_1 = unique_keyword_pair[1, x],
           keyword_2 = unique_keyword_pair[2, x]) %>%
    relocate(keyword_pair:keyword_2, .before = doc_id) %>%
    group_by(keyword_pair) %>%
    summarize(n = n())
}
num_unique_keyword_pair <- ncol(unique_keyword_pair)
seq_num_unique_keyword_pair <- seq_len(num_unique_keyword_pair)
future::plan(multisession)
seq_num_unique_keyword_pair %>%
  future_map_dfr(docs_word_pairs)
#> # A tibble: 8 × 2
#> keyword_pair n
#> <chr> <int>
#> 1 apple-bananna 1 # one document contains this key word pair
#> 2 apple-grain 1
#> 3 apple-insects 1
#> 4 apple-carrot 1
#> 5 fish-grain 1
#> 6 fish-insects 1
#> 7 bananna-carrot 1
#> 8 grain-insects 2 # two documents contain this key word pair
Created on 2022-04-18 by the reprex package (v2.0.1)
This can quickly be done as shown below:
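A sketch of the idea in base R: instead of filtering the whole data frame once per candidate pair (15 filter passes here, far more on real data), enumerate the pairs each document actually contains and tabulate them in a single pass. The data definition is repeated from the question so the snippet runs on its own.

```r
# doc_by_word_df as defined in the question
doc_by_word_df <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
  kwpe_1 = c("apple", "fish", "apple"),
  kwpe_2 = c("bananna", "grain", "insects"),
  kwpe_3 = c("carrot", "insects", "grain"))

# For each document (row), sort its words, enumerate the word pairs it
# actually contains, then count how many documents produce each pair.
pair_counts <- table(unlist(
  apply(doc_by_word_df[-1], 1,
        function(x) combn(sort(x), 2, paste, collapse = "-"))))
pair_counts
```

Sorting each row first means every unordered pair gets a single canonical label (e.g. `"grain-insects"`, never `"insects-grain"`), so `table()` aggregates across documents correctly; on this data it yields 8 pairs, with `grain-insects` appearing in 2 documents.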
or even
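A tidyverse alternative along the same lines (a sketch; the intermediate name `long` is illustrative): pivot to one word per row, self-join on `doc_id`, and keep each unordered pair once before counting.

```r
library(dplyr)
library(tidyr)

doc_by_word_df <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
  kwpe_1 = c("apple", "fish", "apple"),
  kwpe_2 = c("bananna", "grain", "insects"),
  kwpe_3 = c("carrot", "insects", "grain"))

# One row per (document, word)
long <- doc_by_word_df %>%
  pivot_longer(-doc_id, values_to = "word") %>%
  select(doc_id, word)

# Self-join pairs every word in a document with every other word in the
# same document; word.x < word.y keeps each unordered pair exactly once.
pair_counts_tbl <- long %>%
  inner_join(long, by = "doc_id") %>%
  filter(word.x < word.y) %>%
  count(word.x, word.y)
pair_counts_tbl
```

Note the self-join is deliberately many-to-many (recent dplyr versions warn about this); the `word.x < word.y` filter then discards the diagonal and the mirrored duplicates.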
If you want it as a dataframe do:
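A sketch of the conversion, assuming you start from the `table()` result above (column names chosen to match the question's output):

```r
doc_by_word_df <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt", "doc3.txt"),
  kwpe_1 = c("apple", "fish", "apple"),
  kwpe_2 = c("bananna", "grain", "insects"),
  kwpe_3 = c("carrot", "insects", "grain"))

pair_counts <- table(unlist(
  apply(doc_by_word_df[-1], 1,
        function(x) combn(sort(x), 2, paste, collapse = "-"))))

# as.data.frame() on a one-dimensional table gives Var1/Freq columns;
# setNames() renames them to match the question's keyword_pair/n output.
pair_df <- setNames(as.data.frame(pair_counts), c("keyword_pair", "n"))
pair_df
```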