R extract most common word(s) / ngrams in a column by group

532 views Asked by At

I wish to extract main keywords from the column 'title', for each group (1st column).

data

Desired result in column 'desired title':

desired

Reproducible data:

myData <- 
structure(list(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), title = c("mentoring aug 8th 2018", 
"mentoring aug 9th 2017", "mentoring aug 9th 2018", "mentoring august 31", 
"mentoring blue care", "mentoring cara casual", "mentoring CDP", 
"mentoring cell douglas", "mentoring centurion", "mentoring CESO", 
"mentoring charlotte", "medication safety focus", "medication safety focus month", 
"medication safety for nurses 2017", "medication safety formulations errors", 
"medication safety foundations care", "medication safety general", 
"communication surgical safety", "communication tips", "communication tips for nurses", 
"communication under fire", "communication webinar", "communication welling", 
"communication wellness")), row.names = c(NA, -24L), class = c("tbl_df", 
"tbl", "data.frame"))

I've looked into record linkage solutions, but that's mainly for grouping the full titles. Any suggestions would be great.

1

There are 1 answers

1
Yeshyyy On BEST ANSWER

I concatenated all titles by group, and tokenized them:

library(dplyr)
myData <-
  topic_modelling %>% 
  group_by(group) %>% 
  mutate(titles = paste0(title, collapse = " ")) %>%
  select(group, titles) %>% 
  distinct()

myTokens <- myData %>% 
  unnest_tokens(word, titles) %>% 
  anti_join(stop_words, by = "word")
myTokens

Below is the resulting dataframe: tokens

# finding top ngrams
library(textrank)

stats <- textrank_keywords(myTokens$word, ngram_max = 3, sep = " ")
stats <- subset(stats$keywords, ngram > 0 & freq >= 3)
head(stats, 5)

I'm happy with the result: res

While applying the algorithm to my real data of about 100000 lines, I made a function to tackle the problem group by group:

# FUNCTION: TOP NGRAMS ----
find_top_ngrams <- function(titles_concatenated)
{
  myTest <-
    titles_concatenated %>%
    as_tibble() %>%
    unnest_tokens(word, value) %>%
    anti_join(stop_words, by = "word")
  
  stats <- textrank_keywords(myTest$word, ngram_max = 4, sep = " ")
  stats <- subset(stats$keywords, ngram > 1 & freq >= 5)
  top_ngrams <- head(stats, 5)
  
  top_ngrams <- tibble(top_ngrams)
  
  return(top_ngrams)
  
  # print(top_ngrams)
  
}


for (i in 1:5){
  find_top_ngrams(myData$titles[i])
}