R extract most common word(s) / ngrams in a column by group

Question

R extract most common word(s) / ngrams in a column by group

534 views Asked by Yeshyyy At 11 September 2020 at 03:35

I wish to extract main keywords from the column 'title', for each group (1st column).

Desired result in column 'desired title':

Reproducible data:

myData <- 
structure(list(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3), title = c("mentoring aug 8th 2018", 
"mentoring aug 9th 2017", "mentoring aug 9th 2018", "mentoring august 31", 
"mentoring blue care", "mentoring cara casual", "mentoring CDP", 
"mentoring cell douglas", "mentoring centurion", "mentoring CESO", 
"mentoring charlotte", "medication safety focus", "medication safety focus month", 
"medication safety for nurses 2017", "medication safety formulations errors", 
"medication safety foundations care", "medication safety general", 
"communication surgical safety", "communication tips", "communication tips for nurses", 
"communication under fire", "communication webinar", "communication welling", 
"communication wellness")), row.names = c(NA, -24L), class = c("tbl_df", 
"tbl", "data.frame"))

I've looked into record linkage solutions, but that's mainly for grouping the full titles. Any suggestions would be great.

Original Q&A

There are 1 answers

**Yeshyyy** · Accepted Answer · 2020-09-22T02:12:17+00:00

I concatenated all titles by group, and tokenized them:

library(dplyr)
myData <-
  topic_modelling %>% 
  group_by(group) %>% 
  mutate(titles = paste0(title, collapse = " ")) %>%
  select(group, titles) %>% 
  distinct()

myTokens <- myData %>% 
  unnest_tokens(word, titles) %>% 
  anti_join(stop_words, by = "word")
myTokens

Below is the resulting dataframe:

# finding top ngrams
library(textrank)

stats <- textrank_keywords(myTokens$word, ngram_max = 3, sep = " ")
stats <- subset(stats$keywords, ngram > 0 & freq >= 3)
head(stats, 5)

I'm happy with the result:

While applying the algorithm to my real data of about 100000 lines, I made a function to tackle the problem group by group:

# FUNCTION: TOP NGRAMS ----
find_top_ngrams <- function(titles_concatenated)
{
  myTest <-
    titles_concatenated %>%
    as_tibble() %>%
    unnest_tokens(word, value) %>%
    anti_join(stop_words, by = "word")
  
  stats <- textrank_keywords(myTest$word, ngram_max = 4, sep = " ")
  stats <- subset(stats$keywords, ngram > 1 & freq >= 5)
  top_ngrams <- head(stats, 5)
  
  top_ngrams <- tibble(top_ngrams)
  
  return(top_ngrams)
  
  # print(top_ngrams)
  
}


for (i in 1:5){
  find_top_ngrams(myData$titles[i])
}

TechQA.

R extract most common word(s) / ngrams in a column by group

There are 1 answers

Related Questions in TM

Related Questions in TOPIC-MODELING

Related Questions in N-GRAM

Related Questions in UDPIPE

Related Questions in TEXTRANK

Popular Questions

Popular Tags

Trending Questions