Adding dataframe column with frequency counts for several pre-specified words in R


I have a dataframe of thousands of news articles that looks like this:

id text date
1 newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18
2 newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22
3 a journalist for the washingtonpost went missing while on assignment 1980-1-22
4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28
5 newyorktimes opinion section what needs to be down with about the rats 1980-1-29

I want to produce an additional column that has the combined counts for several specific words in the articles themselves. Let's say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article. I would want a separate column added to the dataframe adding the counts for that row. Like this:

id text date wordlistcount
1 newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18 2
2 newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 1980-1-22 4
3 a journalist for the washingtonpost went missing while on assignment 1980-1-22 2
4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28 1
5 newyorktimes opinion section what needs to be done with about the rats 1980-1-29 2

How can I accomplish this? Any help would be greatly appreciated.


There are 2 answers

Maël (best answer)

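With stringr, you can use str_count on a single regex that wraps each word in word boundaries (\b) and joins the alternatives with "|":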

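library(stringr)
library(dplyr)
# words to count in each article
words <- c("newyorktimes", "washingtonpost", "the")
# count every match of any of the words, each wrapped in \b (word boundary)
df %>% 
  mutate(wordlistcount = str_count(text, str_c("\\b", words, "\\b", collapse = "|")))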




#   id                                                                       text      date wordlistcount
# 1  1      newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18             2
# 2  2       newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22             3
# 3  3       a journalist for the washingtonpost went missing while on assignment 1980-1-22             2
# 4  4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28             1
# 5  5     newyorktimes opinion section what needs to be down with about the rats 1980-1-29             2
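
To make the pattern above easier to inspect, here is a minimal sketch that builds it separately and tries it on a single string. The `sample_text` value is a made-up sentence for illustration, not a row from the question's data.

```r
library(stringr)

words <- c("newyorktimes", "washingtonpost", "the")

# Wrap each word in \b (word boundary) and join the alternatives with "|"
pattern <- str_c("\\b", words, "\\b", collapse = "|")
print(pattern)
# [1] "\\bnewyorktimes\\b|\\bwashingtonpost\\b|\\bthe\\b"

# Hypothetical sample sentence (assumption, not from the question)
sample_text <- "the newyorktimes leaders gather for the summit"

# Counts "the", "newyorktimes", "the"; the "the" inside "gather" is skipped
n_bounded <- str_count(sample_text, pattern)
print(n_bounded)
# [1] 3
```

Because the whole alternation is one regex, str_count returns the combined number of matches per string, which is what the wordlistcount column needs.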
DPH

Searching with regex can be a bit tricky. In your case "the" is a word, but it can also be part of other words (like "gather" in the first row of your dummy data). To be sure you only count the standalone word, you can search for "the" while requiring that the character immediately before and after it is anything but a letter.

library(dplyr)


mydf <- data.table::fread("id   text    date
    1   newyorktimes leaders gather for the un summit in next week to discuss   1980-1-18
    2   newyorktimes opinion section what the washingtonpost and newyorktimes got wrong     1980-1-22
    3   a journalist for the washingtonpost went missing while on assignment    1980-1-22
    4   washingtonpost president carter responds to criticisms on economic decline  1980-1-28
    5   newyorktimes opinion section what needs to be down with about the rats  1980-1-29")

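# vector of search words where [^\\p{L}] is any single character that is not a letter (from any alphabet)
# note: a non-letter character is required on both sides, so "the" at the very start or end of a text is not counted
search_vec <- c("newyorktimes","washingtonpost","[^\\p{L}]the[^\\p{L}]")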

mydf %>% 
    dplyr::mutate(wordlistcount = stringr::str_count(text, pattern = paste(search_vec, collapse = "|")))

   id                                                                            text       date wordlistcount
1:  1           newyorktimes leaders gather for the un summit in next week to discuss 1980-01-18             2
2:  2 newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 1980-01-22             4
3:  3            a journalist for the washingtonpost went missing while on assignment 1980-01-22             2
4:  4      washingtonpost president carter responds to criticisms on economic decline 1980-01-28             1
5:  5          newyorktimes opinion section what needs to be down with about the rats 1980-01-29             2
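
One thing to watch with a negated character class is that it consumes a real character on each side of "the". The sketch below compares it with a lookaround version, which only asserts that no letter is adjacent. The lookaround pattern and the test strings in `x` are my own illustration, not part of the answer's code.

```r
library(stringr)

# Hypothetical test strings (assumption, not from the question's data)
x <- c("the rats gather", "over the the top", "bathe")

# Negated character class: needs an actual non-letter before and after "the"
n_class <- str_count(x, "[^\\p{L}]the[^\\p{L}]")
print(n_class)
# [1] 0 1 0

# Lookarounds: assert "no letter here" without consuming any character
n_look <- str_count(x, "(?<!\\p{L})the(?!\\p{L})")
print(n_look)
# [1] 1 2 0
```

The two versions agree on the dummy data because no row starts or ends with "the" and no row has two in a row; they can differ on real articles.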

Your data looks OK, but I will point out anyway that, depending on your use case, you might want to convert all text to lower case before or inside the str_count call. This ensures that differences between upper and lower case do not interfere with the string matching (i.e. "the" != "The"). Converting all text to upper case and writing the search words in upper case is equivalent.
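
A minimal sketch of both options, assuming a word-boundary pattern and a made-up mixed-case string `x` (neither is taken from the question's data): lower-casing the text with str_to_lower, or leaving the text alone and asking stringr to ignore case via regex(..., ignore_case = TRUE).

```r
library(stringr)

words <- c("newyorktimes", "washingtonpost", "the")
pattern <- str_c("\\b", words, "\\b", collapse = "|")

# Hypothetical mixed-case text (assumption for illustration)
x <- "The NewYorkTimes and the washingtonpost"

# Case-sensitive: only "the" and "washingtonpost" match
n_sensitive <- str_count(x, pattern)
print(n_sensitive)
# [1] 2

# Option 1: lower-case the text before counting
n_lower <- str_count(str_to_lower(x), pattern)
print(n_lower)
# [1] 4

# Option 2: keep the text as-is and make the regex case-insensitive
n_ignore <- str_count(x, regex(pattern, ignore_case = TRUE))
print(n_ignore)
# [1] 4
```

Option 2 leaves the text column untouched, which may be preferable if the original casing is needed later.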