Adding dataframe column with frequency counts for several pre-specified words in R

Question

37 views Asked by lwe At 09 March 2023 at 15:39

I have a dataframe of thousands of news articles that looks like this:

id	text	date
1	newyorktimes leaders gather for the un summit in next week to discuss	1980-1-18
2	newyorktimes opinion section what the washingtonpost got wrong about	1980-1-22
3	a journalist for the washingtonpost went missing while on assignment	1980-1-22
4	washingtonpost president carter responds to criticisms on economic decline	1980-1-28
5	newyorktimes opinion section what needs to be down with about the rats	1980-1-29

I want to add a column that holds the combined count of several specific words in each article's text. Say I want to know how many times "newyorktimes", "washingtonpost", and "the" appear in each article. The new column should contain, for each row, the total number of occurrences of those words. Like this:

id	text	date	wordlistcount
1	newyorktimes leaders gather for the un summit in next week to discuss	1980-1-18	2
2	newyorktimes opinion section what the washingtonpost and newyorktimes got wrong	1980-1-22	4
3	a journalist for the washingtonpost went missing while on assignment	1980-1-22	2
4	washingtonpost president carter responds to criticisms on economic decline	1980-1-28	1
5	newyorktimes opinion section what needs to be done with about the rats	1980-1-29	2

How can I accomplish this? Any help would be greatly appreciated.

There are 2 answers

DPH On 09 March 2023 at 15:53

Searching with a regex can be a bit tricky. In your case "the" is a word, but it can also be part of other words (like "gather" in the first line of your dummy data). To make sure you count only the standalone word, you can search for "the" while requiring that the characters immediately before and after it are anything but a letter.

library(dplyr)


mydf <- data.table::fread("id   text    date
    1   newyorktimes leaders gather for the un summit in next week to discuss   1980-1-18
    2   newyorktimes opinion section what the washingtonpost and newyorktimes got wrong     1980-1-22
    3   a journalist for the washingtonpost went missing while on assignment    1980-1-22
    4   washingtonpost president carter responds to criticisms on economic decline  1980-1-28
    5   newyorktimes opinion section what needs to be down with about the rats  1980-1-29")

# vector of search words where [^\\p{L}] is anything but a letter from any alphabet
search_vec <- c("newyorktimes","washingtonpost","[^\\p{L}]the[^\\p{L}]") 

mydf %>% 
    dplyr::mutate(wordlistcount = stringr::str_count(text, pattern = paste(search_vec, collapse = "|")))

   id                                                                            text       date wordlistcount
1:  1           newyorktimes leaders gather for the un summit in next week to discuss 1980-01-18             2
2:  2 newyorktimes opinion section what the washingtonpost and newyorktimes got wrong 1980-01-22             4
3:  3            a journalist for the washingtonpost went missing while on assignment 1980-01-22             2
4:  4      washingtonpost president carter responds to criticisms on economic decline 1980-01-28             1
5:  5          newyorktimes opinion section what needs to be down with about the rats 1980-01-29             2
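
One caveat with the pattern above: `[^\\p{L}]` has to match an actual character on each side, so a "the" at the very start or end of the text is skipped, and two consecutive occurrences share the space between them, so only one of them is counted. A minimal sketch of an alternative, assuming lookarounds are acceptable (stringr's ICU regex engine supports them); the names `consuming`, `lookaround`, and `x` are only illustrative:

```r
library(stringr)

# Pattern from the answer: consumes one non-letter on each side of "the"
consuming  <- "[^\\p{L}]the[^\\p{L}]"
# Lookarounds only assert "no letter here" without consuming a character,
# so matches at the string edges and back-to-back matches are still counted
lookaround <- "(?<!\\p{L})the(?!\\p{L})"

# Made-up test strings (not from the question's data)
x <- c("the cat saw the the dog", "gather them there")

str_count(x, consuming)   # 1 0
str_count(x, lookaround)  # 3 0
```

In the search vector, the third element could be swapped for the lookaround version without changing the rest of the pipeline.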

Your data looks OK, but I will point out anyway that, depending on your use case, you might want to convert all text to lower case before or inside the str_count call. This ensures that differences between upper and lower case do not interfere with the string matching (i.e. "the" != "The"). Converting all text to upper case and writing the search words in upper case is equivalent.
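
A short sketch of the two usual ways to make the count case-insensitive, assuming stringr is used: lower-casing the text first, or wrapping the pattern in `regex(..., ignore_case = TRUE)`. The sample string and the word-boundary pattern here are illustrative, not taken from the question's data:

```r
library(stringr)

words   <- c("newyorktimes", "washingtonpost", "the")
# Build "\\bnewyorktimes\\b|\\bwashingtonpost\\b|\\bthe\\b"
pattern <- str_c("\\b", words, "\\b", collapse = "|")

# Made-up mixed-case example
x <- "The NewYorkTimes and the washingtonpost"

n_sensitive <- str_count(x, pattern)                             # 2
n_lowered   <- str_count(str_to_lower(x), pattern)               # 4
n_ignore    <- str_count(x, regex(pattern, ignore_case = TRUE))  # 4
```

Using `regex(..., ignore_case = TRUE)` leaves the original text column untouched, which may be preferable if the capitalisation is needed later.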

**Maël** · Accepted Answer · 2023-03-09T15:42:58+00:00

In stringr, with str_count:

library(stringr)
library(dplyr)

# df is the data frame from the question; \\b restricts each word to whole-word matches
words <- c("newyorktimes", "washingtonpost", "the")
df %>% 
  mutate(wordlistcount = str_count(text, str_c("\\b", words, "\\b", collapse = "|")))

#   id                                                                       text      date wordlistcount
# 1  1      newyorktimes leaders gather for the un summit in next week to discuss 1980-1-18             2
# 2  2       newyorktimes opinion section what the washingtonpost got wrong about 1980-1-22             3
# 3  3       a journalist for the washingtonpost went missing while on assignment 1980-1-22             2
# 4  4 washingtonpost president carter responds to criticisms on economic decline 1980-1-28             1
# 5  5     newyorktimes opinion section what needs to be down with about the rats 1980-1-29             2
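
If adding stringr and dplyr is not an option, roughly the same count can be sketched in base R with `gregexpr()`. This is an assumed equivalent rather than part of the answer above; the three-row `df` is a cut-down copy of the question's data so the block runs on its own:

```r
# Cut-down copy of the question's data (first, second and fourth rows)
df <- data.frame(
  id = 1:3,
  text = c(
    "newyorktimes leaders gather for the un summit in next week to discuss",
    "newyorktimes opinion section what the washingtonpost got wrong about",
    "washingtonpost president carter responds to criticisms on economic decline"
  )
)

words   <- c("newyorktimes", "washingtonpost", "the")
# One alternation wrapped in word boundaries: \b(newyorktimes|washingtonpost|the)\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# gregexpr() returns -1 when nothing matches, so count only positive positions
df$wordlistcount <- vapply(
  gregexpr(pattern, df$text, perl = TRUE),
  function(m) sum(m > 0),
  integer(1)
)

df$wordlistcount  # 2 3 1
```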
