Finding word frequency of wordlist with multiple word-chunks


I am doing textual analysis in R. So far I have always used a simple bag-of-words approach, meaning that I was looking for a wordlist of single words in a large corpus of different texts.

I've been using this approach:

library(tm)
library(quanteda)

## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"

### tm VCorpus object to Quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus"    "character"

### continue with tokenization and stemming
toks <- tokens(x) 
toks <- tokens_wordstem(toks) 
dtm <- dfm(toks)

# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")                                 
dict_dtm2

However, now I want to change things up: I want to look for a wordlist that contains a) single words and b) multi-word expressions.

The wordlist now looks like this:

dictionary_2 <- dictionary(list(wordlist = c("januar*", "not in februar*", "holiday in march", "on vacation")))

I am stuck with this new wordlist, since I can no longer use a document-term matrix: it does not keep the word chunks together. Any idea how I can do this?
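For what it's worth, one direction I found in the quanteda docs (a sketch, not yet tried on my real corpus) is `tokens_lookup()`: if I understand correctly, it matches multi-word dictionary values as phrases when applied to a tokens object rather than to a dfm, so the word chunks stay intact.

```r
library(quanteda)

docs <- c("I like going on holiday in january and not in february.",
          "I went on a holiday in march.",
          "I like going on vacation.")

# tokenize WITHOUT stemming here: the stem "vacat" would no longer
# match the pattern "on vacation"
toks <- tokens(corpus(docs))

# multi-word values like "not in februar*" are matched as phrases
# by tokens_lookup() (matching is case-insensitive by default)
dictionary_2 <- dictionary(list(wordlist = c("januar*", "not in februar*",
                                             "holiday in march", "on vacation")))

toks_matched <- tokens_lookup(toks, dictionary_2, nomatch = "_unmatched")
dfm(toks_matched)
```

If stemming is still needed, I assume the dictionary patterns would have to be stemmed to match, e.g. "on vacat*" instead of "on vacation".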
