I am doing textual analysis in R. So far I've always used a simple bag-of-words approach, meaning that I was looking for a wordlist of single words in a large corpus of different texts.
I've been using this approach:
library(tm)
library(quanteda)
## prepare reprex, create tm VCorpus:
docs <- c("I like going on holiday in january and not in february.",
"I went on a holiday in march.",
"I like going on vacation.")
x <- VCorpus(VectorSource(docs))
class(x)
#> [1] "VCorpus" "Corpus"
### tm VCorpus object to Quanteda corpus:
x <- corpus(x)
class(x)
#> [1] "corpus" "character"
### continue with tokenization and stemming
toks <- tokens(x)
toks <- tokens_wordstem(toks)
dtm <- dfm(toks)
# dictionary() takes a named list, i.e. list(months = c(..))
# and "january", "february" are stemmed to "januari", "februari"
dict1 <- dictionary(list(months = c("januar*", "februar*", "march")))
dict_dtm2 <- dfm_lookup(dtm, dict1, nomatch="_unmatched")
dict_dtm2
However, now I want to change things up: I want to look for a wordlist that contains a) single words and b) multi-word expressions.
The wordlist now looks like this:
dictionary_2 <- dictionary(list(wordlist = c("januar*", "not in februar*", "holiday in march", "on vacation")))
I am stuck with this new option, since I can no longer use the document-feature matrix: it doesn't keep the multi-word chunks together. Any idea how I can do this?
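For reference, here is a minimal sketch of the direction I was considering (I'm not sure this is the intended use of tokens_lookup()): applying the dictionary to the tokens object instead of the dfm, since the tokens still preserve word order, and only building the dfm afterwards.
### my attempt: look up the patterns on the tokens, where word order still exists
### (using unstemmed tokens here, because stemming would turn "vacation" into
###  "vacat" and the phrase "on vacation" would no longer match)
toks_raw <- tokens(x)
toks_matched <- tokens_lookup(toks_raw, dictionary_2)
dfm(toks_matched)
If this is on the right track, the idea is that the lookup happens while the token sequence is intact, so multi-word values like "not in februar*" can be matched as phrases, and only the matched result is turned into a dfm. Is that how tokens_lookup() is supposed to be used, or is there a better way?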