I'm working with a large corpus of 1M documents and applied several transformations when creating a document-feature matrix (dfm) from it:
library(quanteda)
corpus_dfm <- dfm(tokens(corpus1M), # where corpus1M is already a corpus via quanteda::corpus()
                  remove = stopwords("english"),
                  # what = "word", # experimented if adding this made a difference
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)
Then to look at the resulting features:
dimnames(corpus_dfm)$features
[1] "abandon"
[2] "abandoned auto"
[3] "abandoned vehicl"
...
[8] "accident hit and run"
...
[60] "assault no weapon aggravated injuri"
Why are these features longer than the unigrams and bigrams I asked for with ngrams = 1:2? Stemming appears to have worked, but the features look like whole phrases rather than words.
I tried changing the start of the call to dfm(tokens(corpus1M, what = "word"), ...),
but there was no change.
I tried to make a tiny reproducible example:
library(tidyverse) # just for the pipe here
example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs") %>% corpus
Then if I apply the same dfm as above:
> dimnames(corpus_dfm)$features
[1] "etc."
This was surprising because nearly all words have been removed, not only the stopwords, which is different from the behaviour above. So now I'm more confused, and I still can't reproduce the original problem. Maybe I've misunderstood how this function works?
How can I create a dfm in quanteda where there are only 1:2 word tokens and where stopwords are removed?
First question: Why are the feature (names) in the dfm so long?
Answer: Because applying the dictionary in the dfm() call replaces the unigram and bigram features it matches with the dictionary keys, and (many of) the keys in your dictionary consist of multiple words.
Second question: In the reproducible example, why are almost all words gone?
Answer: Because the only match of a dictionary value to the features in the dfm was to the "etc." category.
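Both answers can be illustrated with a minimal sketch. The real lut_dict is not shown in the question, so the dictionary below is a hypothetical stand-in with one multi-word key and one "etc." key. It also assumes a quanteda version that provides the tokens_*() helpers and dfm_lookup(), and does each step explicitly instead of passing everything to dfm():

```r
library(quanteda)

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs")

# Hypothetical stand-in for lut_dict (the real dictionary is not shown)
lut_dict <- dictionary(list("etc." = "etc",
                            "accident hit and run" = "collision"))

# Tokenize, drop stopwords, stem, then form unigrams and bigrams
toks <- tokens(example_text, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = 1:2, concatenator = " ")

# Exclusive lookup (the default): matched features are replaced by their
# dictionary key and every unmatched feature is dropped
dfm_dict <- dfm_lookup(dfm(toks), dictionary = lut_dict)
featnames(dfm_dict)
```

The feature names that come back are dictionary keys, not tokens from the text, which is why long keys show up as long "features" and why words without a dictionary match disappear.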
If you do not apply the dictionary, the features are the stemmed unigrams and bigrams that remain after stopword removal.
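A minimal sketch of the dictionary-free version, which is also one way to get a dfm of only 1:2 word features with stopwords removed. It assumes a quanteda version that provides tokens_remove(), tokens_wordstem() and tokens_ngrams(); the object names are illustrative only:

```r
library(quanteda)

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs")

# Tokenize and clean up punctuation, numbers and symbols
toks <- tokens(example_text, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)

# Remove stopwords and stem BEFORE forming n-grams, so that no
# unigram or bigram contains a stopword
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# Unigrams and bigrams, joined by a space as in the question's output
toks <- tokens_ngrams(toks, n = 1:2, concatenator = " ")

dfm_nodict <- dfm(toks)
featnames(dfm_nodict)
```

Because stopwords are dropped before the bigrams are built, a bigram can join two words that were separated by a stopword in the original text. If that is not wanted, tokens_remove() has a padding argument that leaves an empty placeholder in place of each removed token.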
If you want to keep the features not matched, then replace dictionary with thesaurus. The "etc" token is then replaced with the upper-cased key "ETC.", while the unmatched features are kept as they are.
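The same keep-the-unmatched behaviour can be sketched with an explicit lookup step. This assumes a quanteda version that provides dfm_lookup(), where exclusive = FALSE keeps unmatched features and, by default, upper-cases the matched keys. The one-entry dictionary is a hypothetical stand-in for lut_dict:

```r
library(quanteda)

example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs")

# Hypothetical stand-in for lut_dict (the real dictionary is not shown)
lut_dict <- dictionary(list("etc." = "etc"))

toks <- tokens(example_text, remove_punct = TRUE,
               remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)
toks <- tokens_ngrams(toks, n = 1:2, concatenator = " ")

# Non-exclusive lookup: matched features are renamed to the upper-cased
# dictionary key, everything else is left untouched
dfm_thes <- dfm_lookup(dfm(toks), dictionary = lut_dict, exclusive = FALSE)
featnames(dfm_thes)
```

Only the exact "etc" unigram is replaced here; a bigram that merely contains "etc" does not match the dictionary value and keeps its original name.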