Why does featnames(myDFM) contain features of more than one or two tokens?


I'm working with a large 1M-document corpus and have applied several transformations when creating a document-feature matrix (dfm) from it:

library(quanteda)
corpus_dfm <- dfm(tokens(corpus1M), # where corpus1M is already a corpus via quanteda::corpus()
                  remove = stopwords("english"),
                  #what = "word", #experimented if adding this made a difference
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  ngrams = 1:2,
                  dictionary = lut_dict,
                  stem = TRUE)

Then to look at the resulting features:

dimnames(corpus_dfm)$features
[1] "abandon"                                      
[2] "abandoned auto"                               
[3] "abandoned vehicl"
...
[8] "accident hit and run"
...
[60] "assault no weapon aggravated injuri" 

Why are these features longer than the 1- and 2-grams I specified? Stemming appears to have been successful, but the features look like whole phrases or sentences rather than single words or bigrams.

I tried adjusting my code to dfm(tokens(corpus1M, what = "word")), but there was no change.

I tried to make a tiny reproducible example:

library(tidyverse) # just for the pipe here
example_text <- c("the quick brown fox",
                  "I like carrots",
                  "the there that etc cats dogs") %>% corpus

Then if I apply the same dfm as above:

> dimnames(corpus_dfm)$features
[1] "etc."

This was surprising because nearly all of the words have been removed, even stopwords (unlike before), so I'm even more confused! I'm also now unable to create a reproducible example despite just trying to. Maybe I've misunderstood how this function works?

How can I create a dfm in quanteda where there are only 1:2 word tokens and where stopwords are removed?

1 Answer

Answer by Ken Benoit (accepted):

First question: Why are the feature (names) in the dfm so long?

Answer: Because the application of the dictionary in the dfm() call replaces the matches to your unigram and bigram features with the dictionary keys, and (many of) the keys in your dictionary consist of multiple words. Example:

lut_dict[70:72]
# Dictionary object with 3 key entries.
# - assault felony:
#     - asf
# - assault misdemeanor:
#     - asm
# - assault no weapon aggravated injury:
#     - anai
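
To see that mechanism in isolation, here is a minimal sketch with a made-up two-entry dictionary (mini_dict below is invented for illustration; it is not the full lut_dict): tokens_lookup() replaces every match with its dictionary key, so a multi-word key shows up verbatim as a single feature name.

library(quanteda)
mini_dict <- dictionary(list(
  "assault no weapon aggravated injury" = "anai",
  "abandoned vehicle" = "abandoned auto"
))
toks <- tokens("an anai was logged next to an abandoned auto")
# each match is replaced by its (possibly multi-word) key
featnames(dfm(tokens_lookup(toks, mini_dict)))
# expect the two keys themselves as the feature names, e.g.
# [1] "assault no weapon aggravated injury" "abandoned vehicle"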

Second question: In the reproducible example, why are almost all of the words gone?

Answer: Because the only match of a dictionary value to the features in the dfm was to the "etc." category.

corpus_dfm2 <- dfm(tokens(example_text), # same options as the original call, applied to example_text
                  remove = stopwords("english"),
                  remove_punct = TRUE,
                  remove_numbers = TRUE,
                  remove_symbols = TRUE,
                  dictionary = lut_dict,
                  ngrams = 1:2,
                  stem = TRUE, verbose = TRUE)
corpus_dfm2
# Document-feature matrix of: 3 documents, 1 feature (66.7% sparse).
# 3 x 1 sparse Matrix of class "dfmSparse"
#        features
# docs    etc.
#   text1    0
#   text2    0
#   text3    1

lut_dict["etc."]
# Dictionary object with 1 key entry.
# - etc.:
#     - etc
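
In other words, applying a dictionary is exclusive by default: any feature that does not match a dictionary value is dropped rather than kept. A small sketch of that behaviour using dfm_lookup() and a made-up single-entry dictionary (toy_dict here stands in for the relevant entry of lut_dict):

toy_dict <- dictionary(list("etc." = "etc"))
d <- dfm(tokens(example_text, remove_punct = TRUE))
# exclusive = TRUE (the default) keeps only features that matched a key
dfm_lookup(d, toy_dict, exclusive = TRUE)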

If you do not apply the dictionary, then you see:

dfm(tokens(example_text),   # the "tokens" is not necessary here
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    ngrams = 1:2,
    stem = TRUE)
# Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
# 3 x 18 sparse Matrix of class "dfmSparse"
#        features
# docs    quick brown fox the_quick quick_brown brown_fox like carrot i_like
#   text1     1     1   1         1           1         1    0      0      0
#   text2     0     0   0         0           0         0    1      1      1
#   text3     0     0   0         0           0         0    0      0      0
#        features
# docs    like_carrot etc cat dog the_there there_that that_etc etc_cat cat_dog
#   text1           0   0   0   0         0          0        0       0       0
#   text2           1   0   0   0         0          0        0       0       0
#   text3           0   1   1   1         1          1        1       1       1

If you want to keep the features that are not matched by the dictionary, then replace the dictionary argument with thesaurus. Below, you will see that the "etc" token has been replaced with the upper-cased key "ETC.":

dfm(tokens(example_text), 
    remove = stopwords("english"),
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE,
    thesaurus = lut_dict,
    ngrams = 1:2,
    stem = TRUE)
# Document-feature matrix of: 3 documents, 18 features (66.7% sparse).
# 3 x 18 sparse Matrix of class "dfmSparse"
#        features
# docs    quick brown fox the_quick quick_brown brown_fox like carrot i_like
#   text1     1     1   1         1           1         1    0      0      0
#   text2     0     0   0         0           0         0    1      1      1
#   text3     0     0   0         0           0         0    0      0      0
#        features
# docs    like_carrot cat dog the_there there_that that_etc etc_cat cat_dog ETC.
#   text1           0   0   0         0          0        0       0       0    0
#   text2           1   0   0         0          0        0       0       0    0
#   text3           0   1   1         1          1        1       1       1    1
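
Finally, to get what the question asked for, a dfm whose features are only one- or two-token ngrams with stopwords removed, one sketch (assuming a reasonably current quanteda, where these steps are done on the tokens object) is simply to leave the dictionary out, so that no feature is replaced by a long key:

library(quanteda)
toks <- tokens(example_text,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_remove(toks, stopwords("english"))  # drop stopwords before forming ngrams
toks <- tokens_wordstem(toks)                      # stem each token
toks <- tokens_ngrams(toks, n = 1:2)               # unigrams and bigrams only
featnames(dfm(toks))                               # every feature is at most two tokens long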