I used the quanteda package to create dfm and tf-idf weighted dfm objects. I tried two ways to remove sparse features and produce trimmed dfms: one builds the dfm first and then removes sparse features with dfm_trim(); the other passes a sparsity argument directly to the dfm() function.
Approach 1: I first created the dfm objects from the train and test tokens, then applied dfm_tfidf() as follows.
dfmat_train <- dfm(traintokens)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)
Then I simply used dfm_trim() to remove sparse features.
dfmat_train_trimmed <- dfm_trim(dfmat_train, sparsity=0.98)
dfmat_train_trimmed_tfidf <- dfm_trim(dfmat_train_tfidf, sparsity=0.98)
dfmat_test_trimmed <- dfm_trim(dfmat_test, sparsity=0.98)
dfmat_test_trimmed_tfidf <- dfm_trim(dfmat_test_tfidf, sparsity=0.98)
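For reference, a common quanteda pattern for this workflow (a sketch only, not necessarily what was run here) is to trim the training dfm, align the test dfm to the training features with dfm_match(), and apply the tf-idf weighting last, so that both matrices share identical columns. The toy train/test split below uses quanteda's built-in inaugural corpus purely for illustration:

```r
library(quanteda)

# Toy split from quanteda's built-in inaugural-address corpus
corp <- data_corpus_inaugural
traintokens <- tokens(corpus_subset(corp, Year < 1990))
testtokens  <- tokens(corpus_subset(corp, Year >= 1990))

# Build and trim the training dfm
dfmat_train <- dfm(traintokens)
dfmat_train_trimmed <- dfm_trim(dfmat_train, sparsity = 0.98)

# Align the test dfm to the trimmed training feature set so the
# two matrices have exactly the same columns, in the same order
dfmat_test <- dfm(testtokens)
dfmat_test_matched <- dfm_match(dfmat_test,
                                features = featnames(dfmat_train_trimmed))

# Weight after trimming/matching, so the idf values are computed
# over the retained features only
dfmat_train_tfidf <- dfm_tfidf(dfmat_train_trimmed)
dfmat_test_tfidf  <- dfm_tfidf(dfmat_test_matched)
```

The point of dfm_match() is that a model fitted on the trimmed training dfm sees the same feature space at prediction time, regardless of which features happen to survive trimming in the test set.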
Approach 2 was shorter: the tf-idf weighting is done after trimming.
dfmat_train <- dfm(traintokens, sparsity = 0.98)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens, sparsity = 0.98)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)
After training models with both approaches and predicting on the test sets, Approach 1 gave identical prediction performance metrics for the tfidf and non-tfidf test data (Cohen's kappa between the two sets of predictions is 1). Approach 2 gave different predictions for tfidf and non-tfidf, but both were less accurate. I am puzzled. Which one is the right approach?
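One quick diagnostic (a sketch; it assumes the Approach 1 and Approach 2 training dfms are held in distinctly named objects, here dfmat_train_trimmed and dfmat_train2, and uses quanteda's nfeat() and featnames() accessors) is to compare the feature sets the two routes actually produce, since a mismatch there would explain the diverging predictions:

```r
library(quanteda)

# Number of features retained by each route
nfeat(dfmat_train_trimmed)  # Approach 1: dfm_trim(dfmat, sparsity = 0.98)
nfeat(dfmat_train2)         # Approach 2: dfm(tokens, sparsity = 0.98)

# How many features differ between the two trimmed dfms;
# 0 means both routes kept exactly the same vocabulary
length(setdiff(featnames(dfmat_train_trimmed), featnames(dfmat_train2)))
```

If the two feature counts differ substantially, the sparsity argument is not being applied the same way in both approaches, and that is where to look first.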