Should I trim dfm before or after applying tfidf?


I used the quanteda package to create dfm and dfm-tfidf objects. I tried two ways to remove sparse features and create trimmed dfms: the first by reducing sparsity with dfm_trim() on an existing dfm, the second by passing a sparsity argument directly to dfm().

Approach 1: I first created the dfm objects from the train and test tokens, then applied dfm_tfidf() to each as follows.

dfmat_train <- dfm(traintokens)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)

Then, I simply used dfm_trim to remove sparse features.

dfmat_train_trimmed <- dfm_trim(dfmat_train, sparsity = 0.98)
dfmat_train_trimmed_tfidf <- dfm_trim(dfmat_train_tfidf, sparsity = 0.98)
dfmat_test_trimmed <- dfm_trim(dfmat_test, sparsity = 0.98)
dfmat_test_trimmed_tfidf <- dfm_trim(dfmat_test_tfidf, sparsity = 0.98)
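For what it's worth, here is a quick sanity check I could run on the objects above (nfeat() and featnames() are standard quanteda accessors); it only compares the vocabularies kept by the weighted and unweighted trimmed training dfms, nothing more.

## do the trimmed dfm and trimmed tf-idf dfm keep the same vocabulary?
nfeat(dfmat_train_trimmed)
nfeat(dfmat_train_trimmed_tfidf)
identical(featnames(dfmat_train_trimmed), featnames(dfmat_train_trimmed_tfidf))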

Approach 2 was shorter: the tf-idf weighting is done after trimming.

dfmat_train <- dfm(traintokens, sparsity = 0.98)
dfmat_train_tfidf <- dfm_tfidf(dfmat_train)
dfmat_test <- dfm(testtokens, sparsity = 0.98)
dfmat_test_tfidf <- dfm_tfidf(dfmat_test)

After training models with both approaches and predicting on the test sets, Approach 1 gave identical prediction performance metrics for the tfidf and non-tfidf test data (Cohen's kappa = 1). Approach 2 gave different predictions for the tfidf and non-tfidf data, and both were less accurate. I am puzzled. Which one is the right approach?
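For reference, here is a minimal, self-contained sketch of how I imagine the two orderings could be compared. It uses quanteda's built-in data_corpus_inaugural purely as a stand-in for my own tokens, and dfm_a / dfm_b are just illustrative names, not objects from my pipeline.

library(quanteda)

## stand-in data: the built-in inaugural corpus instead of my train tokens
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
d <- dfm(toks)

## order A: weight with tf-idf first, then trim by sparsity (as in Approach 1)
dfm_a <- dfm_trim(dfm_tfidf(d), sparsity = 0.98)

## order B: trim by sparsity first, then weight with tf-idf (as in Approach 2)
dfm_b <- dfm_tfidf(dfm_trim(d, sparsity = 0.98))

## compare the vocabularies kept by the two orderings
nfeat(dfm_a)
nfeat(dfm_b)
length(setdiff(featnames(dfm_a), featnames(dfm_b)))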
