Using text2vec for multilabel classification


I want to know whether the text2vec package can be used for multilabel classification, like Python's BinaryRelevance in skmultilearn.problem_transform. I'm currently following the pipeline documented at: http://text2vec.org/vectorization.html


There is 1 answer

Sam S. On

You can use text2vec to create a document-term matrix (DTM) by following http://text2vec.org/vectorization.html. Once the DTM is ready, you can pass it to a classifier. xgboost is a good choice for this step; a multi-class setup with it is discussed in https://rpubs.com/mharris/multiclass_xgboost.
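For the vectorization step, a minimal sketch along the lines of the linked text2vec page might look like the following. The toy documents and the names train_docs and test_docs are assumptions made up for illustration; substitute your own text.

```r
library(text2vec)

# Hypothetical toy documents; replace with your own corpus.
train_docs <- c("the cat sat on the mat",
                "dogs chase cats",
                "stock prices fell sharply")
test_docs  <- c("the dog sat", "prices rose")

# Tokenize the training text and build the vocabulary from it only.
it_train <- itoken(train_docs, preprocessor = tolower,
                   tokenizer = word_tokenizer, progressbar = FALSE)
vocab      <- create_vocabulary(it_train)
vectorizer <- vocab_vectorizer(vocab)

# Reuse the same vectorizer so train and test share identical columns.
dtm_train <- create_dtm(it_train, vectorizer)
it_test   <- itoken(test_docs, preprocessor = tolower,
                    tokenizer = word_tokenizer, progressbar = FALSE)
dtm_test  <- create_dtm(it_test, vectorizer)

dim(dtm_train)  # documents x vocabulary terms
```

Building the vocabulary from the training documents alone keeps test-set words from leaking into the feature space; unseen test words are simply dropped.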

library(xgboost)

# dtm_train is the training matrix obtained by text2vec
# dtm_test is the testing matrix obtained by text2vec
# label_train holds the labels for dtm_train; it should be a factor
# label_train <- factor(label_train, labels = classes)

nclass <- 3  # how many classes you have
param <- list("objective" = "multi:softmax",  # multi-class classification (one class per document)
              "num_class" = nclass,           # number of classes
              "eval_metric" = "mlogloss",     # multi-class log loss
              "nthread" = 8,                  # number of threads to use
              "max_depth" = 16,               # maximum depth of each tree
              "eta" = 0.3,                    # step size shrinkage (learning rate)
              "gamma" = 0,                    # minimum loss reduction required to make a split
              "subsample" = 0.7,              # fraction of rows sampled for each tree
              "colsample_bytree" = 1,         # fraction of columns sampled for each tree
              "min_child_weight" = 12         # minimum sum of instance weight needed in a child
)

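# xgboost expects integer class labels in 0..(nclass - 1), so convert the factor.
# The sparse matrix from text2vec can be passed directly; as.matrix() would
# densify it and can exhaust memory on a large vocabulary.
bst <- xgboost(
  params = param,
  data = dtm_train,
  label = as.integer(label_train) - 1,
  nrounds = 200)

# Make predictions on the testing data.
# multi:softmax returns 0-based class indices; map them back to the factor levels.
pred <- predict(bst, dtm_test)
pred_class <- levels(label_train)[pred + 1]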
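Note that multi:softmax assigns exactly one class to each document. If a document can carry several labels at once, one common approach (the one BinaryRelevance takes) is to fit an independent binary classifier per label on the same matrix. Here is a rough sketch, not a definitive implementation: the label matrix Y, the label names, and the toy data standing in for a real DTM are all assumptions made for illustration.

```r
library(xgboost)
library(Matrix)

set.seed(1)
# Toy stand-in for a text2vec DTM: 40 documents x 6 terms (hypothetical data).
x   <- matrix(rpois(40 * 6, lambda = 1), nrow = 40)
dtm <- Matrix(x + 0, sparse = TRUE)

# Hypothetical 0/1 indicator matrix: one column per label,
# and a document may have any number of labels set to 1.
Y <- cbind(sports   = as.numeric(x[, 1] > 0),
           politics = as.numeric(x[, 2] > 0))

params <- list(objective = "binary:logistic",
               max_depth = 3, eta = 0.3, nthread = 1)

# Binary relevance: train one independent binary model per label.
models <- lapply(colnames(Y), function(lbl) {
  dtrain <- xgb.DMatrix(data = dtm, label = Y[, lbl])
  xgb.train(params = params, data = dtrain, nrounds = 20, verbose = 0)
})
names(models) <- colnames(Y)

# One probability column per label; threshold each label separately.
prob        <- sapply(models, function(m) predict(m, xgb.DMatrix(data = dtm)))
pred_labels <- (prob > 0.5) * 1
head(pred_labels)
```

Each column of pred_labels is decided on its own, so a row can end up with zero, one, or several labels. The 0.5 threshold is an arbitrary starting point and can be tuned per label.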

Hope this helps.

Please let me know if you need further explanation.