I have a text classification problem with a text feature 'DESCRIPTION' and output variable 'TYPE', which has three different values.
I perform the preprocessing steps and then run the model but I get the following error when I run the confusion matrix (at the end of the code).
Error in confusionMatrix.default(clothing_reviews_test$TYPE, as.factor(pred2)) : the data cannot have more levels than the reference
What is the problem in the code which cause this error?
Code is below
d=read.csv("SONAR_RULES.csv", stringsAsFactors = TRUE)
d$DESCRIPTION= as.character(d$DESCRIPTION)
d= d[, !names(d) %in% c("REMEDIATION_GAP_MULT", "REMEDIATION_FUNCTION", "REMEDIATION_BASE_EFFORT") ]
d$REMEDIATION_FUNCTION=NULL
d$DEF_REMEDIATION_GAP_MULT=NULL
d$REMEDIATION_BASE_EFFORT=NULL
glimpse(d)
p= function(x) {sum(is.na(x))/length(x)*100} ## checking missing values % in columns
apply(clothing_reviews_train, 2, p) ## checking missing values % in columns
head(d)
set.seed(42)
idx <- createDataPartition(d$TYPE,
p = 0.7,
list = FALSE,
times = 1)
clothing_reviews_train <- d[ idx,]
clothing_reviews_test <- d[-idx,]
stem_tokenizer <- function(x) {
lapply(word_tokenizer(x),
SnowballC::wordStem,
language = "en")
}
stop_words = tm::stopwords(kind = "en")
# create prunded vocabulary
vocab_train <- itoken(clothing_reviews_train$DESCRIPTION,
preprocess_function = tolower,
tokenizer = stem_tokenizer,
progressbar = FALSE)
v <- create_vocabulary(vocab_train,
stopwords = stop_words)
pruned_vocab <- prune_vocabulary(v,
doc_proportion_max = 0.99,
doc_proportion_min = 0.01)
vectorizer_train <- vocab_vectorizer(pruned_vocab)
# preprocessing function
create_dtm_mat <- function(text, vectorizer = vectorizer_train) {
vocab <- itoken(text,
preprocess_function = tolower,
tokenizer = stem_tokenizer,
progressbar = FALSE)
dtm <- create_dtm(vocab,
vectorizer = vectorizer)
tfidf = TfIdf$new()
fit_transform(dtm, tfidf)
}
dtm_train2 <- create_dtm_mat(clothing_reviews_train$DESCRIPTION)
dtm_test2 <- create_dtm_mat(clothing_reviews_test$DESCRIPTION)
str(dtm_train2)
xgb_model2 <- xgb.train(params = list(max_depth = 10,
eta = 0.2,
objective = "binary:logistic",
eval_metric = "error", nthread = 1),
data = xgb.DMatrix(as.matrix(dtm_train2),
label = clothing_reviews_train$TYPE == "1"),
nrounds = 500)
pred2 <- predict(xgb_model2, dtm_test2)
confusionMatrix(clothing_reviews_test$TYPE,
as.factor(pred2))
Glimpse of data
glimpse(d)
Rows: 1,819
Columns: 14
$ PLUGIN_RULE_KEY <fct> InsufficientBranchCoverage, InsufficientLineCo~
$ PLUGIN_CONFIG_KEY <fct> , , , , , , , , , , S1120, , , , StringEqualit~
$ PLUGIN_NAME <fct> common-java, common-java, common-java, common-~
$ DESCRIPTION <chr> "An issue is created on a file as soon as the ~
$ SEVERITY <fct> MAJOR, MAJOR, MAJOR, MAJOR, MAJOR, MAJOR, MINO~
$ NAME <fct> "Branches should have sufficient coverage by t~
$ DEF_REMEDIATION_FUNCTION <fct> LINEAR, LINEAR, LINEAR, LINEAR_OFFSET, LINEAR,~
$ DEF_REMEDIATION_GAP_MULT <fct> 5min, 2min, 2min, 10min, 10min, 10min, , , , ,~
$ DEF_REMEDIATION_BASE_EFFORT <fct> , , , 10min, , , 5min, 5min, 15min, 15min, 1mi~
$ GAP_DESCRIPTION <fct> "number of uncovered conditions", "number of l~
$ SYSTEM_TAGS <fct> "bad-practice", "bad-practice", "convention", ~
$ IS_TEMPLATE <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ DESCRIPTION_FORMAT <fct> HTML, HTML, HTML, HTML, HTML, HTML, HTML, HTML~
$ TYPE <fct> CODE_SMELL, CODE_SMELL, CODE_SMELL, CODE_SMELL~