I have the following data set (a sample of the first 10 rows is given):
structure(list(variableA = c(11L, 7L, 17L, 7L, 7L, 2L,
2L, 7L, 7L, 4L), variableB = c(10L, 20L, 4L, 0L, 0L, 1L,
1L, 0L, 0L, 2L), variableC = c(284L,
43L, 19L, 0L, 0L, 27L, 27L, 0L, 0L, 20L), variableD = c(299L,
24L, 28L, 167L, 167L, 27L, 27L, 194L, 194L, 21L), variableE = c(2,
1, 1, 1, 1, 1, 1, 1, 1, 1), variableF1 = c(0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), variableF2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), variableF3 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF4 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF5 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF6 = c(1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF7 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF8 = c(0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF9 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableF10 = c(0L,
0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), variableG1 = c(1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), variableG2 = c(0L,
0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L), variableG3 = c(0L,
1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), clusters = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 6L, 6L), .Label = c("1", "2", "3",
"4", "5", "6"), class = "factor"), out = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 6L, 6L), .Label = c("3", "1", "2", "4",
"5", "6"), class = "factor")), row.names = c(1L, 3L, 4L, 5L,
6L, 8L, 9L, 12L, 13L, 14L), class = "data.frame")
I have been trying to use the support vector machine algorithm on this data set. It was working well earlier, but now for some reason it gives an error.
The model I am trying is:
library(caret)  # provides trainControl() and train()
set.seed(111)
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_Linear <- train(out ~ `variableA` + `variableB` +
                      `variableC` + `variableD` +
                      `variableE` + `variableF1` +
                      `variableF2` + `variableF3` +
                      `variableF4` + `variableF5` +
                      `variableF6` + `variableF7` +
                      `variableF8` + `variableF9` +
                      `variableF10` + `variableG1` +
                      `variableG2` + `variableG3`,
                    data = train, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)
svm_Linear
But I am getting this error, which I am not able to understand:
Error: One or more factor levels in the outcome has no data: '2'
I saw a similar post on this site, but none of its answers gave me what I needed.
Your out column is a factor with 6 levels, but only 3 are represented in the dput you provided in your post; that's why you're getting this error. This is probably due to the way you performed your train/test split.
You can redefine levels(out) to include only c(1, 3, 6), but this will be a problem if your test data contains the other response levels. Consider using a stratified sampling approach instead, to ensure your response variable is correctly represented across a train/test split. Questions about stratified sampling would be more appropriate for Cross Validated than for Stack Overflow, but there are some good starting points mentioned in this SO post and this one.
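As a rough illustration of both options, here is a minimal sketch using a made-up data frame; df, x, train_set, test_set and subset_df are hypothetical names, not objects from the question. It assumes the caret package is installed and uses its createDataPartition() function, which samples within each level of a factor outcome. The last part shows droplevels() from base R as one way to remove empty levels.

```r
library(caret)  # assumed installed; provides createDataPartition()

set.seed(111)

# Hypothetical toy data: a six-level outcome with 20 rows per class.
df <- data.frame(
  x   = rnorm(120),
  out = factor(rep(c("1", "2", "3", "4", "5", "6"), each = 20))
)

# Stratified split: rows are sampled within each level of df$out,
# so every class is represented in the training set.
idx       <- createDataPartition(df$out, p = 0.7, list = FALSE)
train_set <- df[idx, ]
test_set  <- df[-idx, ]

table(train_set$out)  # inspect the per-class counts before training

# Fallback: if a subset really lacks some classes, drop the empty
# levels before calling train(). The model cannot predict them later.
subset_df     <- df[df$out %in% c("1", "3", "6"), ]
subset_df$out <- droplevels(subset_df$out)
levels(subset_df$out)
```

Running table() on the outcome column of whatever data frame is passed to train() is a quick way to spot a level with zero rows before the model is fitted.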