LDA cross validation and variable selection

I have a data frame with 395 observations and 36 variables. I am using cross-validation to select the few variables that best classify the students' grades. I have written this code:

k<-5
error <- c()
for(l in 1:35){
  if(l!=31 && l!=32 && l!=33){
    x<-0
    for (i in 1:k){
      train<-rep(TRUE, dim(student.mat)[1])
      for(j in 1:dim(student.mat)[1]/k){
        train[(i-1)*dim(student.mat)[1]/k+j]<-FALSE
      }
      test=!train
      student.test=student.mat[test,]
      student.train=student.mat[train,]
      nota3.test=nota3[test]
      lda.fit<-lda(nota3~student.mat[,i], data=student.mat, subset=train)
      lda.pred<-predict(lda.fit, student.test)
      table(lda.pred$class, nota3.test)
      y<-mean(lda.pred$class!=nota3.test)
      x<-x+y
      #cat("k = ", i, "error: ", y*100,"%", "\n")
    }
    #cat("Media del error = ", x/k*100,"%", "\n")
    error <- c(error, x/k)
  }else{
    error <- c(error, 100)
  }
}
error
names(student.mat)[which.max(error)]

and I get this error:

Error in table(lda.pred$class, nota3.test) : all arguments must have the same length
Also: lost warning messages
'newdata' had 79 rows but variables found have 395 rows

However, if I write the name of one variable from the data set instead of student.mat[,i], it works. The lda function doesn't seem to read student.mat[,i] correctly.

There is 1 answer

josliber

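You can create the formula programmatically and convert it with as.formula, so that lda receives a formula that refers to the column by name:

lda.fit <- lda(as.formula(paste0("nota3 ~ ", names(student.mat)[l])), data = student.mat, subset = train)

With nota3 ~ student.mat[,i], the predictor is always evaluated on the full 395-row student.mat, so predict cannot take it from the 79 rows of newdata, which is what the warning reports. A column referenced by name is looked up in newdata instead. Note also that the column index should be the outer loop variable l; i is the fold index, so student.mat[,i] would only ever use the first five columns.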
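For reference, here is a minimal, self-contained sketch of per-variable cross-validation with a formula built from the column name. It is not the asker's exact setup: the built-in iris data stands in for student.mat, Species stands in for nota3, and the names dat, response, predictors, fold and cv_error are illustrative assumptions. It also assigns rows to folds at random rather than in consecutive blocks.

```r
library(MASS)  # provides lda()

# Stand-in data (assumption): iris replaces student.mat, Species replaces nota3.
dat <- iris
response <- "Species"
predictors <- setdiff(names(dat), response)

set.seed(1)  # reproducible fold assignment
k <- 5
# Randomly assign each row to one of k folds of (nearly) equal size.
fold <- sample(rep(seq_len(k), length.out = nrow(dat)))

# For each candidate variable, average the misclassification rate over the k folds.
cv_error <- sapply(predictors, function(v) {
  # Build the formula from the column name so predict() can find it in newdata.
  f <- as.formula(paste(response, "~", v))
  fold_err <- sapply(seq_len(k), function(i) {
    test <- fold == i
    fit  <- lda(f, data = dat[!test, ])
    pred <- predict(fit, newdata = dat[test, ])
    mean(pred$class != dat[[response]][test])
  })
  mean(fold_err)
})

cv_error
# The best single variable is the one with the lowest error, hence which.min.
names(which.min(cv_error))
```

Because each formula names its column, predict evaluates the predictor on the held-out rows only, so the predicted classes and the test labels have the same length.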