I've been experimenting with implementing cross-validation in R, but I ran into a strange result with the fold-wise predictions from leave-one-out cross-validation (LOOCV).
First I randomly generate the data as well as the labels, then I fit a randomForest on what should be pure noise. From the loop I get not only a good AUC but a significant p-value from a t-test. I don't understand how this could be happening in theory, so I'm wondering whether the ways I tried to generate the data/labels are sound.
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)

n <- 30   # observations
p <- 900  # predictors (pure noise)
set.seed(3)

XX <- matrix(rnorm(n * p, 0, 1), nrow = n)               # Gaussian noise predictors
YY <- as.factor(sample(c('P', 'C'), n, replace = TRUE))  # random class labels

resp <- numeric(n)
for (i in 1:n) {
  # leave observation i out, train on the rest
  fit <- randomForest(XX[-i, ], YY[-i])
  # drop = FALSE keeps the held-out row as a 1 x p matrix;
  # column 2 is the predicted probability of the second factor level, "P"
  resp[i] <- predict(fit, XX[i, , drop = FALSE], type = "prob")[, 2]
}

t.test(resp ~ YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating the data, all of which lead to the same result:
XX <- matrix(runif(n * p), nrow = n)
XX <- matrix(rnorm(n * p, 0, 1), nrow = n)
and
random_data <- matrix(0, n, p)
for (i in 1:n) {
  # jitter with amount = 10 adds U(-10, 10) noise to the uniform draws
  random_data[i, ] <- jitter(runif(p), factor = 1, amount = 10)
}
XX <- random_data  # already a matrix, so as.matrix() is redundant
Since the randomForest is finding relevant predictors in this scenario, that leads me to believe the data may not be truly random. Is there a better way I could generate the data, or the random labels? Is it possible that this is an issue with R?
This is a partial answer: I modified your roc() call so that the resulting AUC values can actually range over the whole interval from 0 to 1. (By default, roc() uses direction = "auto", which chooses the direction of comparison from the data and thereby tends to push the AUC of even a null classifier above 0.5.) Then I ran the experiment 20 times. The mean AUC and p-value are 0.73 and 0.12, respectively. Improved, but still better than random...
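For concreteness, here is a minimal sketch of the kind of run I mean. The helper name run_once is mine, and I'm assuming the relevant modification is pinning down roc()'s levels and direction arguments up front rather than letting pROC pick them from the data:

library(randomForest)
library(pROC)

# One full LOOCV experiment on pure-noise data; returns the AUC and t-test p-value.
run_once <- function(seed, n = 30, p = 900) {
  set.seed(seed)
  XX <- matrix(rnorm(n * p), nrow = n)
  YY <- as.factor(sample(c('P', 'C'), n, replace = TRUE))
  resp <- numeric(n)
  for (i in 1:n) {
    fit <- randomForest(XX[-i, ], YY[-i])
    resp[i] <- predict(fit, XX[i, , drop = FALSE], type = "prob")[, 2]
  }
  # controls = 'C', cases = 'P'; direction = "<" fixes the orientation a priori
  # instead of letting roc() pick whichever one happens to give the larger AUC
  auc_val <- roc(YY, resp, levels = c('C', 'P'), direction = "<")$auc
  c(auc = as.numeric(auc_val), pval = t.test(resp ~ YY)$p.value)
}

res <- sapply(1:20, run_once)  # 20 repetitions with different seeds
rowMeans(res)                  # average AUC and p-value across runs

With direction = "auto", roc() effectively gets to choose the more favorable orientation on every run, which by itself can lift the average AUC of a null classifier above 0.5.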