Random data generation leading to good prediction on random labels

I've been experimenting with implementing cross-validation (CV) in R, but ran into a strange problem with the values returned across folds in leave-one-out CV (LOOCV).

First I randomly generate data and labels, then I fit a randomForest on what should be pure noise. From the loop's output I get not only a good AUC but also a significant p-value from a t-test. I don't understand how this could happen in theory, so I'm wondering whether the way I generated the data and labels is flawed.

Here is a code snippet that shows my issue.

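library(randomForest)
library(pROC)

n = 30    # number of samples
p = 900   # number of noise features

set.seed(3)
XX = matrix(rnorm(n*p, 0, 1), nrow=n)
YY = as.factor(sample(c('P', 'C'), n, replace=TRUE))   # factor levels are 'C', 'P'
resp = numeric(n)

for (i in 1:n) {
  # leave sample i out, fit on the rest, predict P(class 'P') for sample i
  fit = randomForest(XX[-i,], YY[-i])
  resp[i] = predict(fit, XX[i, , drop=FALSE], type="prob")[1, 'P']
}

# compare the held-out scores between the two label groups
t.test(resp~YY)$p.value

roc(YY, resp)$auc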

I tried multiple ways of generating the data, all of which give the same result:

XX=matrix(runif(n*p), nrow=n)
XX=matrix(rnorm(n*p, 0, 1) , nrow=n)

and

random_data=matrix(0, n, p)
for(i in 1:n){
  random_data[i,]=jitter(runif(p), factor = 1, amount = 10)
}
XX=as.matrix(random_data)

Since the randomForest seems to be finding relevant predictors in this scenario, I suspect that the data may not be truly random. Is there a better way to generate the data or the random labels? Is it possible that this is an issue with R?
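
One way to probe the suspicion that the generated data is at fault is to take the features out of the picture entirely. The sketch below is not part of the original post: it swaps the forest for a hypothetical stand-in that never looks at XX and simply scores each held-out sample with the share of 'P' labels in its training fold. The AUC is computed by hand with a fixed orientation ('P' expected to score higher), which is an assumption of this sketch and not how roc() behaves by default.

```r
# Hypothetical stand-in for the forest: it ignores the features and scores
# the held-out sample with the share of 'P' labels in the training fold.
set.seed(3)
n <- 30
YY <- factor(sample(c("P", "C"), n, replace = TRUE))

resp <- numeric(n)
for (i in seq_len(n)) {
  resp[i] <- mean(YY[-i] == "P")
}

# Mean score per true class: held-out 'P' samples get the lower score,
# because removing a 'P' leaves one fewer 'P' in the training fold.
tapply(resp, YY, mean)

# AUC as P(score of a 'P' > score of a 'C'), ties counted as 1/2
is_p <- YY == "P"
auc_p_higher <- mean(outer(resp[is_p], resp[!is_p], ">") +
                     0.5 * outer(resp[is_p], resp[!is_p], "=="))
auc_p_higher
```

If the held-out scores separate the classes even here, the separation comes from the leave-one-out split itself and not from the random number generator. Whether the same mechanism accounts for the randomForest result above is not established by this sketch.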

There is 1 answer

Answer by user2605553

This is a partial answer: I modified your roc call to fix the comparison direction (direction='>'), so that the AUC values can fall anywhere between 0 and 1. Then I ran it 20 times. The mean AUC and mean p-value are 0.73 and 0.12, respectively. Improved, but still better than random...

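library(randomForest)
library(pROC)

n = 30
p = 900
n_rep = 20   # number of repetitions of the whole experiment

pvs = numeric(n_rep)
aucs = numeric(n_rep)
for (j in seq_len(n_rep)) {
    # fresh noise features and random labels for every repetition
    XX = matrix(rnorm(n*p, 0, 1), nrow=n)
    YY = as.factor(sample(c('C', 'P'), n, replace=TRUE))
    resp = numeric(n)
    for (i in 1:n) {
        # leave sample i out, fit on the rest, predict P(class 'P') for sample i
        fit = randomForest(XX[-i,], YY[-i])
        resp[i] = predict(fit, XX[i, , drop=FALSE], type="prob")[1, 'P']
    }
    pvs[j] = t.test(resp~YY)$p.value
    # fixed direction instead of pROC's default direction='auto'
    aucs[j] = roc(YY, resp, direction='>')$auc
}

# averages over the repetitions
mean(aucs)
mean(pvs)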
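
The effect of fixing the direction can be illustrated without randomForest or pROC. The sketch below is an illustration under stated assumptions, not the code used for the numbers above: auc_fixed and auc_auto are hypothetical helpers, the scores are plain uniform noise, the labels are a balanced permutation (unlike the sample() call above), and auc_auto only approximates pROC's documented default of choosing the direction from the group medians.

```r
# AUC with a fixed direction: P(score of a case > score of a control),
# computed from the Mann-Whitney rank statistic (ties count as 1/2).
auc_fixed <- function(labels, scores, case = "P") {
  is_case <- labels == case
  n1 <- sum(is_case)
  n0 <- sum(!is_case)
  r <- rank(scores)
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Approximation of an automatically chosen direction: flip the AUC when
# the control median is higher than the case median.
auc_auto <- function(labels, scores, case = "P") {
  a <- auc_fixed(labels, scores, case)
  is_case <- labels == case
  if (median(scores[!is_case]) > median(scores[is_case])) 1 - a else a
}

set.seed(1)
n <- 30
sims <- t(replicate(500, {
  labels <- sample(rep(c("C", "P"), each = n / 2))  # balanced random labels
  scores <- runif(n)                                # scores unrelated to labels
  c(fixed = auc_fixed(labels, scores), auto = auc_auto(labels, scores))
}))

# Average AUC on pure noise under each convention
colMeans(sims)
```

With a fixed direction the noise AUCs scatter on both sides of 0.5, while a direction chosen after looking at the scores tends to land above it. That makes an averaged AUC from the fixed-direction call easier to interpret.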