I've been experimenting with implementing cross-validation in R, but I ran into a strange result with the fold-wise predictions from leave-one-out cross-validation (LOOCV).
First I randomly generate the data as well as the labels, then I fit a randomForest on what should be pure noise. From the loop I get not only a good AUC but a significant p-value from a t-test. I don't understand how this could be happening in theory, so I'm wondering whether the ways I tried to generate the data/labels are sound.
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)

n <- 30   # observations
p <- 900  # predictors (pure noise)
set.seed(3)

XX <- matrix(rnorm(n * p, 0, 1), nrow = n)               # Gaussian noise predictors
YY <- as.factor(sample(c('P', 'C'), n, replace = TRUE))  # random class labels

resp <- numeric(n)
for (i in 1:n) {
  # leave observation i out, train on the rest
  fit <- randomForest(XX[-i, ], YY[-i])
  # drop = FALSE keeps the held-out row as a 1 x p matrix;
  # column 2 is the predicted probability of the second factor level, "P"
  resp[i] <- predict(fit, XX[i, , drop = FALSE], type = "prob")[, 2]
}

t.test(resp ~ YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating the data, all of which lead to the same result:
XX <- matrix(runif(n * p), nrow = n)
XX <- matrix(rnorm(n * p, 0, 1), nrow = n)
and
random_data <- matrix(0, n, p)
for (i in 1:n) {
  # jitter with amount = 10 adds U(-10, 10) noise to the uniform draws
  random_data[i, ] <- jitter(runif(p), factor = 1, amount = 10)
}
XX <- random_data  # already a matrix, so as.matrix() is redundant
Since the randomForest is finding relevant predictors in this scenario, that leads me to believe the data may not be truly random. Is there a better way I could generate the data, or the random labels? Is it possible that this is an issue with R?
This is a partial answer: I modified your roc() call so that the resulting AUC values can actually range over the whole interval from 0 to 1. (By default, roc() uses direction = "auto", which chooses the direction of comparison from the data and thereby tends to push the AUC of even a null classifier above 0.5.) Then I ran the experiment 20 times. The mean AUC and p-value are 0.73 and 0.12, respectively. Improved, but still better than random...
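For concreteness, here is a minimal sketch of the kind of run I mean. The helper name run_once is mine, and I'm assuming the relevant modification is pinning down roc()'s levels and direction arguments up front rather than letting pROC pick them from the data:

library(randomForest)
library(pROC)

# One full LOOCV experiment on pure-noise data; returns the AUC and t-test p-value.
run_once <- function(seed, n = 30, p = 900) {
  set.seed(seed)
  XX <- matrix(rnorm(n * p), nrow = n)
  YY <- as.factor(sample(c('P', 'C'), n, replace = TRUE))
  resp <- numeric(n)
  for (i in 1:n) {
    fit <- randomForest(XX[-i, ], YY[-i])
    resp[i] <- predict(fit, XX[i, , drop = FALSE], type = "prob")[, 2]
  }
  # controls = 'C', cases = 'P'; direction = "<" fixes the orientation a priori
  # instead of letting roc() pick whichever one happens to give the larger AUC
  auc_val <- roc(YY, resp, levels = c('C', 'P'), direction = "<")$auc
  c(auc = as.numeric(auc_val), pval = t.test(resp ~ YY)$p.value)
}

res <- sapply(1:20, run_once)  # 20 repetitions with different seeds
rowMeans(res)                  # average AUC and p-value across runs

With direction = "auto", roc() effectively gets to choose the more favorable orientation on every run, which by itself can lift the average AUC of a null classifier above 0.5.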