Calculating stats for random subsample using R loop


I am trying to find a way in R to randomly subset some data (the proportion of suitable habitat in an area, for an ecological study), calculate the mean and the proportion of samples with values > 0, and then save or append those values to a data frame. I then want to repeat this a number of times (1000 for the example). Standard bootstrapping or resampling packages won't work, as I need to calculate the frequency of occurrence as well as the mean of the subsample. I'm aware of the "apply" functions, but those loop over the entire data frame, whereas I'm trying to do it on a repeated subsample. I know I need some code to save and output the values calculated in the loop, but I'm having issues. "habprop" is the column in a data frame ("data") for which I want to calculate and save the mean and the proportion of positive values.

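for (i in 1:1000) {
  # draw 50 rows at random, without replacement
  randsample <- data[sample(1:nrow(data), 50, replace = FALSE), ]
  m <- mean(randsample$habprop)
  randsamplepos <- subset(randsample, habprop > 0)
  # proportion of sampled rows with habprop > 0
  habfreq <- nrow(randsamplepos) / nrow(randsample)
}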

There are 2 answers

Rorschach On

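This should be possible with the boot package. Note that boot() resamples with replacement at the full size of the data, rather than drawing 50 rows without replacement.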

dat <- data.frame(habprop=rnorm(100))

## Function to return statistics from subsamples
stat <- function(dat, inds)
    with(dat, c(mu=mean(habprop[inds]), freq=sum(habprop[inds] > 0)/length(inds)))

library(boot)
boot(data=dat, statistic=stat, R=1000)

# Bootstrap Statistics :
#        original      bias    std. error
# t1* -0.06154533 -0.00324393  0.08377116
# t2*  0.52000000 -0.00073000  0.04853991
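If the goal is a data frame with one row per replicate, rather than just the printed summary, the individual replicate values can be pulled out of the object that boot() returns. The following is a minimal sketch of that step, assuming the same simulated `dat` as above; the seed and the names `b` and `reps` are only illustrative.

```r
library(boot)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
dat <- data.frame(habprop = rnorm(100))

# Same statistics as above: mean and proportion of positive values
stat <- function(dat, inds)
    with(dat, c(mu = mean(habprop[inds]), freq = mean(habprop[inds] > 0)))

b <- boot(data = dat, statistic = stat, R = 1000)

# b$t is a matrix with one row per replicate and one column per statistic;
# b$t0 holds the statistics computed on the original data
reps <- as.data.frame(b$t)
names(reps) <- c("mu", "freq")
head(reps)
```

Each row of `reps` is one bootstrap replicate, so it can be summarised or written out like any other data frame.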
ajb On

How about the replicate function? This post looks pretty similar.

Generating some data to work on

data <- data.frame(x1=rpois(5000, 5), x2=runif(5000), x3=rnorm(5000))

Defining a function to sample and take means and counts

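sample_stats <- function(df, n = 100) {
  # draw n rows at random, without replacement
  df <- df[sample(seq_len(nrow(df)), n, replace = FALSE), ]
  mx1 <- mean(df$x1[df$x1 > 0])  # mean of the positive x1 values only
  x1pos <- sum(df$x1 > 0)        # count of positive x1 values
  c(mx1, x1pos)
}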

Run it once just to see the output

sample_stats(data)

Run it 1000 times

results <- replicate(1000, sample_stats(data, n=100))
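The same replicate() pattern can be adapted to what the question describes: 50 rows drawn without replacement, the mean of habprop, and the proportion of positive values, collected in a data frame. This is only a sketch under assumptions: the `data` below is made-up stand-in data, and `subsample_stats` and `results_df` are hypothetical names.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Stand-in data (an assumption): habprop in [0, 1], about a third zeros
data <- data.frame(habprop = pmax(0, runif(500, -0.5, 1)))

subsample_stats <- function(df, n = 50) {
  # draw n values of habprop at random, without replacement
  x <- df$habprop[sample(nrow(df), n, replace = FALSE)]
  c(mean = mean(x), habfreq = mean(x > 0))
}

results <- replicate(1000, subsample_stats(data, n = 50))

# replicate() returns a 2 x 1000 matrix here; transpose it so that
# each row is one subsample and each column is one statistic
results_df <- as.data.frame(t(results))
head(results_df)
```

Naming the elements of the returned vector carries through to the column names of `results_df`, which avoids having to label the columns afterwards.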