I have not been able to find an example of bootstrapping correlations between two data frames. Key posts I have looked at are (1) https://stats.stackexchange.com/questions/20701/computing-p-value-using-bootstrap-with-r/ (2) How to bootstrap correlation using vectorised function applied to large matrix?
NB: the p-values here need to be obtained by calculating the proportion of bootstrap replicates in which the absolute bootstrap test statistic exceeded the theoretical one.
Fortunately, I have also come across a recent example of using Map to compute the correlations between two data frames: How to use lapply to replace nested for loop to get correlations between two data frames?
I have large datasets that will be run on a Unix-based HPC, and I also need a Windows option for running the calculations on smaller datasets in R.
D1 <- data.frame(matrix(runif(10*10, 0, 2), ncol=10))
D2 <- data.frame(matrix(runif(10*16, 0, 2), ncol=16))
colnames(D1) <- paste0("a", 1:ncol(D1))
colnames(D2) <- paste0("b", 1:ncol(D2))
compare <- expand.grid(colnames(D1), colnames(D2))
library(dplyr)  # provides %>% and mutate()
need_modify <- Map(function(x, y) cor.test(D1[, x], D2[, y], method = "spearman"), compare$Var1, compare$Var2) %>%
  lapply(`[`, c('estimate', 'statistic', 'p.value')) %>%
  sapply(unlist) %>% t() %>% as.data.frame() %>% mutate(Var1 = compare$Var1, Var2 = compare$Var2)
# resample the rows of a data frame with replacement
boot_df <- function(x) x[sample(nrow(x), replace = TRUE), ]
# number of bootstrap replicates
R <- 100
I am stuck on modifying the above so that it runs successfully using parallelisation on a Unix-based OS (mclapply or mcMap), with a separate version for Windows (clusterMap or future_mapply).
Grateful for any pointers in the right direction or an example elsewhere.
Technically, you could resample B times with replacement, extract the "statistic"s, and, following the formula in Davison & Hinkley (1997), calculate a Monte Carlo p-value: count how many times the actual statistic is exceeded and divide by B (using a bias correction by adding +1 to both the numerator and the denominator). Of course, we can combine everything and simplify to:
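A minimal sketch of such a combined function for a single pair of vectors, following the description above. The name `boot_p` and its arguments are hypothetical and not the original answer's code; `exact = FALSE` is an added assumption to avoid tie warnings, since resampling with replacement creates ties.

```r
# Hypothetical helper (not from the original answer): Monte Carlo p-value
# for one pair of numeric vectors, following the description above.
boot_p <- function(x, y, B = 100, method = "spearman") {
  # test statistic reported by cor.test(); exact = FALSE because
  # resampled data contain ties
  stat <- function(x, y) {
    unname(cor.test(x, y, method = method, exact = FALSE)$statistic)
  }
  t_obs <- stat(x, y)
  # resample observation indices with replacement, keeping pairs together
  t_boot <- replicate(B, {
    i <- sample(length(x), replace = TRUE)
    stat(x[i], y[i])
  })
  # share of bootstrap statistics at least as large in absolute value as
  # the observed one, with +1 in numerator and denominator
  (sum(abs(t_boot) >= abs(t_obs), na.rm = TRUE) + 1) / (B + 1)
}

# usage on simulated data
set.seed(1)
x <- runif(20)
y <- runif(20)
p <- boot_p(x, y, B = 50)
```

By construction the result lies between 1/(B + 1) and 1, so it can never be exactly zero.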
Note that this answer focuses on how to do it technically. Read Davison & Hinkley (1997) or consult a statistician if you really want it to be statistically sound.
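For the parallel part of the question, one possible sketch uses only the base `parallel` package: `mcmapply` (fork-based) on Unix-alikes and `clusterMap` on a socket cluster on Windows. The helper `pair_boot_p`, the core count `n_cores`, and the seed values are assumptions chosen for illustration; the simulated data mirror the question's `D1` and `D2`.

```r
library(parallel)

set.seed(42)
D1 <- data.frame(matrix(runif(10 * 10, 0, 2), ncol = 10))
D2 <- data.frame(matrix(runif(10 * 16, 0, 2), ncol = 16))
colnames(D1) <- paste0("a", seq_len(ncol(D1)))
colnames(D2) <- paste0("b", seq_len(ncol(D2)))
compare <- expand.grid(Var1 = colnames(D1), Var2 = colnames(D2),
                       stringsAsFactors = FALSE)

# Hypothetical helper: bootstrap p-value for one column pair (v1 from D1,
# v2 from D2). The data frames are passed explicitly so that socket
# workers on Windows receive them.
pair_boot_p <- function(v1, v2, D1, D2, B = 100) {
  x <- D1[[v1]]
  y <- D2[[v2]]
  stat <- function(x, y) {
    unname(cor.test(x, y, method = "spearman", exact = FALSE)$statistic)
  }
  t_obs <- stat(x, y)
  t_boot <- replicate(B, {
    i <- sample(length(x), replace = TRUE)  # resample rows jointly
    stat(x[i], y[i])
  })
  (sum(abs(t_boot) >= abs(t_obs), na.rm = TRUE) + 1) / (B + 1)
}

n_cores <- 2L  # assumption; adjust to the machine or HPC allocation
args <- list(D1 = D1, D2 = D2, B = 100)

if (.Platform$OS.type == "unix") {
  # forked workers; L'Ecuyer streams for parallel-safe random numbers
  RNGkind("L'Ecuyer-CMRG")
  p_boot <- mcmapply(pair_boot_p, compare$Var1, compare$Var2,
                     MoreArgs = args, mc.cores = n_cores)
} else {
  # socket cluster; forking is not available on Windows
  cl <- makeCluster(n_cores)
  clusterSetRNGStream(cl, iseed = 42)
  p_boot <- clusterMap(cl, pair_boot_p, compare$Var1, compare$Var2,
                       MoreArgs = args, SIMPLIFY = TRUE)
  stopCluster(cl)
}

result <- cbind(compare, p_boot = unname(p_boot))
head(result)
```

Parallelising over column pairs rather than over bootstrap replicates keeps each task self-contained, which seems the simpler choice when the number of pairs is much larger than the number of cores.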
Data: