Random forest / caret package: cross-validation with repeated measures from an interventional trial


I'm seeking confirmation that my code for recursive feature elimination with a random forest, using the caret package in R, is correct. My dataset comes from a crossover clinical trial investigating the effects of two drugs, in which the same participants received both interventions. Participants were measured at four time points: week 1 (baseline 1), week 4 (post-first treatment), week 6 (baseline 2), and week 10 (post-second treatment).

To explore the data, I've computed the log2 fold change in each variable (feature) of interest between each post-treatment time point and its respective baseline (i.e., week 4 vs. week 1 and week 10 vs. week 6).
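For reference, here is a minimal sketch of how such a log2 fold change can be computed in base R. The `raw` data frame, its `week` column, and its values are made up for illustration and are not my real measurements; the point is only that each post-treatment week is paired with its own baseline week.

```r
# Hypothetical long-format measurements for two participants (made-up values)
raw <- data.frame(
  participant = c(1, 1, 1, 1, 2, 2, 2, 2),
  week        = c(1, 4, 6, 10, 1, 4, 6, 10),
  variable_1  = c(8, 16, 10, 5, 4, 4, 6, 12)
)

# Pair each post-treatment week with its own baseline week
pairs <- data.frame(baseline = c(1, 6), post = c(4, 10))

log2fc <- do.call(rbind, lapply(seq_len(nrow(pairs)), function(i) {
  base <- raw[raw$week == pairs$baseline[i], ]
  post <- raw[raw$week == pairs$post[i], ]
  # Match baseline and post-treatment rows within each participant
  merged <- merge(base, post, by = "participant", suffixes = c("_base", "_post"))
  data.frame(
    participant = merged$participant,
    visit       = paste0("Wk", pairs$post[i]),
    variable_1  = log2(merged$variable_1_post / merged$variable_1_base)
  )
}))

print(log2fc)
```

This yields one row per participant per treatment period, which is the shape of the dummy data further down.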

My objective is to use recursive feature elimination with a random forest to determine the minimum set of features needed for optimal accuracy. However, I'm unsure whether it's appropriate to use data from the same individuals at multiple time points (repeated measures) in this context. I've seen that the caret package also offers an LGOCV (leave-group-out cross-validation) option, but I'm not sure whether it would be a better choice.

To illustrate what I've done so far, I've provided some dummy code below. Please note that while my actual dataset includes 40 participants and approximately 1000 variables, the dummy dataset below contains only 15 participants (30 rows, one per participant per treatment) and 7 variables for simplicity.

# Create the dataframe
data <- data.frame(
  participant = c(1, 1, 131, 131, 137, 137, 141, 141, 145, 145, 149, 149, 150, 150, 151, 151, 153, 153, 155, 155, 37, 37, 41, 41, 42, 42, 45, 45, 47, 47),
  treatment = c("drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2", "drug_1", "drug_2"),
  visit = c("Wk10", "Wk4", "Wk4", "Wk10", "Wk4", "Wk10", "Wk4", "Wk10", "Wk10", "Wk4", "Wk4", "Wk10", "Wk4", "Wk10", "Wk10", "Wk4", "Wk4", "Wk10", "Wk10", "Wk4", "Wk10", "Wk4", "Wk10", "Wk4", "Wk10", "Wk4", "Wk10", "Wk4", "Wk10", "Wk4"),
  variable_1 = c(-0.2988943, -0.5652423, -8.5742484, 8.7377093, -6.5147514, 0.12, 0.12, -3.2976162, -0.6719856, -0.2035964, -4.1541421, -7.4718952, 1.472987, 0.12, 0.12, -7.8296141, -3.1094197, -1.2522009, 7.7021801, 0.12, 0.12, 0.12, -0.8054943, 0.2966514, 4.0092985, 7.5577521, 2.6471802, 1.5960873, 0.9594706, -5.1913805),
  variable_2 = c(-0.0917615478, 0.0343429872, 0.2844311406, -0.107225242, 0.1080518689, -0.348411313, 0.0380841868, 0.5028410745, 0.1271225613, 0.0805552861, -0.2264773945, -0.2348926436, -0.1797448622, 0.4007944115, 0.6490342119, 0.6011331044, 0.4770340526, 0.2519560783, -0.0008952533, -0.3603732427, -0.3491374389, -0.1050673759, -0.3334466272, 0.002026175, 0.2781471527, -0.0567692211, 0.2155655051, -0.1622794973, 0.4369879582, 0.027186356),
  variable_3 = c(0.16491095, 0.40996682, 0.37981662, -0.06779959, -0.00648989, -1.43417863, 0.82051403, -0.52134363, -0.13098574, -0.8967957, 0.31982869, 1.14984102, 0.18522299, 0.19590853, 0.37902238, 0.45233503, 0.55827176, -0.10572396, 0.52928305, -0.83166878, -0.23495859, 0.25552564, 0.82702325, 0.63169711, -0.06949792, -0.59649007, 0.24137394, -0.52288169, 0.35617362, -0.45137289),
  variable_4 = c(0.2611792, 0.43167106, 0.44507495, 0.3004693, 0.16412278, 0.13866044, -0.09607666, 0.2950418, 0.43214027, 0.22772244, 0.02310061, 0.12888121, -0.37969123, -0.37482215, 0.05788314, 0.14596032, 0.43424467, 0.32911201, -0.04292683, -0.02168623, 0.20221203, -0.17854298, 0.15251759, -0.02016351, 0.52233188, -0.44278942, 0.45364768, 0.54345488, 0.31248104, 0.13051185),
  variable_5 = c(-0.497772158, 0.053249799, 0.575857449, 0.424288376, -0.345631328, 0.449560174, 0.715486587, 0.118004043, 1.03116253, 1.520391406, 0.585537674, 0.348940052, 0.002841339, 0.172976086, -0.360819587, 0.11496142, 0.989906496, 0.350073575, 0.668875898, 0.349385843, 0.659227256, -0.993280307, 0.545194881, 0.396163127, 0.073841702, -0.708472565, 0.603995835, -0.462563191, -0.126101629, -0.259578254),
  variable_6 = c(-0.150055704, 0.565809818, 0.338459034, 0.398408758, 0.059983373, 0.602544612, 0.996066545, -0.074282435, 0.131827312, 0.105077055, 0.105418235, -0.262076545, 0.160120364, 0.062248289, 0.702139565, 0.465541719, 0.423150078, 0.45313191, 0.57668186, -0.049111172, 0.753235194, -0.302838058, 0.266912534, -0.104407197, -0.113579909, -0.149461752, -0.673984766, -0.677120304, -0.008428875, 0.214567471),
  variable_7 = c(0.33531613, 0.486037571, 0.002506895, 0.102378068, -0.126101629, 0.12, 0.293860198, -0.166554327, -0.461453456, 0.084527415, -0.23362978, -0.085721261, 0.216829539, 0.097091337, -0.099553222, -0.077715746, 0.37797829, 0.174421274, 0.383632364, -0.424361116, 0.172545974, 0.225863552, -0.235674731, 0.873942702, 0.86021603, 0.066224129, -0.173556605, 0.117305891, 0.106073968)
)

Then I created this function:

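library(caret)  # groupKFold(), rfeControl(), rfe(), rfFuncs
library(dplyr)  # select(), %>%

test_rf_function <- function(df) {
  set.seed(20)
  # Group the folds by participant so that a participant's rows stay together
  folds <- groupKFold(df$participant, k = length(unique(df$participant)))
  # `index` supplies the training rows for each resample
  control <- rfeControl(functions = rfFuncs, method = "LOOCV", number = 100, index = folds)
  df_temp <- select(df, -participant, -visit) %>% as.data.frame()
  # Column 1 is the outcome (treatment); the remaining columns are the predictors
  predictors <- df_temp[, 2:ncol(df_temp)]
  outcome <- as.factor(df_temp[, 1])
  # The original `2:ncol(df_temp)-1` parses as `(2:ncol(df_temp)) - 1`, i.e. 1:ncol(predictors)
  rfe_obj <- rfe(predictors, outcome, sizes = 1:ncol(predictors), rfeControl = control)
  return(rfe_obj)
}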

Because I've passed the participant ID to groupKFold, I wonder whether this overcomes the issue of having repeated measures in the data. Any advice is greatly appreciated.
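To make the concern concrete, here is a minimal base-R sketch of the property I want from the folds: every resample holds out all rows of one participant, so no participant appears in both the training and the held-out rows. It builds the folds by hand rather than with `groupKFold`, and the names `train_index`, `holdout_index`, and `leak` are my own for illustration. My understanding from the caret documentation is that a list of training row numbers in this shape is what the `index` argument of `rfeControl` expects, but I'd welcome correction on that.

```r
# Participant IDs for the first six rows of the dummy data
participant <- c(1, 1, 131, 131, 137, 137)

# Training rows for each resample: everything except one participant's rows
ids <- unique(participant)
train_index <- lapply(ids, function(id) which(participant != id))
names(train_index) <- paste0("Participant", ids)

# Held-out rows are the complement of each training set
holdout_index <- lapply(train_index, function(tr) {
  setdiff(seq_along(participant), tr)
})

# Check for leakage: does any participant sit on both sides of a split?
leak <- mapply(function(tr, ho) {
  length(intersect(participant[tr], participant[ho])) > 0
}, train_index, holdout_index)

print(any(leak))
```

The same check can be run on the list returned by `groupKFold` to confirm that it behaves this way on the real data.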

Many thanks!
