Fitting multivariate (multilevel) model with 200 effect sizes (rma.mv with optimParallel) fails in metafor

I have the following data structure, with approximately i = 50 studies, j = 75 experiments, and k = 200 conditions.

On level k I have the dependent measures. For about 20 studies (25 experiments and 65 conditions) I have data at the subject level and calculated the variance-covariance matrix directly. For the rest I constructed a variance-covariance matrix from estimated correlations (for subjects and conditions). Finally, I have a complete k x k variance-covariance matrix V.

To respect the multilevel structure of the data, I let every condition in every experiment in every study have its own covariance, using an unstructured variance-covariance matrix (see Details - Specifying Random Effects). Note that I am not 100% sure about this reasoning, or about reasoning for/against assumed variance-covariance structures in multilevel models in general, so I am happy to receive thoughts/literature on this...

I now want to conduct a multivariate (multilevel) random effects model with:

rma.mv(
    yi = yk                            # effect sizes on level k
  , V = V                              # complete k x k variance-covariance matrix
  , random = list(~ exp_j | stu_i,     # experiments within studies
                  ~ con_k | exp_j)     # conditions within experiments
  , struct = "UN"                      # unstructured var-cov matrix
  , method = "REML"
  , test = "t"                         ## slightly mimics knha
  , data = dat
  , slab = con_k
  , control = list(optimizer = "optimParallel", ncpus = 32)
)

When run on the complete data set, the calculation reaches 128 GB(!) of RAM within a few minutes, and at some point R just terminates without an error message.

1) Is this to be expected with the amount of data I have?

Running the same model on a subset of the original data (i.e., i = 20, j = 25, and k = 65; I just grabbed the data without estimated variance-covariance matrices) works fine and peaks at ~20 GB of RAM.

I saw the tips section of the metafor package as well as the optimization options for rma.mv() in the notes. 2) In my scenario, would switching to Microsoft R Open or to another optimization algorithm (without parallelisation?!) be reasonable?

Note that the model above is not the final model I want to fit. No moderators are included yet; additional model(s) will include regression terms for moderators. It will become even more complex, I guess...

I am running R version 3.6.3 (2020-02-29) on x86_64-pc-linux-gnu (64-bit) under Ubuntu 18.04.5 LTS. metafor is at version 2.4-0.

Best, Jonas

1 Answer

Wolfgang (accepted answer)

Probably not every study has 75 experiments and not every experiment has 200 conditions, but yes, 50 * 75 * 200 (i.e., 750,000) rows of data would be a problem. However, before I address this issue, let's start with the model itself, which makes little sense. With 75 experiments within those 50 studies, using ~ exp_j | stu_i with struct="UN" implies that you are trying to estimate the variances and covariances of a 75 x 75 var-cov matrix. That's 2850 parameters already. The ~ con_k | exp_j part adds yet another 20,000+ parameters by my calculation. This is never going to work.
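To see where those numbers come from: an unstructured p x p var-cov matrix has p variances plus p*(p-1)/2 covariances, i.e., p*(p+1)/2 free parameters. A quick check (n_pars is just an illustrative helper, not part of metafor):

n_pars <- function(p) p * (p + 1) / 2  # p variances + p*(p-1)/2 covariances
n_pars(75)   # 2850  parameters for ~ exp_j | stu_i with struct="UN"
n_pars(200)  # 20100 parameters for ~ con_k | exp_j with struct="UN"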

Based on your description, you have a multilevel structure, but there is no inherent link between what experiment 1 in study 1 stands for and what experiment 1 in study 2 stands for. So the experiment identifier is just used here to distinguish the different experiments within studies, but carries no further meaning. Compare this with the situation where you have, for example, outcomes A and B in study 1, outcome A in study 2, outcome B in study 3, and so on. 'A' really stands for 'A' in all studies and is not just used to distinguish the elements.

Another issue is that ~ con_k | exp_j will not automatically be nested within studies. The rma.mv() function also allows for crossed random effects, so if you want to add random effects for conditions that are in turn nested within studies, then you should create a new variable, for example exp.in.study, that reflects this. You could do this with dat$exp.in.study <- paste0(dat$stu_i, ".", dat$exp_j). Then you can use ~ con_k | exp.in.study to reflect this nesting.
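A minimal sketch of that recoding, using the variable names from the question:

## make the experiment identifier unique within studies, so that
## conditions are treated as nested within experiments within studies:
dat$exp.in.study <- paste0(dat$stu_i, ".", dat$exp_j)  # e.g. "1.1", "1.2", "2.1", ...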

However, based on your description, what I think you really should use is a much simpler model structure, namely random = ~ 1 | stu_i / exp_j / con_k (in that case, the struct argument is not relevant).
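A sketch of that call, assuming the same data and variable names as in your model above:

res <- rma.mv(yi = yk, V = V,
              random = ~ 1 | stu_i / exp_j / con_k,  # random intercepts at each level
              method = "REML", test = "t",
              data = dat, slab = con_k)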

Still, if your dataset has 100,000+ rows, then the default way rma.mv() works will become a memory issue, because internally the function will then juggle around with matrices of such dimensions. A simple solution to this is to use sparse=TRUE, in which case the matrices are stored internally as sparse structures. You probably don't even need any parallel processing then, but you could try whether optimizer="optimParallel" speeds things up (in that case, ncpus=3 is all you need, because that is the number of variance components the model will actually estimate if it is specified as suggested above).
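Putting this together, a sketch of the same call as above, now with sparse matrices and (optionally) the parallel optimizer:

res <- rma.mv(yi = yk, V = V,
              random = ~ 1 | stu_i / exp_j / con_k,
              method = "REML", test = "t",
              data = dat, slab = con_k,
              sparse = TRUE,  # store model matrices internally as sparse structures
              control = list(optimizer = "optimParallel", ncpus = 3))  # 3 variance components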