Apologies in advance if this is a slightly abstract question.
I have some modular data analysis code that, in outline, looks like this:
source("load_analysis_settings.R")
source("load_and_clean_data.R")
source("analyse_data.R")
The analysis in analyse_data.R actually runs on hundreds of subsets of the data, using a for loop (unfortunately the steps are too numerous and too complicated, including fitting some brms models, for me to successfully parallelise using apply).
When I try to do this in one sitting, R/RStudio quickly runs out of memory (or something, I don't know enough about what's going on under the hood) and slows to a crawl, and it can take literally days to run.
The workaround I've used so far is to manually break up the hundreds of subsets of data into more manageable chunks, saving the output periodically, restarting R/RStudio to run the next chunk, and then recombining them all at the end. Basically:
source("load_analysis_settings.R")
source("load_and_clean_data.R")
subset_data_to_part_1
source("analyse_data.R")
save_analysis_output_part_1
(Rinse and repeat for part 2, 3, etc. until done, then manually recombine at the end)
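For concreteness, the splitting, saving and recombining is along these lines. This is a simplified sketch rather than my real code: `subset_ids`, `chunk_size`, the file names and the stand-in "results" are all placeholders.

```r
# Placeholder subset IDs; the real analysis has hundreds of these.
subset_ids <- sprintf("subset_%03d", 1:10)
chunk_size <- 4  # arbitrary, for illustration only

# Assign consecutive IDs to parts 1, 2, 3, ...
parts <- split(subset_ids, ceiling(seq_along(subset_ids) / chunk_size))

# Save one output file per part (stand-in results, not a real analysis).
out_dir <- tempfile("analysis_parts_")
dir.create(out_dir)
for (part in seq_along(parts)) {
  part_output <- data.frame(part = part, subset_id = parts[[part]])
  saveRDS(part_output, file.path(out_dir, sprintf("part_%02d.rds", part)))
}

# Recombine everything at the end.
part_files <- list.files(out_dir, pattern = "^part_[0-9]+\\.rds$", full.names = TRUE)
combined <- do.call(rbind, lapply(part_files, readRDS))
```

In the real workflow each part is run in its own R session, so only the save step and the final recombine step look like this.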
This is much faster than trying to do it all at once (e.g. I could run 4 parts in 3 hours each, 12 hours total, where trying to do them all at once would take about 48-72 hours). However, it requires manual intervention through restarting RStudio and incrementing the part # between running each part.
In effect, I'm manually functioning as an overarching for loop:
for (part in seq_along(parts)) {
  specify_part_settings
  run_analysis_code
  manually_restart_R
}
recombine_results
I'm wondering if there's some way to automate this process with an actual for loop so it doesn't require manual intervention for restarting. Incrementing the settings is easy, but I don't know how to get the speed benefits of actually restarting (memory-clearing? dunno what else is going on) with each iteration. I can add rm() calls for any large outputs, plus gc(), to the overarching loop, but I've tried using gc() in the workflow before and it doesn't seem to help that much.
What else is going on memory-wise when RStudio restarts, and how could I get those actions into the overarching for loop? Or is there a way to build an actual restart into it?
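To make that last option concrete, the kind of thing I have in mind is a driver script that launches each part in a fresh R process via Rscript, so that whatever the process was holding is released when it exits. Below is a rough sketch only. The function `run_part_in_fresh_process()` and the tiny worker script are hypothetical stand-ins I made up for illustration (the real worker would source my three scripts), and I don't know whether this would give the same speed-up as restarting RStudio.

```r
# Hypothetical helper: run one part in a brand-new R process and
# return the path of the file that process wrote.
run_part_in_fresh_process <- function(part, script, out_dir) {
  rscript <- file.path(R.home("bin"), "Rscript")
  out_file <- file.path(out_dir, sprintf("part_%03d.rds", part))
  status <- system2(
    rscript,
    args = c("--vanilla", shQuote(script), part, shQuote(out_file))
  )
  if (status != 0) stop("Part ", part, " failed with exit status ", status)
  out_file
}

# Stand-in worker script (assumption: the real one would source
# load_analysis_settings.R, load_and_clean_data.R and analyse_data.R).
worker <- tempfile(fileext = ".R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)",
  "part <- as.integer(args[1])",
  "result <- data.frame(part = part, value = part^2)  # dummy output",
  "saveRDS(result, args[2])"
), worker)

out_dir <- tempfile("parts_")
dir.create(out_dir)

# The overarching loop: each iteration starts and ends its own R process.
files <- vapply(1:3, run_part_in_fresh_process, character(1),
                script = worker, out_dir = out_dir)

# Recombine the per-part outputs in the driver session.
combined <- do.call(rbind, lapply(files, readRDS))
```

The driver session itself only ever holds the file paths and the final combined results, so it should stay small no matter how many parts there are.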