I'm writing some R extensions in C (C functions to be called from R).
My code needs to compute a statistic using 2 different datasets at the same time, and I need to do this for all possible pair combinations. Then, I need all these statistics (very large arrays) to continue the calculation on the C side. The data files are very large, typically ~40GB, and that's my problem.
To do this in C called from R, I first have to load all the datasets in R and then pass them to the C function call. Ideally, though, only 2 of those files would need to be in memory at the same time. If I could access the datasets directly from C or Fortran, I would follow this sequence:
open file1 - open file2 - compute cov(1,2)
close file2
hold file1 - open file3 - compute cov(1,3)
... // same approach
This is fine in R because I can load/unload files, but when calling C or Fortran I don't have any mechanism to load/unload files. So my question is: can I read .Rdata files directly from Fortran or C, opening and closing them as needed? Are there any other approaches to the problem?
As far as I've read, the answer is no, so I'm considering moving from Rdata to HDF5.
It is not too hard to call R functions from C code that is invoked through the .Call interface. So write an R function that reads the data, and invoke it from C. When you're done with one file, UNPROTECT() the data you've read in.
A simpler approach is to invert the problem: read two data files in R, then call C with the data.
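If the data can be exported to a format that C reads directly, the hold-one, load-one sequence from the question can also be done entirely on the C side. Below is a minimal sketch of that idea, not a definitive implementation. It assumes a hypothetical flat binary format (a headerless array of doubles per file, not .Rdata or HDF5), treats each dataset as a single vector, and uses a scalar sample covariance as a stand-in for the real statistic. The names save_doubles, load_doubles, covariance and pairwise_cov are made up for illustration.

```c
#include <stdio.h>
#include <stdlib.h>

/* ASSUMPTION: each file is a raw array of doubles written with fwrite(),
   with no header. This is a hypothetical format, not .Rdata or HDF5. */

/* Write n doubles to path. Returns 0 on success, -1 on failure. */
int save_doubles(const char *path, const double *x, size_t n)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    size_t written = fwrite(x, sizeof *x, n, f);
    if (fclose(f) != 0 || written != n) return -1;
    return 0;
}

/* Read a whole file into a malloc'd buffer that the caller must free.
   Returns NULL on failure. Note: ftell() returns a long, so files of
   ~40GB need a platform with 64-bit long (or a 64-bit offset API). */
double *load_doubles(const char *path, size_t *n_out)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
    long bytes = ftell(f);
    if (bytes <= 0) { fclose(f); return NULL; }
    rewind(f);
    size_t n = (size_t)bytes / sizeof(double);
    double *x = malloc(n * sizeof *x);
    if (x && fread(x, sizeof *x, n, f) != n) { free(x); x = NULL; }
    fclose(f);
    if (x) *n_out = n;
    return x;
}

/* Sample covariance of two vectors of length n (requires n >= 2). */
double covariance(const double *x, const double *y, size_t n)
{
    double mx = 0.0, my = 0.0, s = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;
    for (size_t i = 0; i < n; i++) s += (x[i] - mx) * (y[i] - my);
    return s / (double)(n - 1);
}

/* Fill the k*k row-major matrix out with all pairwise covariances,
   keeping at most two datasets in memory at any time.
   Returns 0 on success, -1 on any read error or length mismatch. */
int pairwise_cov(const char *const *paths, size_t k, double *out)
{
    for (size_t i = 0; i < k; i++) {
        size_t ni = 0;
        double *xi = load_doubles(paths[i], &ni);      /* hold file i */
        if (!xi || ni < 2) { free(xi); return -1; }
        out[i * k + i] = covariance(xi, xi, ni);
        for (size_t j = i + 1; j < k; j++) {
            size_t nj = 0;
            double *xj = load_doubles(paths[j], &nj);  /* open file j */
            if (!xj || nj != ni) { free(xj); free(xi); return -1; }
            double c = covariance(xi, xj, ni);
            out[i * k + j] = c;
            out[j * k + i] = c;
            free(xj);                                  /* close file j */
        }
        free(xi);                                      /* release file i */
    }
    return 0;
}
```

Each file is read about k/2 times on average, which trades I/O for memory. Whether that is acceptable for ~40GB files depends on the storage. The same loop structure should carry over if load_doubles is swapped for an HDF5 reader or for an R function evaluated from C.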