I want to perform a parallelised logistic ridge-regression with the glmnet package. My data is a big sparse matrix (10 million observations and around 60k columns).
A small trial on a subset of the data (fewer observations and fewer columns) worked. The following code is equivalent to what I am doing:
library(Matrix)
library(glmnet)
library(doMC)
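# for reproducibility
set.seed(18)
# register the parallel backend with 2 cores
registerDoMC(cores = 2)
# toy sparse design matrix and binary response
sparseMat <- sparseMatrix(i = rep(1:50, 4), j = sample(20, 200, replace = TRUE), x = rep(1, 200))
y <- as.factor(sample(2, 50, replace = TRUE))
# ridge (alpha = 0) logistic regression, cross-validated in parallel
cvfit <- cv.glmnet(x = sparseMat, y = y, standardize = FALSE, family = "binomial", alpha = 0, parallel = TRUE)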
However, when I pass in the whole matrix, the process fails with the following error message:
Error in max(sapply(outlist, function(obj) min(obj$lambda))) :
invalid 'type' (list) of argument
I am not sure what causes the error, and I do not understand what the error message is pointing to.
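To make the message itself more concrete, here is a minimal base-R sketch of one way the expression in the error can fail in this manner. It assumes that outlist is a list of per-fold fits and, purely for illustration, that it ends up empty; the name per_fold_min is hypothetical, and nothing here confirms that this is what happens on the full data.

```r
# Hypothetical stand-in for the per-fold results; an empty list is an
# assumption made only to illustrate the message, not a confirmed cause.
outlist <- list()

# sapply() over a zero-length input returns list() rather than a numeric vector
per_fold_min <- sapply(outlist, function(obj) min(obj$lambda))
print(is.list(per_fold_min))

# max() cannot handle a list, so it raises an error; capture the message
res <- tryCatch(max(per_fold_min), error = function(e) conditionMessage(e))
print(res)
```

If this is the mechanism, the message would say more about the shape of the per-fold results than about the lambda values themselves.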
I am using R on an RStudio Server instance running Linux, with 8 cores.
sessionInfo():
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doMC_1.3.3 iterators_1.0.7 glmnet_2.0-2 foreach_1.4.2 Matrix_1.1-5
UPDATE I:
I cannot share the data that generates the error (confidentiality issues), and my attempts to reproduce it ran out of memory rather than producing the error shown, so I will reformulate the question:
Is the error message I got memory related, or is it related to something else?
Given the size of the dataset, a memory-related error is a possibility. However, the error message seems to point to an internal issue related to having more than one minimum within the lambda values. If it is not a memory issue, how should I proceed? Is there a workaround?
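One check that might help separate the two explanations is to rerun the same call without the parallel backend. The sketch below does this on the toy data from above, with parallel = FALSE as the only change. It rests on an assumption: if the parallel workers are failing, a sequential run may surface a more direct error. Whether a sequential run is feasible on the full matrix is untested, and cvfit_seq is just an illustrative name.

```r
library(Matrix)
library(glmnet)

# Same toy data as in the example above
set.seed(18)
sparseMat <- sparseMatrix(i = rep(1:50, 4), j = sample(20, 200, replace = TRUE), x = rep(1, 200))
y <- as.factor(sample(2, 50, replace = TRUE))

# Identical call, except that the folds are fitted one after another
# in the main R process instead of on the registered workers
cvfit_seq <- cv.glmnet(x = sparseMat, y = y, standardize = FALSE,
                       family = "binomial", alpha = 0, parallel = FALSE)

# Inspect the selected penalty to confirm the fit completed
print(cvfit_seq$lambda.min)
```

If the sequential run on a larger subset fails with an explicit allocation error, that would suggest memory; if it reproduces the same max(sapply(...)) message, the parallel backend is probably not the cause.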