Error when fitting a parallel binary (logistic) regression on a sparse matrix with glmnet


I want to fit a parallelised logistic ridge regression with the glmnet package. My data is a large sparse matrix (10 million observations and around 60k columns).

I ran a small trial on a subset of the data (a subset of both the observations and the columns) and it worked. The following code is equivalent to what I am doing:

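library(Matrix)
library(glmnet)
library(doMC)

# for reproducibility
set.seed(18)
# register the parallel backend used by cv.glmnet(parallel = TRUE)
registerDoMC(cores = 2)

# toy 50-row sparse design matrix (duplicated (i, j) entries are summed)
sparseMat <- sparseMatrix(i = rep(1:50, 4),
                          j = sample(20, 200, replace = TRUE),
                          x = rep(1, 200))
# random two-level response
y <- as.factor(sample(2, 50, replace = TRUE))

# cross-validated ridge (alpha = 0) logistic regression, folds run in parallel
cvfit <- cv.glmnet(x = sparseMat, y = y, standardize = FALSE,
                   family = "binomial", alpha = 0, parallel = TRUE)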

However, when I pass in the whole matrix, the process crashes with the following error message:

Error in max(sapply(outlist, function(obj) min(obj$lambda))) : 
invalid 'type' (list) of argument

I am not sure what causes the error, and I do not understand what the error message is pointing to.

I am using R on an RStudio Linux server with 8 cores.
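For what it is worth, the same message can be reproduced in isolation. The sketch below assumes, without any confirmation from the real data, that the list of per-fold fits (`outlist` in the error) came back empty, for example because the parallel workers died before returning anything. In that case `sapply()` returns an empty list rather than a numeric vector, and `max()` rejects it:

```r
# Hypothetical stand-in for the per-fold fits; an empty list is an assumption
outlist <- list()

# sapply() over an empty list returns list(), not a numeric vector
res <- sapply(outlist, function(obj) min(obj$lambda))
print(is.list(res))

# max() cannot handle a list, which triggers the message from the question
msg <- tryCatch(max(res), error = function(e) conditionMessage(e))
print(msg)
```

If this reading applies, the message would reflect missing fold results rather than anything about the lambda values themselves.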

sessionInfo():

R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doMC_1.3.3      iterators_1.0.7 glmnet_2.0-2    foreach_1.4.2   Matrix_1.1-5   

UPDATE I:

As I cannot share the data that generates the error (for confidentiality reasons), and my attempts to reproduce it generated memory overflows rather than the error shown, I will reformulate the question:

Is the error message I got memory-related, or is it related to something else?

Given the size of the dataset, a memory-related error is a possibility. However, the error message seems to point to an internal issue when taking the minimum of the lambda values of each fit. If it is not a memory issue, how should I proceed? Is there a workaround?
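One way to narrow this down is to fit the training folds one at a time, without the parallel backend, and record which of them fail. The following is only a diagnostic sketch on the toy data from above: `nfolds`, `foldid` and `fold_ok` are illustrative names, and unlike `cv.glmnet` each fit here chooses its own lambda sequence, so it checks whether a fold can be fitted at all rather than reproducing the cross-validation exactly.

```r
library(Matrix)
library(glmnet)

set.seed(18)
sparseMat <- sparseMatrix(i = rep(1:50, 4),
                          j = sample(20, 200, replace = TRUE),
                          x = rep(1, 200))
y <- as.factor(sample(2, 50, replace = TRUE))

# Illustrative fold assignment: each row gets a fold number from 1 to nfolds
nfolds <- 10
foldid <- sample(rep(seq_len(nfolds), length.out = nrow(sparseMat)))

# Fit each training fold sequentially and record whether the fit succeeded
fold_ok <- vapply(seq_len(nfolds), function(k) {
  train <- foldid != k
  fit <- tryCatch(
    glmnet(sparseMat[train, , drop = FALSE], y[train],
           family = "binomial", alpha = 0, standardize = FALSE),
    error = function(e) e
  )
  if (inherits(fit, "error")) message("fold ", k, ": ", conditionMessage(fit))
  !inherits(fit, "error")
}, logical(1))

print(fold_ok)
```

If every fold fits sequentially on the full data but the parallel run still fails, that would point towards the workers (memory per forked process, for instance) rather than the model itself.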
