Logistic regression with caret and glmnet in R


I'm trying to fit a logistic regression model to my data, using glmnet (for lasso) and caret (for k-fold cross-validation). I've tried two different syntaxes, but they both throw an error:

fitControl <- trainControl(method = "repeatedcv",
                       number = 10,
                       repeats = 3,
                       verboseIter = TRUE)

# with response as an integer (0/1)
fit_logistic <- train(response ~.,
                   data = df_without,
                   method = "glmnet",
                   trControl = fitControl,
                   family = "binomial")

Error in cut.default(y, breaks, include.lowest = TRUE) : 
 invalid number of intervals

df_without$response <- as.factor(df_without$response)
# with response as a factor
fit_logistic <- train(as.matrix(df_without[1:47]), df_without$response,
              method = "glmnet",
              trControl = fitControl,
              family = "binomial")

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  NAs introduced by coercion

Do I need to convert my dataframe to a matrix or not?

Does my response variable need to be a factor or just 0/1 integers?

The .Rdata file with the df_without data frame is here.

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils         datasets  methods   base     

other attached packages:
 [1] e1071_1.6-4     plyr_1.8.2      gbm_2.1.1       survival_2.38-1     glmnet_2.0-2    foreach_1.4.2  
 [7] Matrix_1.2-0    caret_6.0-47    ggplot2_1.0.1   lattice_0.20-31     lubridate_1.3.3 RJDBC_0.2-5    
[13] rJava_0.9-6     DBI_0.3.1      

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6         compiler_3.2.0      nloptr_1.0.4            class_7.3-12        iterators_1.0.7    
 [6] tools_3.2.0         digest_0.6.8        lme4_1.1-7              memoise_0.2.1       nlme_3.1-120       
[11] gtable_0.1.2        mgcv_1.8-6          brglm_0.5-9             SparseM_1.6         proto_0.3-10       
[16] BradleyTerry2_1.0-6 stringr_1.0.0       gtools_3.5.0            grid_3.2.0          nnet_7.3-9         
[21] minqa_1.2.4         reshape2_1.4.1      car_2.0-25              magrittr_1.5        scales_0.2.4       
[26] codetools_0.2-11    MASS_7.3-40         pbkrtest_0.4-2          colorspace_1.2-6    quantreg_5.11      
[31] stringi_0.4-1       munsell_0.4.2  

There are 2 answers

phiver On

The problem is that you have continuous variables in your dataset. glmnet needs factor or binary variables.

If you run your first block of code on just a few non-continuous variables, you will see that it runs as expected.
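Before dropping columns, it may help to check what types the predictors actually have. The sketch below uses a small made-up data frame (df_demo and its columns are hypothetical, not the asker's data) to show that as.matrix() on a data frame containing any non-numeric column yields a character matrix, which is one plausible source of the "NAs introduced by coercion" warning.

```r
# Hypothetical toy data: one numeric predictor and one categorical predictor
df_demo <- data.frame(
  x1  = c(0.5, 1.2, -0.3),
  grp = c("a", "b", "a"),
  stringsAsFactors = TRUE
)

# Class of each column: anything other than numeric/integer needs encoding
sapply(df_demo, class)

# With a factor column present, the whole matrix becomes character
m <- as.matrix(df_demo)
is.character(m)

# Numeric-only columns convert to a numeric matrix
num_cols <- sapply(df_demo, is.numeric)
m_num <- as.matrix(df_demo[, num_cols, drop = FALSE])
is.numeric(m_num)
```

Running sapply(df_without, class) on the real data would show which of the 47 predictors are affected.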

felix000 On

I had the same problem and fixed it by using model.matrix to handle the coding of categorical variables.

Try this for the x argument in glmnet:

as.matrix(model.matrix(response ~ ., data = df_without)[, -1])

I removed the intercept column because the default in glmnet is to include an intercept.
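Putting it together, here is a minimal sketch of the full workflow. It assumes a data frame with a 0/1 column named response and a mix of numeric and categorical predictors; df_demo, its columns, and the "no"/"yes" labels are made up for illustration, and the fold and repeat counts are reduced only to keep the example quick.

```r
library(caret)   # the glmnet package must also be installed

set.seed(42)

# Hypothetical stand-in for df_without: two numeric predictors, one categorical
n <- 200
df_demo <- data.frame(
  x1  = rnorm(n),
  x2  = rnorm(n),
  grp = factor(sample(c("a", "b", "c"), n, replace = TRUE),
               levels = c("a", "b", "c"))
)
df_demo$response <- rbinom(n, 1, plogis(0.8 * df_demo$x1 - 0.5 * df_demo$x2))

# Dummy-code the categorical predictors and drop the intercept column
x <- model.matrix(response ~ ., data = df_demo)[, -1]

# A factor response makes caret treat the problem as classification.
# Labels that are valid R names are needed if classProbs = TRUE is used later.
y <- factor(df_demo$response, levels = c(0, 1), labels = c("no", "yes"))

fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

fit_demo <- train(x = x, y = y,
                  method = "glmnet",
                  trControl = fitControl,
                  family = "binomial")

# Selected alpha and lambda
fit_demo$bestTune
```

So to the two questions: with this approach the predictors go in as a numeric matrix, and the response goes in as a two-level factor rather than as 0/1 integers.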