I'm trying to fit a logistic regression model to my data, using glmnet (for lasso) and caret (for k-fold cross-validation). I've tried two different syntaxes, but they both throw an error:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           verboseIter = TRUE)
# with response as an integer (0/1)
fit_logistic <- train(response ~ .,
                      data = df_without,
                      method = "glmnet",
                      trControl = fitControl,
                      family = "binomial")
Error in cut.default(y, breaks, include.lowest = TRUE) :
invalid number of intervals
df_without$response <- as.factor(df_without$response)
# with response as a factor
fit_logistic <- train(as.matrix(df_without[1:47]), df_without$response,
                      method = "glmnet",
                      trControl = fitControl,
                      family = "binomial")
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NAs introduced by coercion
Do I need to convert my dataframe to a matrix or not?
Does my response variable need to be a factor or just 0/1 integers?
The .Rdata file with the df_without data frame is here.
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel splines stats graphics grDevices utils datasets methods base
other attached packages:
[1] e1071_1.6-4 plyr_1.8.2 gbm_2.1.1 survival_2.38-1 glmnet_2.0-2 foreach_1.4.2
[7] Matrix_1.2-0 caret_6.0-47 ggplot2_1.0.1 lattice_0.20-31 lubridate_1.3.3 RJDBC_0.2-5
[13] rJava_0.9-6 DBI_0.3.1
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 compiler_3.2.0 nloptr_1.0.4 class_7.3-12 iterators_1.0.7
[6] tools_3.2.0 digest_0.6.8 lme4_1.1-7 memoise_0.2.1 nlme_3.1-120
[11] gtable_0.1.2 mgcv_1.8-6 brglm_0.5-9 SparseM_1.6 proto_0.3-10
[16] BradleyTerry2_1.0-6 stringr_1.0.0 gtools_3.5.0 grid_3.2.0 nnet_7.3-9
[21] minqa_1.2.4 reshape2_1.4.1 car_2.0-25 magrittr_1.5 scales_0.2.4
[26] codetools_0.2-11 MASS_7.3-40 pbkrtest_0.4-2 colorspace_1.2-6 quantreg_5.11
[31] stringi_0.4-1 munsell_0.4.2
The problem is that you have continuous variables in your dataset. glmnet needs factor or binary variables.
If you run your first lines of code and select a few non-continuous variables you will see that it runs as expected.
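A minimal sketch of that check, using made-up data because the column names of df_without are not shown in the post. The toy data frame, its column names, and the "exactly two distinct values" test for a binary column are all assumptions for illustration, and the resampling settings are reduced so it runs quickly:

```r
library(caret)  # method = "glmnet" also needs the glmnet package installed

set.seed(42)
# Hypothetical stand-in for df_without (real column names are unknown)
toy <- data.frame(
  flag_a   = rbinom(300, 1, 0.5),
  flag_b   = rbinom(300, 1, 0.3),
  flag_c   = rbinom(300, 1, 0.6),
  amount   = rnorm(300),          # continuous column, excluded below
  response = rbinom(300, 1, 0.5)
)

# Assumed heuristic: a predictor with exactly two distinct values is binary
predictors  <- setdiff(names(toy), "response")
is_binary   <- sapply(toy[predictors], function(col) length(unique(col)) == 2)
binary_cols <- predictors[is_binary]

# Factor response with valid level names so train() runs as classification
toy$response <- factor(toy$response, levels = c(0, 1), labels = c("no", "yes"))

# Smaller resampling scheme than in the question, only to keep the sketch fast
fitControl <- trainControl(method = "cv", number = 5)

# Fit on the binary predictors only
fit_subset <- train(toy[, binary_cols], toy$response,
                    method = "glmnet",
                    trControl = fitControl,
                    family = "binomial")
```

On the real data, the same selection step would be applied to df_without[1:47] instead of toy; printing binary_cols first shows which columns were kept.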