Discrepancies in variable importance calculation for glmnet model in R

I want to calculate variable importance for a glmnet model in R. I am using the glmnet package to fit an elastic net model as follows:

library(glmnet)
library(caret)
library(vip)

data_y <- as.vector(mtcars$mpg)
data_x <- as.matrix(mtcars[-1])

fit.glmnet <- glmnet(data_x, data_y, family="gaussian")

set.seed(123)
cvfit.glmnet <- cv.glmnet(data_x, data_y, standardize = TRUE)
cvfit.glmnet$lambda.min
coef(cvfit.glmnet, s = "lambda.min")

Then I used the vip package to compute variable importance:

#Using vip package
vip::vi_model(cvfit.glmnet, s = cvfit.glmnet$fit$lambda)

which returns

# A tibble: 10 x 3
   Variable Importance Sign 
   <chr>         <dbl> <chr>
 1 cyl         -0.886  NEG  
 2 disp         0      NEG  
 3 hp          -0.0117 NEG  
 4 drat         0      NEG  
 5 wt          -2.71   NEG  
 6 qsec         0      NEG  
 7 vs           0      NEG  
 8 am           0      NEG  
 9 gear         0      NEG  
10 carb         0      NEG 

The importance values are signed (here negative or zero) rather than absolute, and they are not scaled to 0-1 or 0-100%.

Then I tried the customised function from this answer:

#Using function provided in this example
varImp <- function(object, lambda = NULL, ...) {
  
  ## skipping a few lines
  
  beta <- predict(object, s = lambda, type = "coef")
  if(is.list(beta)) {
    out <- do.call("cbind", lapply(beta, function(x) x[,1]))
    out <- as.data.frame(out)
  } else out <- data.frame(Overall = beta[,1])
  out <- abs(out[rownames(out) != "(Intercept)",,drop = FALSE])
  out
}

varImp(cvfit.glmnet, lambda = cvfit.glmnet$lambda.min)

It returns the following output:

        Overall
cyl  0.88608541
disp 0.00000000
hp   0.01168438
drat 0.00000000
wt   2.70814703
qsec 0.00000000
vs   0.00000000
am   0.00000000
gear 0.00000000
carb 0.00000000

Though the output from the customised function does not contain negative values, it still does not vary within 0-1 or 0-100%.

I know that the caret package has a varImp function that gives variable importance between 0-100%. But I want to do the same thing for a cv.glmnet object instead of a caret::train object. How can I obtain caret-like variable importance for a cv.glmnet object?

There is 1 answer

missuse (BEST ANSWER)

The question asks how to obtain glmnet variable importance between 0-100%.

If importance is to be assigned based on coefficient magnitude at a certain (usually optimal) penalty, and these coefficients are derived from standardized variables (the default in glmnet), then the coefficients can simply be scaled to the 0 - 1 range.
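One caveat worth checking here: as far as I know, glmnet standardizes the predictors internally when standardize = TRUE but reports the coefficients on the original scale of each variable. A minimal sketch of putting them on a common scale is to multiply each coefficient by the standard deviation of its predictor. The names b_raw and b_std are only illustrative and not part of any package.

```r
library(glmnet)

x <- as.matrix(mtcars[-1])
y <- mtcars$mpg

set.seed(123)
cvfit <- cv.glmnet(x, y)

# Coefficients at lambda.min with the intercept dropped
# (reported on the original scale of each predictor)
b_raw <- as.matrix(coef(cvfit, s = "lambda.min"))[-1, 1]

# Rescale each coefficient by the standard deviation of its predictor so that
# magnitudes are comparable across variables measured in different units
b_std <- b_raw * apply(x, 2, sd)

sort(abs(b_std), decreasing = TRUE)
```

The ranking from abs(b_std) can differ from the ranking of the raw coefficients when the predictors have very different spreads, as they do in mtcars.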

A slightly modified version of the function from the question is given below:

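varImp <- function(object, lambda = NULL, ...) {
  # Extract the coefficients at the requested penalty
  beta <- predict(object, s = lambda, type = "coef")
  if(is.list(beta)) {
    # One coefficient matrix per class (e.g. multinomial models)
    out <- do.call("cbind", lapply(beta, function(x) x[,1]))
    out <- as.data.frame(out)
  } else out <- data.frame(Overall = beta[,1])
  # Drop the intercept and take absolute values
  out <- abs(out[rownames(out) != "(Intercept)",,drop = FALSE])
  # Scale so that the largest coefficient has importance 1
  out <- out/max(out)
  # Sort by the first column (Overall in the single-response case)
  out[order(out[[1]], decreasing = TRUE),,drop=FALSE]
}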

Using the example in the question:

varImp(cvfit.glmnet, lambda = cvfit.glmnet$lambda.min)
#output
         Overall
wt   1.000000000
cyl  0.320796270
am   0.004840186
hp   0.004605913
disp 0.000000000
drat 0.000000000
qsec 0.000000000
vs   0.000000000
gear 0.000000000
carb 0.000000000
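To get the 0-100 range asked for in the question, the same idea can be written as a min-max rescaling, which is how I understand caret scales importance when scale = TRUE. This is only a sketch: scale_importance is a hypothetical helper name, and the input is any named vector of coefficients with the intercept already removed.

```r
# Hypothetical helper: rescale absolute coefficients to the 0-100 range
scale_importance <- function(coefs) {
  imp <- abs(coefs)
  rng <- range(imp)
  if (diff(rng) == 0) {
    # All importances are equal (e.g. every coefficient was shrunk to zero):
    # avoid dividing by zero and return zeros
    imp[] <- 0
    return(imp)
  }
  # Min-max scaling, sorted from most to least important
  sort(100 * (imp - rng[1]) / diff(rng), decreasing = TRUE)
}

# Example with made-up coefficients (not taken from a fitted model)
scale_importance(c(a = -2, b = 0.5, c = 0))
#>   a   b   c
#> 100  25   0
```

With the objects from the question it could be called as scale_importance(as.matrix(coef(cvfit.glmnet, s = "lambda.min"))[-1, 1]).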

Another approach to assigning variable importance in glmnet models is to score the variables by the penalty at which they enter the model: variables are more important if they are only excluded at higher penalties. This approach is planned for the mlr3 package at some point: https://github.com/mlr-org/mlr3learners/issues/28
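A rough sketch of this penalty-based scoring, using only the regularization path stored in a plain glmnet fit, could look like the following. It is an illustration of the idea, not necessarily what mlr3 implements, and entry_lambda is an illustrative name.

```r
library(glmnet)

x <- as.matrix(mtcars[-1])
y <- mtcars$mpg

# Fit the whole regularization path (one column of coefficients per lambda)
fit <- glmnet(x, y)
nonzero <- as.matrix(fit$beta) != 0

# For each variable, the largest penalty at which its coefficient is non-zero;
# variables that never enter the model get a score of 0
entry_lambda <- apply(nonzero, 1, function(nz) {
  if (any(nz)) max(fit$lambda[nz]) else 0
})

sort(entry_lambda, decreasing = TRUE)
```

Variables with a larger entry_lambda survive stronger shrinkage, so they rank higher. Ties are possible when several variables enter at the same step of the path.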