How to capture the most important variables in Bootstrapped models in R?

Question

How to capture the most important variables in Bootstrapped models in R?

379 views Asked by jad At 22 March 2022 at 01:33

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.

My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.

For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.

library(glmnet)

data("mtcars") # Base R Dataset
df <- mtcars


topvar <- list()

for (i in 1:100) {
  
  # CV and Splitting
  
  ind <- sample(nrow(df), nrow(df), replace = TRUE)
  ind <- unique(ind)
  
  train <- df[ind, ]
  xtrain <- model.matrix(mpg~., train)[,-1]
  ytrain <- df[ind, 1]
  
  test <- df[-ind, ]
  xtest <- model.matrix(mpg~., test)[,-1]
  ytest <- df[-ind, 1]
  
  # Create Model per Loop
 
  model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2) 
                     
  # Store Coeffecients per loop
  
  coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
  
  # Store all nonzero Coefficients
  
  topvar[[i]] <- coef_las[which(coef_las != 0)]
  
}

# Unlist 

varimp <- unlist(topvar)

# Count all predictors

novar <- table(names(varimp))

# Find the mean of all variables

meanvar <- tapply(varimp, names(varimp), mean)

# Return top 3 repeated Coefs

repvar <- novar[order(novar, decreasing = TRUE)][1:3]

# Return mean of repeated Coefs

repvar.mean <- meanvar[names(repvar)]

repvar

Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?

Original Q&A

There are 1 answers

**Brian Syzdek** · Answer 1 · 2022-03-22T03:28:20+00:00

Brian Syzdek On 22 March 2022 at 03:28

You can use function set.seed() to ensure your sample will return the same sample each time. For example

set.seed(123)

When I add this to above code and then run twice, the following is returned both times:

  wt carb   hp 
  98   89   86

TechQA.

How to capture the most important variables in Bootstrapped models in R?

There are 1 answers

Related Questions in R

Related Questions in FEATURE-SELECTION

Related Questions in LASSO-REGRESSION

Related Questions in VARIABLE-SELECTION

Popular Questions

Trending Questions