R tidymodels / VIP variable importance determination


Via tidymodels and the vip package in R, I computed the variable importance. Code-wise, it looks like this:

rf_vi_fit %>%
  pull_workflow_fit() %>%
  vip(geom = "point") +
  labs(title = "Random forest variable importance")
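
For context, here is a minimal, self-contained sketch of how such a workflow might be assembled end to end. The data set (mtcars), the model spec, and the formula interface are illustrative assumptions, not the original setup; extract_fit_parsnip() is the current name for pull_workflow_fit():

library(tidymodels)
library(vip)

# Illustrative model spec; the ranger engine only records importance
# scores if asked to at fit time.
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_vi_fit <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(rf_spec) %>%
  fit(data = mtcars)

rf_vi_fit %>%
  extract_fit_parsnip() %>%   # newer replacement for pull_workflow_fit()
  vip(geom = "point") +
  labs(title = "Random forest variable importance")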

Visually it would look something like this:

[Plot: Random forest variable importance]

However, what does the variable importance actually entail? Variable importance can be based on multiple metrics, such as the gain in R-squared or the Gini loss, but I am unsure what the variable importance from vip is based on. My other models have variable importance values around 3 to 4, instead of around 0.005 as in this model.

I could not find what the variable importance is based on in the vip() documentation either.

1 Answer

Answered by hnagaty

The answer to your inquiry lies in various sections of the vip documentation: https://cran.r-project.org/web/packages/vip/vip.pdf.

The vip() function is a wrapper around vi() used to plot the variable importance scores. In the vip() documentation, the ... argument is described as "Additional optional arguments to be passed on to vi()".
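
To make the relationship concrete, here is a small hedged sketch (object names reused from the question above, which is an assumption): vip() plots exactly the scores that vi() returns, and anything extra you pass to vip() is forwarded to vi():

fit <- extract_fit_parsnip(rf_vi_fit)  # underlying parsnip fit

vi(fit)                       # tibble of Variable / Importance scores
vip(fit, num_features = 10)   # plots those same scores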

In the vi() function, there is an argument called method.

method = c("model", "firm", "permute", "shap")
Character string specifying the type of variable importance (VI) to compute. Current options are:
"model" (the default), for model-specific VI scores (see vi_model() for details).
"firm", for variance-based VI scores (see vi_firm() for details).
"permute", for permutation-based VI scores (see vi_permute for details).
"shap", for Shapley-based VI scores.
For more details on the variance-based methods, see Greenwell et al. (2018) and Scholbeck et al. (2019).
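
As a hedged illustration of switching methods (argument names follow the vi_permute() documentation; exact metric spellings can differ between vip versions, and the data and object names are assumptions carried over from the sketch above):

# Default: engine-specific scores, as documented in vi_model()
vi(fit, method = "model")

# Permutation importance needs the training data, the target column,
# a performance metric, and a prediction wrapper that returns a
# numeric vector of predictions
vi(fit,
   method       = "permute",
   train        = mtcars,
   target       = "mpg",
   metric       = "rmse",
   pred_wrapper = function(object, newdata) predict(object, newdata)$.pred)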

Then, if you check the documentation of vi_model(), it describes in detail the model-specific VI score for each type of model. Below is an excerpt describing the importance measures specific to random forest models.

Random forests typically provide two measures of variable importance.
The first measure is computed from permuting out-of-bag (OOB) data: for each tree, the prediction error on the OOB portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The differences are then averaged over all trees in the forest and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case). See importance for details, including additional arguments that can be passed via the ... argument.
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares. See importance for details.
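
The two measures can be seen directly with the randomForest package; this is a hedged sketch on an illustrative data set, with the type argument of randomForest::importance() selecting between them:

library(randomForest)

set.seed(101)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)

importance(rf, type = 1)  # permutation-based: mean decrease in accuracy (%IncMSE)
importance(rf, type = 2)  # impurity-based: total decrease in node impurity (IncNodePurity)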