cforest varimp does not seem to work with categorical predictors


I am trying to fit a random forest model using the party package. I would like to use the varimp function to compute conditional variable importance; however, it does not seem to accept categorical variables. Here is a link to my data, and below is the code I am using.

> #set up dataframe
> bll = read.csv("bll_Nov2013.csv", header=TRUE)
> SB_Pres <- bll$Sandbar_Presence #binary presence/absence
> Slope <-bll$Slope
> Dist2Shr <-bll$Dist2Shr
> Bathy <-bll$Bathy2
> Chla <-bll$GSM_Chl_Daily_MF
> SST <-bll$SST_PF_daily
> Region <- bll$Region
> MoonPhase <-bll$MoonPhase
> DaylightHours <- bll$DaylightHours
> bll_SB <- na.omit(data.frame(SB_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region))

> #run cforest model
> SBcf<- cforest(formula = factor(SB_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_SB,  control = cforest_unbiased())
> SBcf

     Random Forest using Conditional Inference Trees

Number of trees:  500 

Response:  factor(SB_Pres) 
Inputs:  SST, Chla, Dist2Shr, DaylightHours, Bathy, Slope, MoonPhase, factor(Region) 
Number of observations:  534 

> #Varimp works if conditional = FALSE
> varimp(SBcf, conditional = FALSE)
           SST           Chla       Dist2Shr  DaylightHours          Bathy          Slope 
   0.024744898    0.084244898    0.015632653    0.009571429    0.006448980    0.003357143 
     MoonPhase factor(Region) 
   0.002724490    0.095000000 


> #Varimp does NOT work if conditional = TRUE
> varimp(SBcf, conditional = TRUE)
Error in model.frame.default(formula = ~SST + Chla + Dist2Shr + DaylightHours +  : 
  variable lengths differ (found for 'factor(Region)')

If I drop the factor(Region) variable then conditional variable importance can be calculated.

Is this a known behavior of the party package's varimp function with categorical predictors? From what I've read (Strobl et al., "Conditional variable importance for random forests"), it should be able to handle categorical predictors, although the paper does not explicitly say that varimp(obj, conditional = TRUE) can be used with them.

Any insight would be greatly appreciated!

Thanks,

Liza

EDIT: If I convert the variable to a factor outside of the formula, the conversion does not appear to take effect: the results are the same whether Region is specified as a factor or not. Compare these results with the earlier varimp run (conditional = FALSE) above, where the output labels the variable "factor(Region)"; below it shows up as "Region" in both runs.

> library("party")
> packageDescription("party")$Version
[1] "1.0-10"
> bll = read.csv("bll_SB.csv", header=TRUE)
> bll_SB <- na.omit(data.frame(bll))

> # region is specified as a factor
> bll_SB$SB_Pres <- factor(bll_SB$SB_Pres)
> bll_SB$Region <- factor(bll_SB$Region)
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB,  control=cforest_unbiased())
> SBcf


     Random Forest using Conditional Inference Trees

Number of trees:  500 

Response:  SB_Pres 
Inputs:  Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region 
Number of observations:  534 

> system.time(res1 <- varimp(SBcf, conditional = FALSE))
   user  system elapsed 
  4.466   0.013   4.480 
> res1
        Slope      Dist2Shr         Bathy          Chla           SST DaylightHours 
  0.003632653   0.015908163   0.008285714   0.085367347   0.028846939   0.009520408 
    MoonPhase        Region 
  0.002969388   0.093061224 


> # Run again, region is not specified as a factor
> bll_SB$Region <- bll_SB$Region
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB,  control=cforest_unbiased())
> system.time(res2 <- varimp(SBcf, conditional = FALSE))
   user  system elapsed 
  4.562   0.015   4.578 
> res2
        Slope      Dist2Shr         Bathy          Chla           SST DaylightHours 
  0.003632653   0.015908163   0.008285714   0.085367347   0.028846939   0.009520408 
    MoonPhase        Region 
  0.002969388   0.093061224 
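
A note on the comparison above: assigning a column to itself does not change its class, so after `bll_SB$Region <- factor(bll_SB$Region)` the line `bll_SB$Region <- bll_SB$Region` leaves Region as a factor, and both runs fit the same model. A minimal sketch on toy data (the values are hypothetical, not the bll data, and it assumes Region holds numeric codes) shows the difference between a self-assignment and actually reverting the conversion:

```r
# Toy data frame (hypothetical values, not the bll data)
toy <- data.frame(Region = c(1, 2, 3, 1, 2))

toy$Region <- factor(toy$Region)   # convert the column in place
toy$Region <- toy$Region           # self-assignment: changes nothing
still_factor <- is.factor(toy$Region)

# To actually undo the conversion, recover the original numeric codes
toy$Region <- as.numeric(as.character(toy$Region))
reverted <- is.numeric(toy$Region)

print(c(still_factor = still_factor, reverted = reverted))
```

Checking `str(bll_SB)` before each cforest call is a simple way to confirm which class the model actually sees.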

1 Answer

Answered by rcs (best answer)

I couldn't reproduce the problem with your example. I was able to compute the conditional variable importance for your data set using the following code:

R> library("party")
R> packageDescription("party")$Version
[1] "1.0-10"

R> bll = read.csv("bll_SB.csv", header=TRUE)
R>
R> bll_SB <- na.omit(data.frame(bll))
R> bll_SB$SB_Pres <- factor(bll_SB$SB_Pres)
R> bll_SB$Region <- factor(bll_SB$Region)
R>
R> set.seed(1)
R> SBcf <- cforest(SB_Pres ~ ., data=bll_SB,  control=cforest_unbiased())
R> SBcf  
#
#          Random Forest using Conditional Inference Trees
#
# Number of trees:  500
#
# Response:  SB_Pres
# Inputs:  Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region
# Number of observations:  534

R> system.time(res1 <- varimp(SBcf, conditional = FALSE))
#   user  system elapsed
#  5.971   0.012   5.994
R> system.time(res2 <- varimp(SBcf, conditional = TRUE))
#   user  system elapsed
# 2704.1    58.2  2768.0
R> res1 
#         Slope      Dist2Shr         Bathy          Chla           SST
#      0.003633      0.015908      0.008286      0.085367      0.028847
# DaylightHours     MoonPhase        Region
#      0.009520      0.002969      0.093061
R> res2 
#         Slope      Dist2Shr         Bathy          Chla           SST
#    -6.122e-05     2.449e-03    -4.082e-05     1.004e-02     3.367e-03
# DaylightHours     MoonPhase        Region
#     5.714e-04     6.735e-04     1.067e-02
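
As for the original error, one possible explanation (a guess, not checked against the party internals) is that a formula term such as `factor(Region)` gets re-evaluated at a point where the data no longer has a column named `Region`. R then falls back to the `Region` vector in the workspace, which was created before `na.omit` and is therefore longer than the 534 retained rows. The base R sketch below reproduces the same kind of message with `model.frame`; all names and values in it are hypothetical toy data.

```r
# Toy data with one missing value (hypothetical, not the bll data)
set.seed(1)
full <- data.frame(y = rnorm(10), x = rnorm(10), g = rep(c("a", "b"), 5))
full$x[3] <- NA

g <- full$g                              # workspace copy, length 10
clean <- na.omit(full[, c("y", "x")])    # 9 rows, and no column named g

# x is found in `clean` (length 9), but g is only found in the
# workspace (length 10), so the lengths disagree.
res <- tryCatch(
  model.frame(~ x + factor(g), data = clean),
  error = function(e) conditionMessage(e)
)
print(res)
```

Converting the columns inside the data frame and using plain column names in the formula, as in the code above, keeps every variable the same length and avoids this kind of lookup.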