I am trying to run a random forest model using the party package. I would like to use the varimp function to determine conditional variable importance, but it does not seem to accept categorical variables. Here is a link to my data, and below is the code I am using.
> library("party")
> #set up dataframe
> bll = read.csv("bll_Nov2013.csv", header=TRUE)
> SB_Pres <- bll$Sandbar_Presence #binary presence/absence
> Slope <-bll$Slope
> Dist2Shr <-bll$Dist2Shr
> Bathy <-bll$Bathy2
> Chla <-bll$GSM_Chl_Daily_MF
> SST <-bll$SST_PF_daily
> Region <- bll$Region
> MoonPhase <-bll$MoonPhase
> DaylightHours <- bll$DaylightHours
> bll_SB <- na.omit(data.frame(SB_Pres, Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region))
> #run cforest model
> SBcf<- cforest(formula = factor(SB_Pres) ~ SST + Chla + Dist2Shr+ DaylightHours + Bathy + Slope + MoonPhase + factor(Region), data = bll_SB, control = cforest_unbiased())
> SBcf
Random Forest using Conditional Inference Trees
Number of trees: 500
Response: factor(SB_Pres)
Inputs: SST, Chla, Dist2Shr, DaylightHours, Bathy, Slope, MoonPhase, factor(Region)
Number of observations: 534
> #Varimp works if conditional = FALSE
> varimp(SBcf, conditional = FALSE)
SST Chla Dist2Shr DaylightHours Bathy Slope
0.024744898 0.084244898 0.015632653 0.009571429 0.006448980 0.003357143
MoonPhase factor(Region)
0.002724490 0.095000000
> #Varimp does NOT work if conditional = TRUE
> varimp(SBcf, conditional = TRUE)
Error in model.frame.default(formula = ~SST + Chla + Dist2Shr + DaylightHours + :
variable lengths differ (found for 'factor(Region)')
If I drop the factor(Region) variable, then conditional variable importance can be calculated.
Is this a known behavior of the party package's varimp function with categorical predictors? From what I've read, it should be able to handle categorical predictors (Conditional variable importance for random forests, Strobl et al.), although that paper does not explicitly say that varimp(obj, conditional = TRUE) can be used with categorical predictors.
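As a side note on the error message itself: it is raised by base R's model.frame, which complains whenever the terms of a formula resolve to vectors of different lengths. The sketch below reproduces the same message without party, on made-up toy data (the values and the three-row data frame d are hypothetical, not the bll data). It assumes, without confirmation from the party source, that something similar happens when factor(Region) is evaluated inside varimp, so treat it as an illustration of the message rather than a diagnosis.

```r
# Hypothetical toy data, not the bll data set.
Region <- c("N", "S", "N", "S", "N")        # free-standing vector, length 5
d <- data.frame(SST = c(20.1, 21.3, 19.8))  # 3 rows, and no Region column

# SST is found in d (length 3), but Region is only found in the enclosing
# environment (length 5), so model.frame() signals an error.
msg <- tryCatch(
  {
    model.frame(~ SST + factor(Region), data = d)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# With the factor stored as a column of the same data frame,
# every term has the same length and the call succeeds.
d$Region <- factor(c("N", "S", "N"))
mf <- model.frame(~ SST + Region, data = d)
print(dim(mf))
```

The first call fails with a "variable lengths differ" message, while the second returns a model frame with one row per observation in d.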
Any insight would be greatly appreciated!
Thanks,
Liza
EDIT: Illustrating that if you define the variable as a factor outside of the formula, the conversion does not actually seem to take effect: the results are the same whether Region is specified as a factor or not. Compare these results to the other varimp (conditional = FALSE) run above, where the output shows the variable as "factor(Region)", whereas below it just shows up as "Region" in both runs.
> library("party")
> packageDescription("party")$Version
[1] "1.0-10"
> bll = read.csv("bll_SB.csv", header=TRUE)
> bll_SB <- na.omit(data.frame(bll))
> # region is specified as a factor
> bll_SB$SB_Pres <- factor(bll_SB$SB_Pres)
> bll_SB$Region <- factor(bll_SB$Region)
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB, control=cforest_unbiased())
> SBcf
Random Forest using Conditional Inference Trees
Number of trees: 500
Response: SB_Pres
Inputs: Slope, Dist2Shr, Bathy, Chla, SST, DaylightHours, MoonPhase, Region
Number of observations: 534
> system.time(res1 <- varimp(SBcf, conditional = FALSE))
user system elapsed
4.466 0.013 4.480
> res1
Slope Dist2Shr Bathy Chla SST DaylightHours
0.003632653 0.015908163 0.008285714 0.085367347 0.028846939 0.009520408
MoonPhase Region
0.002969388 0.093061224
> # Run again, region is not specified as a factor
> bll_SB$Region <- bll_SB$Region
> set.seed(1)
> SBcf <- cforest(SB_Pres ~ ., data=bll_SB, control=cforest_unbiased())
> system.time(res2 <- varimp(SBcf, conditional = FALSE))
user system elapsed
4.562 0.015 4.578
> res2
Slope Dist2Shr Bathy Chla SST DaylightHours
0.003632653 0.015908163 0.008285714 0.085367347 0.028846939 0.009520408
MoonPhase Region
0.002969388 0.093061224
I couldn't observe a problem in your example. I was able to compute the conditional variable importance for your data set using the following code: