variable encoding in K-fold validation of random forest using package 'caret'

I want to run an RF classification just as it is specified in 'randomForest', but still use repeated k-fold cross-validation (code below). How do I stop caret from creating dummy variables out of my categorical ones? I read that this may be due to one-hot encoding, but I'm not sure how to change it. I would be very grateful for some example lines showing how to fix this!

database:

> str(river)
'data.frame':   121 obs. of  13 variables:
 $ stat_bino     : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ Subfamily     : Factor w/ 14 levels "carettochelyinae",..: 14 14 14 14 8 8 8 8 8 8 ...
 $ MAXCL         : num  850 850 360 540 625 600 760 480 560 580 ...
 $ CS            : num  8 8 14 15 26 25.5 20 20 18 21.5 ...
 $ CF            : num  3.5 3.5 2.5 2 1.5 3 2 2 1 1 ...
 $ size_mat      : num  300 300 170 180 450 450 460 406 433 433 ...
 $ incubat       : num  97.5 97.5 71 72.5 91.5 67.5 73 55 83 80 ...
 $ diet          : Factor w/ 5 levels "omnivore leaning carnivore",..: 1 1 1 1 2 2 2 5 4 4 ...
 $ HDI           : num  721 627 878 885 704 ...
 $ HF09M93       : num  23.19 9.96 -8.52 -5.67 27.3 ...
 $ HF09          : num  116 121 110 110 152 ...
 $ deg_reg       : num  8.64 39.37 370.95 314.8 32.99 ...
 $ protected_area: num  7.55 10.93 2.84 2.89 12.71 …

the rest:

> control <- trainControl(method='repeatedcv', 
+                         number=5,repeats = 3, 
+                         search='grid') 

> tunegrid <- expand.grid(.mtry = (1:12)) 

> rf_gridsearch <- train(stat_bino ~ ., 
+                        data = river,
+                        method = 'rf',
+                        metric = 'Accuracy',
+                        ntree = 600,
+                        importance = TRUE,
+                        tuneGrid = tunegrid, trControl = control) 

> rf_gridsearch$finalModel[["xNames"]]
 [1] "Subfamilychelinae"              "Subfamilychelodininae"          "Subfamilychelydrinae"          
 [4] "Subfamilycyclanorbinae"         "Subfamilydeirochelyinae"        "Subfamilydermatemydinae"       
 [7] "Subfamilygeoemydinae"           "Subfamilykinosterninae"         "Subfamilypelomedusinae"        
                   
...you get the picture. I now have 27 predictors instead of 12.

missuse On BEST ANSWER

When you use the formula interface to train:

train(stat_bino ~ ., 
      ...

it will convert factors using dummy coding: a factor with k levels becomes k - 1 indicator columns. This makes sense because formulas in most traditional R functions work this way (for instance, lm).
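To see the kind of expansion involved, here is a minimal base-R sketch using model.matrix, which performs the same style of dummy coding that formula-based functions rely on. The data frame `d` and its columns are made up for illustration. By the same arithmetic, the 14-level Subfamily factor yields 13 columns and the 5-level diet factor yields 4, which together with the 10 numeric predictors gives the 27 reported in the question.

```r
# Hypothetical toy data: one 3-level factor and one numeric predictor
d <- data.frame(
  y = c(1, 2, 3, 4),
  f = factor(c("a", "b", "c", "a")),
  x = c(0.5, 1.5, 2.5, 3.5)
)

# Expanding the formula turns the factor into indicator columns,
# dropping the first level ("a") as the reference
cols <- colnames(model.matrix(y ~ ., data = d))
print(cols)
# "(Intercept)" "fb" "fc" "x"
```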

However, if you use the non-formula interface:

train(y = river$stat_bino,
      x = river[,colnames(river) != "stat_bino"],
      ...

then caret will leave the variables as they are supplied. This is what you want with tree-based methods, but it will produce errors with algorithms that cannot handle factors internally, such as glmnet.
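For completeness, a self-contained sketch of the non-formula call with the same repeated-CV settings as in the question. The data frame `toy` and its column names are hypothetical stand-ins for `river`, and ntree and the mtry grid are reduced only to keep the example quick; it assumes the caret and randomForest packages are installed.

```r
library(caret)  # method = 'rf' additionally needs the randomForest package

set.seed(1)
# Hypothetical toy data standing in for `river`:
# a two-level outcome, one factor predictor, two numeric predictors
toy <- data.frame(
  y    = factor(rep(c("0", "1"), 30)),
  grp  = factor(rep(c("a", "b", "c"), 20)),
  num1 = rnorm(60),
  num2 = rnorm(60)
)

control <- trainControl(method = 'repeatedcv',
                        number = 5, repeats = 3,
                        search = 'grid')

# Only 3 predictors here, so mtry can range over 1:3
tunegrid <- expand.grid(.mtry = 1:3)

# Non-formula interface: predictors go in as a data frame, outcome as a vector
fit_xy <- train(x = toy[, colnames(toy) != "y"],
                y = toy$y,
                method = 'rf',
                metric = 'Accuracy',
                ntree = 100,
                importance = TRUE,
                tuneGrid = tunegrid,
                trControl = control)

# The predictor names should be the original columns, with no "grpb"/"grpc" dummies
x_names <- fit_xy$finalModel[["xNames"]]
print(x_names)
```

Applied to the question's data, the same pattern would be x = river[, colnames(river) != "stat_bino"] and y = river$stat_bino, leaving 12 predictors so that the 1:12 grid for mtry still fits.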