variable encoding in K-fold validation of random forest using package 'caret'

Question

variable encoding in K-fold validation of random forest using package 'caret'

180 views Asked by Chente At 23 October 2020 at 07:16

I want to run a RF classification just like it's specified in 'randomForest' but still use the k-fold repeated cross validation method (code below). How do I stop caret from creating dummy variables out of my categorical ones? I read that this may be due to One-Hot-Encoding, but not sure how to change this. I would be very greatful for some example lines on how to fix this!

database:

> str(river)
'data.frame':   121 obs. of  13 variables:
 $ stat_bino     : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ Subfamily     : Factor w/ 14 levels "carettochelyinae",..: 14 14 14 14 8 8 8 8 8 8 ...
 $ MAXCL         : num  850 850 360 540 625 600 760 480 560 580 ...
 $ CS            : num  8 8 14 15 26 25.5 20 20 18 21.5 ...
 $ CF            : num  3.5 3.5 2.5 2 1.5 3 2 2 1 1 ...
 $ size_mat      : num  300 300 170 180 450 450 460 406 433 433 ...
 $ incubat       : num  97.5 97.5 71 72.5 91.5 67.5 73 55 83 80 ...
 $ diet          : Factor w/ 5 levels "omnivore leaning carnivore",..: 1 1 1 1 2 2 2 5 4 4 ...
 $ HDI           : num  721 627 878 885 704 ...
 $ HF09M93       : num  23.19 9.96 -8.52 -5.67 27.3 ...
 $ HF09          : num  116 121 110 110 152 ...
 $ deg_reg       : num  8.64 39.37 370.95 314.8 32.99 ...
 $ protected_area: num  7.55 10.93 2.84 2.89 12.71 …

the rest:

> control <- trainControl(method='repeatedcv', 
+                         number=5,repeats = 3, 
+                         search='grid') 

> tunegrid <- expand.grid(.mtry = (1:12)) 

> rf_gridsearch <- train(stat_bino ~ ., 
+                        data = river,
+                        method = 'rf',
+                        metric = 'Accuracy',
+                        ntree = 600,
+                        importance = TRUE,
+                        tuneGrid = tunegrid, trControl = control) 

> rf_gridsearch$finalModel[["xNames"]]
 [1] "Subfamilychelinae"              "Subfamilychelodininae"          "Subfamilychelydrinae"          
 [4] "Subfamilycyclanorbinae"         "Subfamilydeirochelyinae"        "Subfamilydermatemydinae"       
 [7] "Subfamilygeoemydinae"           "Subfamilykinosterninae"         "Subfamilypelomedusinae"        
                   
...you get the picture. I now have 27 predictors instead of 12.

Original Q&A

There are 1 answers

**missuse** · Accepted Answer · 2020-10-23T09:03:23+00:00

When you use the formula interface to train:

train(stat_bino ~ ., 
      ...

it will convert factors using dummy coding. This makes sense because formulas in most traditional R functions work this way (for instance lm).

However if you use the non formula interface:

train(y = river$stat_bino,
      x = river[,colnames(river) != "stat_bino"],
      ...

then caret will leave the variables as they are suppled. This is what you want with tree based methods, but it will produce errors with algorithms not capable of internally handling factors such as glmnet.

TechQA.

variable encoding in K-fold validation of random forest using package 'caret'

There are 1 answers

Related Questions in R

Related Questions in RANDOM-FOREST

Related Questions in R-CARET

Related Questions in ONE-HOT-ENCODING

Related Questions in K-FOLD

Popular Questions

Popular Tags

Trending Questions