Reduce the number of levels in a factor variable


I have to fit a random forest on a large training set, but I can't use a factor variable with more than 53 levels.

The factor variable I need to reduce (train$tip) has 150 levels (KHC, KTF, KGL, ...). How can I quickly drop the levels that appear only a few times and keep just the 53 most frequent ones?

Do I have to write out the name of every rare level by hand, as in the line below, or is there a faster method?

train <- train[train$tip != "KTF", ]

There is 1 answer

scoa (best answer)

You could do:

train <- train[train$tip %in% names(sort(table(train$tip), decreasing = TRUE))[1:53], ]

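table() counts how often each level occurs; sort(..., decreasing = TRUE) orders those counts from most to least frequent; names() extracts the level names rather than the counts; and [1:53] keeps the first 53 of them. The %in% test then keeps only the rows whose tip is one of those 53 levels.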
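
One caveat about the line above: subsetting the rows of a data frame leaves the unused factor levels in place, so nlevels(train$tip) would still report 150 after the filter. Wrapping the result in droplevels() removes them. The sketch below shows this on a small made-up data frame (the counts and the column y are hypothetical, and n_keep is 3 instead of 53 to keep the example readable). It also shows an alternative, not part of the answer above, that keeps every row and collapses the rare levels into a single level, here arbitrarily named "Other".

```r
# Toy data standing in for the asker's `train` (hypothetical values).
train <- data.frame(
  tip = factor(rep(c("KHC", "KTF", "KGL", "KXA", "KXB"),
                   times = c(80, 60, 40, 14, 6))),
  y = seq_len(200)
)

n_keep <- 3  # the question needs 53; 3 keeps the toy example small

# Names of the n_keep most frequent levels. head() on the name vector
# avoids NA entries when the factor has fewer than n_keep levels.
top <- head(names(sort(table(train$tip), decreasing = TRUE)), n_keep)

# Option 1: drop rows with rare levels, then drop the now-unused levels.
kept <- droplevels(train[train$tip %in% top, ])

# Option 2 (an alternative, not from the answer): keep every row and
# collapse the rare levels into one "Other" level.
lumped <- train
lumped$tip <- as.character(lumped$tip)
lumped$tip[!lumped$tip %in% top] <- "Other"
lumped$tip <- factor(lumped$tip)

nlevels(train$tip)   # 5
nlevels(kept$tip)    # 3
nlevels(lumped$tip)  # 4
```

Option 1 discards the rows that carry rare levels (20 of the 200 here), while option 2 keeps all rows at the cost of one extra level, so with a limit of 53 you would keep the 52 most frequent levels plus "Other".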