I have to fit a random forest on a large training set, but I can't use a factor variable with more than 53 levels.
The factor variable (`train$tip`) I need to reduce has 150 levels (KHC, KTF, KGL, ...). How can I quickly remove the levels that appear only a few times and keep only the 53 most frequent ones?
Do I have to write out the name of every rare level by hand, as in the line below, or is there a faster method?
train <- train[!train$tip == "KTF", ]
You could do this by chaining four functions: `table()` computes the frequency of each level; `sort()` (with `decreasing = TRUE`) orders them in decreasing order; `names()` gets the level names rather than the frequencies; and `[` selects only the first 53.
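
The chain described above could look roughly like the sketch below. It is only an illustration: the toy `train` data frame, the `n_keep`, `top_levels` and `train_small` names, and the final `droplevels()` step are assumptions added here, not part of the original post.

```r
# Toy data standing in for the asker's `train`; the counts are made up.
train <- data.frame(
  tip = factor(rep(c("KHC", "KTF", "KGL", "KAB"), times = c(100, 60, 30, 10))),
  y   = seq_len(200)
)

n_keep <- 2  # hypothetical name; this would be 53 in the question

# table() counts each level, sort() puts the most frequent first,
# names() extracts the level names, and `[` keeps the first n_keep of them.
top_levels <- names(sort(table(train$tip), decreasing = TRUE))[seq_len(n_keep)]

# Keep only the rows whose level is among the most frequent ones.
train_small <- train[train$tip %in% top_levels, ]

# Subsetting rows does not remove unused levels from a factor, so drop them
# explicitly (an added step, assumed necessary for the level limit).
train_small$tip <- droplevels(train_small$tip)

nlevels(train_small$tip)
```

Note that this discards every row belonging to a rare level. If those rows should be kept, an alternative would be to recode the rare levels into a single catch-all level instead of filtering them out.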