GBM handling factor variables, worried about too many factors


I am working on a basketball model that predicts how well an NBA player will play in their next game, based on how well they have performed in all previous games of the season. There are roughly 10 players per NBA team, and each of the 30 teams has played about 25 games this season, so my dataframe has about 10*30*25 = 7,500 observations at this point. I run my model each day, predicting how well players will play the next day. For tomorrow, therefore, I will make roughly 10*30 = 300 predictions.

My question is this: I currently have about 50 columns / features / x-variables that I am using for prediction, all of which are numeric (average number of points scored, average number of rebounds, etc.). However, I think it may help my model to know which player each row corresponds to. That is, I want to pass a 51st column, a factor variable containing the players' names. I read online that GBM can deal with factor variables because it will "dummify" them internally, but I am worried that "dummifying" 300 different players will not perform well. Will passing a factor variable with all of the player names backfire and ultimately hurt my model because of the large number of dummy variables created internally, or is this okay?

my_df                        
                        PLAYER FG FGA X3P X3PA FT FTA
1042            Andre Drummond  6  16   0    0  6  10
17747            Marcus Morris  6  19   1    4  5   6
14861 Kentavious Caldwell-Pope  7  14   4    7  3   3
7976            Ersan Ilyasova  6  12   3    6  1   2
22401           Reggie Jackson  4  10   2    4  5   5
24475          Stanley Johnson  3  10   1    3  0   0
24649              Steve Blake  1   6   1    5  0   0
12489              Jodie Meeks  1   4   0    0  0   0
1955               Aron Baynes  3   5   0    0  0   0
21500             Paul Millsap  7  15   2    6  3   4

There are 2 answers

0
bia On

I have used factor variables with a large number of levels in gbm, and the biggest problem you will face is a significant increase in computation time (which may not matter in your case, as your dataset is small). Also, when you plot variable importance

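library(caret)  # provides train() and varImp()

# `training`, `grid` and `ctrl` are assumed to be defined earlier;
# metric = "ROC" needs classProbs = TRUE and summaryFunction = twoClassSummary in `ctrl`.
gbm_model <- train(A0 ~ ., 
                 data = training, 
                 method = "gbm",
                 distribution = "bernoulli",  # passed through to gbm
                 metric = "ROC",
                 maximize = TRUE,             # caret spells this argument "maximize"
                 tuneGrid = grid,
                 train.fraction = 0.6,        # passed through to gbm
                 trControl = ctrl) 
ggplot(varImp(gbm_model, scale = TRUE))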

each factor level shows up separately (train's formula interface expands the factor into dummy columns before gbm sees it), which can make it pretty confusing to assess importance.
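
To see why the levels appear one by one, here is a minimal sketch of what a formula interface does with a factor. It uses base R's model.matrix on a toy data frame (`toy` is a made-up name, with three rows borrowed from the question's table), and is only meant to illustrate the expansion, not to reproduce caret's internals exactly.

```r
# Toy data: three rows borrowed from the question's table (illustration only).
toy <- data.frame(
  PLAYER = factor(c("Andre Drummond", "Marcus Morris", "Reggie Jackson")),
  FG     = c(6, 6, 4)
)

# Building a design matrix from a formula turns the factor into
# one 0/1 column per level (minus the reference level).
X <- model.matrix(~ PLAYER + FG, data = toy)
colnames(X)
```

With 300 players, the same expansion would produce roughly 300 such columns, each of which gets its own importance score.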

Apart from this, you mention that you have 7,500 observations, 50 features and 300 different players. If you add player name as a variable, that would mean approximately 25 observations per player, which is a pretty small sample to work with and may mean that your model won't generalize well. So my personal suggestion would be to abstain from doing so.

However, I see the point of why you would want to do so and would suggest that you try clustering the players (using player-specific criteria or maybe even some features you already have) and then use the cluster a player belongs to as a variable.
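
A rough sketch of the clustering idea, using only base R. Everything here is assumed for illustration: the `player_avgs` data frame, its PTS/REB/AST columns, the simulated values, and the choice of k = 8 clusters are all hypothetical, and k-means is just one of several clustering methods you could try.

```r
set.seed(1)

# Hypothetical per-player season averages (one row per player, simulated).
player_avgs <- data.frame(
  PLAYER = paste("Player", 1:300),
  PTS    = rnorm(300, mean = 10, sd = 5),
  REB    = rnorm(300, mean = 4,  sd = 2),
  AST    = rnorm(300, mean = 2,  sd = 1.5)
)

# Standardise so that no single stat dominates the distance, then cluster.
# k = 8 is an arbitrary choice for illustration.
scaled <- scale(player_avgs[, c("PTS", "REB", "AST")])
km <- kmeans(scaled, centers = 8, nstart = 20)

# Store the cluster label as a low-cardinality factor.
player_avgs$CLUSTER <- factor(km$cluster)

# Attach it to the game-level data by player name, e.g.:
# my_df <- merge(my_df, player_avgs[, c("PLAYER", "CLUSTER")], by = "PLAYER")
table(player_avgs$CLUSTER)
```

The model then sees a factor with 8 levels instead of 300, so each level is backed by many more observations.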

Hope this helps! :)

0
Juan Sebastián Calderón On

I have the same problem with the gbm function. For instance, I added a random factor with 100 levels and it appears as the most influential variable.
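
If you want to check this on your own setup, something like the following sketch may help. It assumes the gbm package is installed; the simulated data, the variable names (`x1`, `x2`, `noise_factor`) and the tuning values are all made up for illustration, and the ranking you get will depend on the data and settings.

```r
library(gbm)  # assumes the gbm package is installed

set.seed(42)
n <- 2000

# Two numeric predictors; only x1 carries signal.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 + rnorm(n)

# A purely random factor with 100 levels, unrelated to y.
d$noise_factor <- factor(sample(sprintf("L%03d", 1:100), n, replace = TRUE))

fit <- gbm(y ~ x1 + x2 + noise_factor, data = d,
           distribution = "gaussian", n.trees = 200,
           interaction.depth = 3, shrinkage = 0.05, verbose = FALSE)

# Relative influence per predictor; compare noise_factor against x2,
# which is also pure noise but has no levels to split on.
imp <- summary(fit, n.trees = 200, plotit = FALSE)
print(imp)
```

Running the same comparison on held-out data (rather than relying on relative influence alone) is a more reliable way to decide whether a high-cardinality factor actually helps.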