I am working on a basketball model that predicts how well an NBA player will play in their next game, based on how well they have performed in all previous games of the season. There are roughly 10 players per NBA team, and each of the 30 teams has played about 25 games this season, so my dataframe has about 10*30*25 = 7,500 observations at this point. I run my model each day to predict how well players will play the next day, so for tomorrow I will make roughly 10*30 = 300 predictions.
My question is this: currently I have about 50 columns / features / x-variables that I am using for prediction, all of which are numeric (average number of points scored, average number of rebounds, etc.). However, I think it may help my model to know which player each row corresponds to. That is, I want to pass a 51st column, a factor variable containing the players' names. I read online that GBM can deal with factor variables because it will "dummify" them internally, but I am worried that "dummifying" 300 different players will not perform well. Will passing a factor variable with all of the player names backfire and ultimately hurt my model because of the large number of dummy variables it creates internally, or is this okay?
my_df
                        PLAYER FG FGA X3P X3PA FT FTA
1042            Andre Drummond  6  16   0    0  6  10
17747            Marcus Morris  6  19   1    4  5   6
14861 Kentavious Caldwell-Pope  7  14   4    7  3   3
7976            Ersan Ilyasova  6  12   3    6  1   2
22401           Reggie Jackson  4  10   2    4  5   5
24475          Stanley Johnson  3  10   1    3  0   0
24649              Steve Blake  1   6   1    5  0   0
12489              Jodie Meeks  1   4   0    0  0   0
1955               Aron Baynes  3   5   0    0  0   0
21500             Paul Millsap  7  15   2    6  3   4
I have used factor variables with a large number of levels in gbm, and the biggest problem you will face is that your computation time will increase significantly (which may not be a problem in your case, as the dataset you are using is small). Also, when you plot variable importance, each factor level shows up separately, which can make it pretty confusing to assess importance.
Apart from this, you mention that you have 7,500 observations, 50 features and 300 different players. Adding player name as a variable would give you approximately 25 observations per player (7,500 / 300), which is a pretty small sample to work with and may mean that your model won't generalize well. So my personal suggestion would be to abstain from doing so.
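To make the sample-size point concrete, here is a rough sketch of the arithmetic. It is written in Python purely for illustration; the variable names and the `one_hot` helper are made up, and it assumes the factor really is expanded into one indicator column per player, as the question describes.

```python
# Figures taken from the question; all names below are illustrative.
players_per_team = 10
teams = 30
games_played = 25

n_players = players_per_team * teams      # levels of the player factor
n_rows = n_players * games_played         # observations so far
rows_per_player = n_rows // n_players     # rows available to learn each player's effect
share_of_ones = rows_per_player / n_rows  # fraction of 1s in each indicator column


def one_hot(names):
    """Expand a list of labels into one 0/1 indicator column per distinct label."""
    levels = sorted(set(names))
    position = {name: i for i, name in enumerate(levels)}
    encoded = [[1 if position[name] == j else 0 for j in range(len(levels))]
               for name in names]
    return levels, encoded


# Tiny demonstration on three rows, using two names from the sample data.
levels, encoded = one_hot(["Steve Blake", "Aron Baynes", "Steve Blake"])

print(n_players, n_rows, rows_per_player)
print(f"{share_of_ones:.2%} of each indicator column is non-zero")
print(levels, encoded)
```

Under that assumption, each of the 300 indicator columns is almost entirely zeros, so any split on a single player is estimated from only about 25 rows.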
That said, I see why you would want to do this, and I would suggest that you try clustering the players (using player-specific criteria or maybe even some of the features you already have) and then use the cluster a player belongs to as a variable.
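A minimal sketch of the clustering idea, again in Python for illustration only. It uses the ten sample rows from the question as stand-ins for per-player season averages. The choice of `k = 3`, the helper names, and the simplistic "first k points" initialisation are all my assumptions; in practice you would more likely use a library routine (for example `kmeans` in R) and pick `k` by validation.

```python
import math
from collections import defaultdict

# Rows copied from the my_df sample in the question:
# (PLAYER, FG, FGA, X3P, X3PA, FT, FTA)
rows = [
    ("Andre Drummond", 6, 16, 0, 0, 6, 10),
    ("Marcus Morris", 6, 19, 1, 4, 5, 6),
    ("Kentavious Caldwell-Pope", 7, 14, 4, 7, 3, 3),
    ("Ersan Ilyasova", 6, 12, 3, 6, 1, 2),
    ("Reggie Jackson", 4, 10, 2, 4, 5, 5),
    ("Stanley Johnson", 3, 10, 1, 3, 0, 0),
    ("Steve Blake", 1, 6, 1, 5, 0, 0),
    ("Jodie Meeks", 1, 4, 0, 0, 0, 0),
    ("Aron Baynes", 3, 5, 0, 0, 0, 0),
    ("Paul Millsap", 7, 15, 2, 6, 3, 4),
]


def standardize(vectors):
    """Scale each column to zero mean and unit variance so no stat dominates."""
    cols = list(zip(*vectors))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(v, means, sds)] for v in vectors]


def kmeans(points, k, max_iter=100):
    """Plain k-means; starts from the first k points (a simplification)."""
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Average each player's rows into one profile (here there is one row per player).
by_player = defaultdict(list)
for name, *stats in rows:
    by_player[name].append(stats)
names = list(by_player)
profiles = [[sum(col) / len(col) for col in zip(*by_player[n])] for n in names]

k = 3  # arbitrary choice for this sketch
cluster_of = dict(zip(names, kmeans(standardize(profiles), k)))

# Append the cluster id to every row as an extra, low-cardinality feature.
rows_with_cluster = [row + (cluster_of[row[0]],) for row in rows]
for row in rows_with_cluster:
    print(row)
```

The cluster id can then be passed to the model as a factor with `k` levels instead of 300, so each level is backed by many more rows.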
Hope this helps! :)