As far as I can tell, glmnet in R requires categorical variables to be converted into dummy variables before they are passed to the model. After fitting a model with glmnet, I save it to an RDS file and read that file into a separate script. In that script I have a test set on which I want to run predictions using predict.glmnet. Since the model was trained on dummy variables, predict.glmnet requires the test set to be converted to dummy variables as well before it is passed to predict. My training data had a column with 3 categories, but my test set contains only one of those categories, so R does not let me convert the test set to dummy variables.
I am using model.matrix to perform the conversion and I run into the following error:
- contrasts can be applied only to factors with 2 or more levels
Hence, my prediction script fails before even reaching predict.glmnet.
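A minimal sketch that reproduces the failure with base R only. The data frame, the column names x and color, and the category values are made up for illustration, since my real data is not shown here:

```r
# Hypothetical data: the real column names and categories differ.
train <- data.frame(
  x     = c(1.2, 0.7, 3.1, 2.4, 1.9, 0.3),
  color = factor(c("red", "green", "blue", "red", "green", "blue"))
)
test <- data.frame(
  x     = c(0.5, 1.5),
  color = factor(c("red", "red"))   # only one category present
)

# Training side: the 3-level factor expands into 2 dummy columns
# next to the intercept and x.
train_mm <- model.matrix(~ x + color, data = train)
colnames(train_mm)

# Test side: the factor has a single level, so model.matrix stops
# with the contrasts error. tryCatch captures the message as a string.
test_result <- tryCatch(
  model.matrix(~ x + color, data = test),
  error = function(e) conditionMessage(e)
)
test_result
```

Here train_mm has 4 columns, while the call on the test set never returns a matrix at all.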
I tried fixing this error temporarily by introducing another category in the column for the test set. This allowed me to create dummy variables and perform prediction. However, predict.glmnet ran into the following error next:
Error in predict.glmnet(modfile, testdata) : Number of variables in newx must be 28.
This happened because, even with the added category, my test set still had one category fewer than the training set, so after the categorical variable was split into dummy variables the test matrix had fewer columns than the model was trained on.
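The column mismatch can be seen without glmnet by comparing the design matrices directly. This is again a sketch on made-up data with a hypothetical color column, not my actual 28-variable model:

```r
# Hypothetical training data with three categories.
train <- data.frame(color = factor(c("red", "green", "blue", "red")))

# Test set after the workaround: one extra category was added so that
# the factor has two levels, but the third category is still absent.
test_padded <- data.frame(color = factor(c("red", "red", "green")))

train_cols <- colnames(model.matrix(~ color, data = train))
test_cols  <- colnames(model.matrix(~ color, data = test_padded))

length(train_cols)   # 3: "(Intercept)", "colorgreen", "colorred"
length(test_cols)    # 2: "(Intercept)", "colorred"

# Dummy column present at training time but missing at prediction time.
missing_cols <- setdiff(train_cols, test_cols)
missing_cols         # "colorgreen"
```

Note that the baseline level also differs between the two matrices (blue for the training data, green for the padded test data), so even the columns that share a name are not guaranteed to mean the same thing.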
Ideally it shouldn't be necessary for every category to be present in the test set, but that is the only case that works smoothly for me right now. I am looking for alternative approaches to handle this.