I am building a prediction model in Python with separate training and testing sets. The training data contains numerical categorical variables, e.g., zip codes [91521, 23151, 12355, ...], and also string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].
To train the model, I first use pd.get_dummies to create dummy variables for these columns, and then fit the model on the transformed training data.
I apply the same transformation to my test data and predict the result using the trained model. However, I get the error
ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345
This happens because the test data produces fewer dummy variables: it contains fewer unique 'city' and 'zipcode' values than the training data.
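For example, a minimal reproduction (with made-up city values; the real data has many more categories) shows the mismatch:

    import pandas as pd

    train = pd.DataFrame({'city': ['Chicago', 'New York', 'Los Angeles']})
    test = pd.DataFrame({'city': ['Chicago', 'Los Angeles']})  # fewer categories

    print(pd.get_dummies(train).shape)  # (3, 3): three dummy columns
    print(pd.get_dummies(test).shape)   # (2, 2): only two dummy columns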
How can I solve this problem? For example, OneHotEncoder will only encode numerical categorical variables, and DictVectorizer will only encode string categorical variables. I searched online and found a few similar questions, but none of them really addresses mine.
Handling categorical features using scikit-learn
https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python
Assuming you have identical feature names in the train and test datasets, you can concatenate the two, get dummies from the concatenated dataset, and then split it back into train and test.
You can do it this way:
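Here is a minimal sketch, assuming pandas DataFrames named train and test that share the 'zipcode' and 'city' columns from the question (passing columns= makes get_dummies encode the numeric zipcode column as well):

    import pandas as pd

    # Toy data standing in for the real train/test sets.
    train = pd.DataFrame({'zipcode': [91521, 23151, 12355],
                          'city': ['Chicago', 'New York', 'Los Angeles']})
    test = pd.DataFrame({'zipcode': [91521, 12355],
                         'city': ['Chicago', 'Los Angeles']})

    n_train = len(train)

    # Concatenate, encode both categorical columns at once, then split
    # back using the original row count.
    combined = pd.concat([train, test], ignore_index=True)
    dummies = pd.get_dummies(combined, columns=['zipcode', 'city'])

    train_encoded = dummies.iloc[:n_train]
    test_encoded = dummies.iloc[n_train:]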
As a result, you have the same number of features in the train and test datasets.