I need to encode a column, Education, that has four classes: Bachelor's, Master's, PhD, and High School. When I fit the label encoder on the training set (tr here) and transform the validation set (cv here), I get this error:
Error: y contains previously unseen labels: "PhD"
I understand that the error implies that the training set doesn't contain all 4 classes and thus when the encoder comes across the 4th one in the validation set, it throws the above error, but that isn't the case:
tr_df['Education'].value_counts()
Bachelor's 45084
High School 44751
PhD 44586
Master's 44321
Name: Education, dtype: int64
and
cv_df['Education'].value_counts()
Bachelor's 19282
Master's 19220
High School 19152
PhD 18951
Name: Education, dtype: int64
I tried going the other way around as well. Fitting on cv_df and transforming tr_df, I got:
Error: y contains previously unseen labels: "Bachelor's"
Please let me know what is actually happening here. This is the code I used:
encodable_columns = ['Education', 'EmploymentType', 'MaritalStatus',
'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']
le = LabelEncoder()
encoded_df = cv_df[encodable_columns].apply(le.fit_transform)
cv_df.drop(columns=encodable_columns, axis=1, inplace=True)
cv_df = pd.concat([cv_df, encoded_df], axis=1)
and
encoded_df = tr_df[encodable_columns].apply(le.transform)
tr_df.drop(columns=encodable_columns, axis=1, inplace=True)
tr_df = pd.concat([tr_df, encoded_df], axis=1)
This threw the error:
ValueError: y contains previously unseen labels: "Bachelor's"
I had a similar issue in a different project. To avoid trouble, I suggest taking a look at the documentation and mapping all the categorical features by hand. (You can also look at one-hot encoding, which might leave the dataset with too many features, or LabelEncoder, as you did, for an off-the-shelf solution.)
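If you stay with LabelEncoder, a likely cause of the error (judging from the code in the question) is that `apply(le.fit_transform)` refits the same `le` on every column in turn, so afterwards it only remembers the classes of the last column, `HasCoSigner`, and `le.transform` then fails on `Education`. A minimal sketch of one way around it, keeping one encoder per column and fitting on the training set only; the toy data frames below are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for tr_df / cv_df (values are invented for illustration).
tr_df = pd.DataFrame({
    "Education": ["Bachelor's", "PhD", "Master's", "High School"],
    "HasCoSigner": ["Yes", "No", "Yes", "No"],
})
cv_df = pd.DataFrame({
    "Education": ["PhD", "High School", "Bachelor's"],
    "HasCoSigner": ["No", "No", "Yes"],
})
encodable_columns = ["Education", "HasCoSigner"]

# One encoder per column, each fitted on the training data only.
encoders = {col: LabelEncoder().fit(tr_df[col]) for col in encodable_columns}

# Reuse the fitted encoders to transform both sets.
for col, le in encoders.items():
    tr_df[col] = le.transform(tr_df[col])
    cv_df[col] = le.transform(cv_df[col])
```

Keeping the encoders in a dict also means you can call `inverse_transform` per column later. A label that appears only in the validation set will still raise the same error, which in that case is a real signal rather than a bug.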
My idea is instead to collect all possible values (from the available sources) and then map each one to an integer code by hand.
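A minimal sketch of such a hand-written mapping for the Education column, assuming the four values from the question are the complete set. The name `education_map`, the helper `encode_education`, and the integer codes are my own choices for illustration, not something from a library:

```python
import pandas as pd

# Hypothetical mapping: the codes and their ordering are an assumption.
education_map = {
    "High School": 0,
    "Bachelor's": 1,
    "Master's": 2,
    "PhD": 3,
}

def encode_education(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with 'Education' replaced by integer codes."""
    # Fail loudly on values missing from the mapping instead of producing NaN.
    unknown = set(df["Education"].dropna()) - set(education_map)
    if unknown:
        raise ValueError(f"Unmapped Education values: {sorted(unknown)}")
    out = df.copy()
    out["Education"] = out["Education"].map(education_map)
    return out

# Toy frames standing in for tr_df / cv_df.
tr_df = pd.DataFrame({"Education": ["PhD", "Bachelor's", "High School"]})
cv_df = pd.DataFrame({"Education": ["Master's", "PhD"]})

tr_enc = encode_education(tr_df)
cv_enc = encode_education(cv_df)
```

Because the mapping is fixed up front, nothing is fitted on any dataset, and the same codes apply to training, validation, and test data alike.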
P.S. Fitting or doing such things using the test or validation dataset is 'illegal' xD