Label Encoder can't 'see' previously 'seen' labels

I need to encode a column, Education, that has 4 classes: Bachelor's, Master's, PhD, and High School. When I fit the label encoder on the training set (tr, here) and transform the validation set (cv, here), I get this error:

Error: y contains previously unseen labels: "PhD"

I understand that the error implies that the training set doesn't contain all 4 classes and thus when the encoder comes across the 4th one in the validation set, it throws the above error, but that isn't the case:

tr_df['Education'].value_counts()
Bachelor's     45084
High School    44751
PhD            44586
Master's       44321
Name: Education, dtype: int64

and

cv_df['Education'].value_counts()
Bachelor's     19282
Master's       19220
High School    19152
PhD            18951
Name: Education, dtype: int64

I tried it the other way around as well: fitting on cv_df and transforming tr_df. I got:

Error: y contains previously unseen labels: "Bachelor's"

Please let me know what is actually happening here. This is the code I used:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

encodable_columns = ['Education', 'EmploymentType', 'MaritalStatus',
                     'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']
le = LabelEncoder()
encoded_df = cv_df[encodable_columns].apply(le.fit_transform)
cv_df.drop(columns=encodable_columns, axis=1, inplace=True)
cv_df = pd.concat([cv_df, encoded_df], axis=1)

and

encoded_df = tr_df[encodable_columns].apply(le.transform)
tr_df.drop(columns=encodable_columns, axis=1, inplace=True)
tr_df = pd.concat([tr_df, encoded_df], axis=1)

Threw the error:

ValueError: y contains previously unseen labels: "Bachelor's"

There are 2 answers

Aboc On

I had a similar issue in a different project. To avoid any trouble, I suggest taking a look at the documentation and mapping all the categorical features by hand. (For an off-the-shelf solution you can also look at one-hot encoding, which might leave the dataset with too many features, or LabelEncoder, as you did.)

My idea is instead to collect all possible values (from the available sources) and then do something like:

# Mapping for encoding
education_mapping = {"Bachelor's": 0, "Master's": 1, "PhD": 2, "High School": 3}  # <-- All possible categories should be here

# Apply mapping to the columns with categorical data in both training and test sets
tr['Education_encoded'] = tr['Education'].map(education_mapping)
cv['Education_encoded'] = cv['Education'].map(education_mapping)

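P.S. Fitting an encoder (or any other preprocessing step) on the test or validation dataset is 'illegal' xD. It leaks information from the held-out data, so fit on the training set only.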
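
One caveat with the manual mapping above: as far as I know, Series.map does not raise on a value that is missing from the dictionary, it just produces NaN. A minimal sketch of how one might check for that, using a made-up column (the "Diploma" value and the variable names are assumptions for illustration only):

```python
import pandas as pd

education_mapping = {"Bachelor's": 0, "Master's": 1, "PhD": 2, "High School": 3}

# Toy column (made up); "Diploma" is deliberately missing from the mapping
education = pd.Series(["PhD", "Diploma", "Master's"])

encoded = education.map(education_mapping)

# Values absent from the dictionary become NaN instead of raising an error
unmapped = sorted(education[encoded.isna()].unique())
if unmapped:
    print(f"Categories missing from the mapping: {unmapped}")
```

Running a check like this on both the training and validation columns should catch a forgotten category before it turns into silent NaNs downstream.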

Luca Anzalone On

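The error comes from reusing a single LabelEncoder for every column. With the default axis=0, cv_df[encodable_columns].apply(le.fit_transform) calls le.fit_transform once per column and refits le each time, so afterwards le only knows the classes of the last column (HasCoSigner). The later le.transform call on Education then meets labels such as "Bachelor's" that this encoder was never fit on.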
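
To see what apply(le.fit_transform) does to one shared encoder, here is a small sketch on made-up data. The two-column frame is an assumption for illustration, not the asker's dataset:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up for illustration): two categorical columns
df = pd.DataFrame({
    "Education": ["PhD", "Bachelor's", "Master's", "High School"],
    "HasCoSigner": ["Yes", "No", "Yes", "No"],
})

le = LabelEncoder()
# apply() calls le.fit_transform once per column, refitting `le` each time
encoded = df.apply(le.fit_transform)

# After the call, `le` only holds the classes of the last column it was fit on
last_classes = list(le.classes_)
print(last_classes)

# Transforming 'Education' with that encoder therefore fails
try:
    le.transform(df["Education"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The encoder ends up knowing only the HasCoSigner values, which is why even a label that is clearly present in both datasets gets reported as unseen.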

Instead, use a separate encoder for each column, as follows:

for col in encodable_columns:
    label_encoder = LabelEncoder()
    cv_df[col] = label_encoder.fit_transform(cv_df[col])

If you wish, you can write the encoded values to a different column name and then concatenate, as you do now.
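
If the same encoding has to be applied to both tr_df and cv_df, one option is to keep the fitted encoder for each column and reuse it. A rough sketch under that assumption, with toy stand-ins for the two frames (the data and the `encoders` dictionary are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for tr_df and cv_df (values made up for illustration)
tr_df = pd.DataFrame({
    "Education": ["PhD", "Bachelor's", "Master's", "High School"],
    "HasCoSigner": ["Yes", "No", "Yes", "No"],
})
cv_df = pd.DataFrame({
    "Education": ["Master's", "PhD"],
    "HasCoSigner": ["No", "Yes"],
})
encodable_columns = ["Education", "HasCoSigner"]

encoders = {}  # one fitted encoder per column, keyed by column name
for col in encodable_columns:
    encoders[col] = LabelEncoder()
    # Fit on the training data only ...
    tr_df[col] = encoders[col].fit_transform(tr_df[col])
    # ... then reuse the same fitted encoder on the validation data
    cv_df[col] = encoders[col].transform(cv_df[col])

print(cv_df)
```

Keeping the encoders in a dictionary also makes it possible to call inverse_transform later for a given column.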