Label Encoder can't 'see' previously 'seen' labels

Question

45 views Asked by Aditya Shandilya At 29 November 2023 at 16:04

I need to encode a column, Education, that has 4 classes: Bachelor's, Master's, PhD, and High School. When I fit the label encoder on the training set (tr here) and transform the test set (cv here), I get this error:

Error: y contains previously unseen labels: "PhD"

I understand the error to mean that the training set doesn't contain all 4 classes, so the encoder fails when it meets the 4th one in the validation set. But that isn't the case:

tr_df['Education'].value_counts()
Bachelor's     45084
High School    44751
PhD            44586
Master's       44321
Name: Education, dtype: int64

and

cv_df['Education'].value_counts()
Bachelor's     19282
Master's       19220
High School    19152
PhD            18951
Name: Education, dtype: int64

I also tried the other way around, fitting on cv_df and transforming tr_df, and got:

Error: y contains previously unseen labels: "Bachelor's"

Please let me know what is actually happening here. This code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

encodable_columns = ['Education', 'EmploymentType', 'MaritalStatus',
                     'HasMortgage', 'HasDependents', 'LoanPurpose', 'HasCoSigner']
le = LabelEncoder()
encoded_df = cv_df[encodable_columns].apply(le.fit_transform)
cv_df.drop(columns=encodable_columns, axis=1, inplace=True)
cv_df = pd.concat([cv_df, encoded_df], axis=1)

and

encoded_df = tr_df[encodable_columns].apply(le.transform)
tr_df.drop(columns=encodable_columns, axis=1, inplace=True)
tr_df = pd.concat([tr_df, encoded_df], axis=1)

Threw the error:

ValueError: y contains previously unseen labels: "Bachelor's"

There are 2 answers

**Aboc** · Answer 1 · 2023-11-29T18:44:06+00:00

I had a similar issue in a different project. To avoid any trouble, I suggest looking at the documentation and mapping all the categorical features by hand. (For an off-the-shelf solution you can also look at one-hot encoding, which might give the dataset too many features, or LabelEncoder, as you did.)

My idea is instead to check all possible values (from available sources) and then do something like:

# Mapping for encoding
education_mapping = {'Bachelor\'s': 0, 'Master\'s': 1, 'PhD': 2, 'High School': 3}  # <-- All possible categories should be here

# Apply mapping to the columns with categorical data in both training and test sets
tr['Education_encoded'] = tr['Education'].map(education_mapping)
cv['Education_encoded'] = cv['Education'].map(education_mapping)
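One thing to watch with the mapping approach: `Series.map` with a dict turns any label that is missing from the dict into NaN instead of raising an error. The sketch below, on made-up miniature frames, adds a check for that. The helper name `encode_with_mapping` and the extra label "Doctorate" are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical miniature frames standing in for tr and cv.
tr = pd.DataFrame({"Education": ["Bachelor's", "PhD", "High School", "Master's"]})
# "Doctorate" is deliberately absent from the mapping below.
cv = pd.DataFrame({"Education": ["PhD", "Master's", "Doctorate"]})

education_mapping = {"Bachelor's": 0, "Master's": 1, "PhD": 2, "High School": 3}


def encode_with_mapping(series, mapping):
    """Map labels to integers and list the labels the mapping does not cover."""
    encoded = series.map(mapping)  # labels missing from the dict become NaN
    missing = sorted(series[encoded.isna()].unique())
    return encoded, missing


tr_encoded, tr_missing = encode_with_mapping(tr["Education"], education_mapping)
cv_encoded, cv_missing = encode_with_mapping(cv["Education"], education_mapping)

print(tr_missing)
print(cv_missing)
```

If the second list is not empty, the mapping needs another entry before the encoded column is used for training.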

P.S. Fitting or doing such things using the test or validation dataset is 'illegal' xD

**Luca Anzalone** · Answer 2 · 2023-11-29T18:52:28+00:00

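The error comes from sharing one LabelEncoder across all the columns. By default, DataFrame.apply passes the function one column at a time, so encoded_df = cv_df[encodable_columns].apply(le.fit_transform) refits le on every column in turn. After the call, le only knows the classes of the last column (HasCoSigner), and transforming Education with it fails.

Instead, fit a separate encoder on each column, as follows: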
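A small reproduction on made-up data may help pin down the cause. By default, `DataFrame.apply` hands the function one column at a time, so a single shared encoder is refit on each column and ends up remembering only the last one. The two tiny frames below are invented and keep just two of the question's columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: two of the question's columns, a few rows each.
cv_df = pd.DataFrame({
    "Education": ["PhD", "Master's", "Bachelor's", "High School"],
    "HasCoSigner": ["Yes", "No", "Yes", "No"],
})
tr_df = pd.DataFrame({
    "Education": ["Bachelor's", "PhD", "High School", "Master's"],
    "HasCoSigner": ["No", "Yes", "No", "Yes"],
})
columns = ["Education", "HasCoSigner"]

le = LabelEncoder()
# apply() calls fit_transform once per column, refitting `le` each time.
cv_df[columns].apply(le.fit_transform)

# Only the classes of the last column are left in the encoder.
classes_after_apply = list(le.classes_)
print(classes_after_apply)

# Transforming the other frame now fails on the first column, Education.
try:
    tr_df[columns].apply(le.transform)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Both frames contain every Education class here, yet the transform still fails, which matches the situation described in the question.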

for col in encodable_columns:
    label_encoder = LabelEncoder()
    cv_df[col] = label_encoder.fit_transform(cv_df[col])

If you wish, you can write the encoded values to a different column name and then concatenate as you do.
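To reuse the encoding on a second dataframe, each column's fitted encoder has to be kept around. The following is a minimal sketch on made-up data, assuming the encoders are fit on the training frame and then applied to both frames. The `encoders` dict is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data standing in for the question's tr_df and cv_df.
tr_df = pd.DataFrame({
    "Education": ["Bachelor's", "PhD", "High School", "Master's"],
    "HasCoSigner": ["No", "Yes", "No", "Yes"],
})
cv_df = pd.DataFrame({
    "Education": ["PhD", "Master's", "Bachelor's", "High School"],
    "HasCoSigner": ["Yes", "No", "Yes", "No"],
})
encodable_columns = ["Education", "HasCoSigner"]

# One encoder per column, fit on the training frame only.
encoders = {}
for col in encodable_columns:
    encoders[col] = LabelEncoder().fit(tr_df[col])
    tr_df[col] = encoders[col].transform(tr_df[col])
    # Reuse the same fitted encoder so both frames share one label-to-integer mapping.
    cv_df[col] = encoders[col].transform(cv_df[col])

print(tr_df)
print(cv_df)
```

The scikit-learn documentation describes LabelEncoder as intended for target values. For feature columns, OrdinalEncoder accepts several columns in one call and may be the more natural fit.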
