ValueError with XGBoost on Multi-Class Dataset: Expected Classes 0-8, Got 0-9


I am encountering a ValueError when running an XGBoost model on a multi-class dataset. The error message is:

ValueError: Expected class labels {0,1,2,3,4,5,6,7,8}, got {0,1,3,4,5,6,7,8,9}

My dataset consists of 500 instances, with labels ranging from 0 to 9, for a total of 10 classes. The minority class (label 9) has 37 instances, so no class is vanishingly rare; despite this, I am receiving an error about an unexpected class label 9.

I have already tried different numbers of cross-validation folds (3, 5, and 10) without success. Interestingly, other sklearn models, such as Random Forest and AdaBoost, do not produce this error and work as expected.

Has anyone experienced a similar issue with XGBoost or can identify a possible cause for this error? Any insights or suggestions on how to resolve this would be greatly appreciated.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y: the feature matrix and the 10-class labels (0-9), defined earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))

1 Answer

Muhammed Yunus

Two things I would check:

  1. XGBoost requires class labels starting from 0, going up to n_classes - 1.

By default, XGBoost assumes input categories are integers starting from 0 till the number of categories [0, n_categories) [ref]

To check that y meets this format, try print(np.unique(y)) - it should be [0, 1, 2, ..., n_classes - 1].

If not, sklearn.preprocessing.LabelEncoder can map the original y onto the required format; a minimal sketch follows below.
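For instance, assuming y is the label array from the question, the check and the re-encoding could look like this:

import numpy as np
from sklearn.preprocessing import LabelEncoder

print(np.unique(y))  # XGBoost needs this to be [0, 1, ..., n_classes - 1]

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # maps arbitrary labels to 0 .. n_classes - 1
print(np.unique(y_encoded))

# After training on y_encoded, le.inverse_transform(y_pred)
# recovers the original label values.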

  2. The relatively small dataset (500 instances), combined with a test size of 0.2, means that infrequent labels can go missing from one of the splits. That matches the error message: class 2 is absent from the training set, so XGBoost infers only 9 classes (expecting labels 0-8) and then complains about label 9.

To enforce sampling from all labels in both splits, pass stratify=y to train_test_split, as in the sketch below. Everything stays as before, but the test split's label distribution now stays close to the original, so no label goes missing.
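The only change to the question's split is the extra stratify argument (reusing the question's X and y):

from sklearn.model_selection import train_test_split

# stratify=y keeps each class's proportion roughly equal across the splits,
# so even the 37-instance minority class lands in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)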

I think stratifying is a good idea even when you aren't getting this error (especially with class imbalance), because it keeps the test split's distribution representative of the original dataset instead of letting random sampling distort it.
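Putting both fixes together, here is a self-contained sketch; the make_classification call is just a synthetic stand-in, since the asker's actual X and y are not shown:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in: 500 instances, 10 classes
X, y = make_classification(n_samples=500, n_classes=10, n_informative=12, random_state=42)

# Fix 1: guarantee contiguous labels 0 .. n_classes - 1
y = LabelEncoder().fit_transform(y)

# Fix 2: stratify so every class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = xgb.XGBClassifier()
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))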