I am encountering a ValueError when running an XGBoost model on a multi-class dataset. The error message is:
ValueError: Expected class labels {0,1,2,3,4,5,6,7,8}, got {0,1,3,4,5,6,7,8,9}
My dataset consists of 500 instances with labels ranging from 0 to 9, for a total of 10 classes. The minority class (label 9) has only 37 instances. Even though 9 is a valid label in my data, the error treats it as unexpected.
I have already tried different numbers of cross-validation folds (3, 5, and 10) without success. Interestingly, other models from sklearn, such as Random Forest and AdaBoost, do not produce this error and work as expected.
Has anyone experienced a similar issue with XGBoost, or can anyone identify a possible cause for this error? Any insights or suggestions on how to resolve it would be greatly appreciated.
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X and y are my feature matrix and labels (500 instances, 10 classes)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a default XGBoost classifier; the ValueError is raised during fit
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out split
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
Two things I would check:
XGBoost expects the class labels to be consecutive integers from 0 to n_classes - 1. To check that y meets this format, try print(np.unique(y)) - it should be [0, 1, 2, ..., n_classes - 1]. If not, sklearn.preprocessing.LabelEncoder can map the original y to the appropriate format; see the sketch below.
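A minimal sketch of that re-encoding, using toy labels that skip the value 2 (stand-ins for your real y):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels: {0, 1, 3, ..., 9}, i.e. not consecutive from 0
y = np.array([0, 1, 3, 4, 5, 6, 7, 8, 9, 9])

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # remaps to consecutive integers 0..n_classes-1

print(np.unique(y_encoded))  # [0 1 2 3 4 5 6 7 8]
# le.inverse_transform(...) maps encoded predictions back to the original labels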
To enforce sampling from all labels in both splits, try modifying your code to train_test_split(..., stratify=y). Everything as before, but this time the test split's label distribution stays close to the original, so it won't miss any labels. I think stratifying is a good idea even if you weren't getting the error (especially when there's class imbalance), since it makes the test split representative of the original dataset rather than letting random sampling distort its distribution.
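Here is a small self-contained sketch of the stratified split, with placeholder data standing in for your X and y:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 500 samples across 10 classes, standing in for the real X and y
rng = np.random.default_rng(42)
X = rng.random((500, 4))
y = rng.integers(0, 10, size=500)

# stratify=y keeps the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(np.unique(y_train))  # all 10 classes appear in the training labels
print(np.unique(y_test))   # ...and in the test labels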