Key Error when Implementing Cross Validation with GroupKFold


I have a df with 3 main columns: 'label', 'embeddings' (features), and 'chr'. I am trying to do a 10-fold cross validation grouped by chromosome, so that all rows for a given chromosome (e.g. chr1) land either in the train set or in the test set, never split across both. The df looks like this: [image not included]

I believe my code does this correctly, but I keep running into this KeyError: [image not included]

Here's my code:

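import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GroupKFold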

X = np.array([np.array(x) for x in mini_df['embeddings']])
y = mini_df['label']
groups = mini_df['chromosome']
group_kfold = GroupKFold(n_splits=10)

# Initialize figure for plotting
plt.figure(figsize=(10, 6))

# Perform cross-validation and plot ROC curves for each fold
for i, (train_idx, val_idx) in enumerate(group_kfold.split(X, y, groups)):
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y[train_idx], y[val_idx]
    
    # Initialize classifier
    rf_classifier = RandomForestClassifier(n_estimators=n_trees, random_state=42, max_depth=max_depth, n_jobs=-1)
    
    # Train the classifier on this fold
    rf_classifier.fit(X_train_fold, y_train_fold)
    
    # Make predictions on the validation set
    y_pred_proba = rf_classifier.predict_proba(X_val_fold)[:, 1]
    
    # Calculate ROC curve
    fpr, tpr, _ = roc_curve(y_val_fold, y_pred_proba)
    
    # Calculate AUC
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve for this fold
    plt.plot(fpr, tpr, lw=1, alpha=0.7, label=f'ROC Fold {i+1} (AUC = {roc_auc:.2f})')

# Plot ROC for random classifier
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Random', alpha=0.8)

# Add labels and legend
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves for Random Forest Classifier')
plt.legend(loc='lower right')
plt.show()

There are 2 answers

Answer by c p (best answer)

The error appears on the y object, and not on the X object. This means that the X[train_idx] and X[val_idx] operations are executed successfully.

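X is a NumPy array, so X[train_idx] selects rows by position. y is a pandas Series (it comes from mini_df['label']), and y[train_idx] looks the values up as index labels instead. GroupKFold.split returns positions, so the lookup raises a KeyError whenever the Series index is not the default 0..n-1 range. You can try converting the pandas object to a NumPy array (https://pandas.pydata.org/pandas-docs/version/0.24.0rc1/api/generated/pandas.Series.to_numpy.html):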

y = mini_df['label'].to_numpy()

or, if you want to keep y as a pandas object, select the rows by position with iloc[]:

y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
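
To see the difference between label-based and positional lookup in isolation, here is a minimal sketch. The Series values and the index labels (10 to 13) are made up for illustration; the question does not show the real index of mini_df, so a non-default index is an assumption about the cause, not a confirmed diagnosis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the index labels (10..13) do not match the row
# positions (0..3), as can happen after filtering a larger DataFrame.
y = pd.Series([0, 1, 1, 0], index=[10, 11, 12, 13], name="label")
train_idx = np.array([0, 2])  # positional indices, like those from split()

try:
    y[train_idx]  # label-based lookup: labels 0 and 2 do not exist
    raised = False
except KeyError:
    raised = True

by_position = y.iloc[train_idx]     # positional lookup on the Series
as_array = y.to_numpy()[train_idx]  # positional lookup on a NumPy array
```

Both fixes pick the same two rows; iloc keeps the original index labels on the result, while to_numpy() drops them.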
Answer by Bill Horvath

A KeyError means a lookup used a key that does not exist. In this case the object called y (assigned from mini_df['label']) is a pandas Series rather than a dictionary, and its index doesn't contain the labels in train_idx or val_idx (or both; it's impossible to tell because the error message is cut off in the image).

To figure out what the problem is, you could do something like this:

...
# Values in the fold indices that are not labels in y's index
missing_train = np.setdiff1d(train_idx, y.index)
missing_val = np.setdiff1d(val_idx, y.index)
assert missing_train.size == 0, f"y has no index labels {missing_train}"
assert missing_val.size == 0, f"y has no index labels {missing_val}"

X_train_fold, X_val_fold = X[train_idx], X[val_idx]
y_train_fold, y_val_fold = y[train_idx], y[val_idx]
...

np.setdiff1d returns the values of its first argument that are absent from the second, so missing_train and missing_val hold the fold indices that are not labels in y. If either one is non-empty, the assert stops execution before the KeyError and prints the message after the comma, which lists the missing labels. Note that y.get(train_idx) is not a reliable test here: when the labels do exist it returns a Series, and asserting on a Series raises a ValueError because its truth value is ambiguous.
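
For a check that runs end to end, the sketch below applies positional indexing inside a GroupKFold loop and confirms that no chromosome ends up on both sides of a split. Everything in toy_df is synthetic and hypothetical (60 rows, 12 chromosome names, an index starting at 1000, random features); it stands in for mini_df, whose real contents are not shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Synthetic stand-in for mini_df (hypothetical values, non-default index).
rng = np.random.default_rng(0)
n_rows = 60
toy_df = pd.DataFrame(
    {
        "label": rng.integers(0, 2, size=n_rows),
        "chromosome": [f"chr{i % 12 + 1}" for i in range(n_rows)],
    },
    index=np.arange(1000, 1000 + n_rows),
)
X = rng.normal(size=(n_rows, 4))  # random features in place of embeddings
y = toy_df["label"]
groups = toy_df["chromosome"]

group_kfold = GroupKFold(n_splits=10)
fold_sizes = []
for train_idx, val_idx in group_kfold.split(X, y, groups):
    # split() yields positions, so index pandas objects with iloc.
    X_train_fold, X_val_fold = X[train_idx], X[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
    train_groups = set(groups.iloc[train_idx])
    val_groups = set(groups.iloc[val_idx])
    # No chromosome appears on both sides of the split.
    assert train_groups.isdisjoint(val_groups)
    fold_sizes.append((len(y_train_fold), len(y_val_fold)))
```

GroupKFold needs at least as many distinct groups as n_splits, which is why the toy data uses 12 chromosome names for 10 folds.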