How can I implement a Leave One Patient Out Cross Validation in Python?


I have a dataset of roughly 1600 samples. The whole dataset is made up of 22 patients in total. Some patients contribute 250 samples, others just 10. The dataset is balanced overall: I have around 800 samples for each class, but the data of each individual patient is not balanced. I want to perform a binary classification with 'Leave One Patient Out' cross-validation on this dataset.
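
For illustration, this is roughly how I inspect the per-patient sample counts and class balance. The DataFrame and its 'patient_id' and 'label' columns are hypothetical stand-ins for my actual data:

import numpy as np
import pandas as pd

# Hypothetical stand-in data: one row per sample
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "patient_id": rng.integers(0, 22, size=1600),  # 22 patients
    "label": rng.integers(0, 2, size=1600),        # binary class label
})

# Samples per patient (in my data this ranges from ~10 to ~250)
print(df["patient_id"].value_counts())

# Class balance within each patient (imbalanced per patient,
# even though the dataset is balanced overall)
print(df.groupby("patient_id")["label"].value_counts(normalize=True))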

I have a patient ID linked to every sample in the entire dataset. I have split my dataset into 80% train and 20% test. Is there any way I can implement this cross-validation with sklearn? Possibly with the LeaveOneOut function?
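
For reference, here is a minimal sketch of what I am after, using sklearn's LeaveOneGroupOut, which, if I understand the docs correctly, builds one fold per unique group ID. X, y, and patient_ids are placeholders for my feature matrix, labels, and per-sample patient IDs:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Placeholder data: X is the feature matrix, y the binary labels,
# patient_ids the patient ID of each sample (22 unique values)
X = np.random.rand(1600, 10)
y = np.random.randint(0, 2, size=1600)
patient_ids = np.random.randint(0, 22, size=1600)

logo = LeaveOneGroupOut()   # One fold per unique patient ID
clf = RandomForestClassifier(criterion="gini", random_state=90,
                             max_depth=8, min_samples_leaf=10)

# cross_val_score trains on 21 patients and tests on the held-out
# patient, once per patient, and returns the 22 fold accuracies
scores = cross_val_score(clf, X, y, groups=patient_ids, cv=logo)
print(scores.mean())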

I have tried to do it manually by writing a function that iterates over every patient and splits the training data (80% of the whole data) into another pair of training and test sets. The training set consists of samples from 21 of the 22 patients and the test set of samples from the one remaining patient. I tested my classifier (a random forest) on that one patient's data and stored the accuracy. I repeated this process for every patient (22 times) and calculated the mean accuracy across all patients. The result is around 55%. I then tested my classifier on the remaining 20% of the whole dataset and got an accuracy of around 77%.

# Train Data (80%) and Test Data (20%)

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf_gini = RandomForestClassifier(criterion="gini",
                                  random_state=90, max_depth=8,
                                  min_samples_leaf=10)
accuracy_list = []      # Accuracy of each of the 22 cross-validation folds

for patient in patients:    # Iterate over every patient (22x)

    # loocv() splits the training data (80% of the whole data) into a
    # train set (21/22 patients) and a test set (1/22 patients) for this fold
    x_train, y_train, x_test, y_test = loocv(train_data, patient, num_feat)

    clf_gini.fit(x_train, y_train)      # Train classifier on 21 patients
    y_pred = clf_gini.predict(x_test)   # Predict classes of the held-out patient
    accuracy = accuracy_score(y_test, y_pred)   # Accuracy on the held-out patient
    accuracy_list.append(accuracy)

# After the loop, clf_gini holds the model fitted in the last fold only
y_pred_test = clf_gini.predict(test_data[:, 1:-1])  # Apply classifier to the 20% test set
mean_accuracy = np.mean(accuracy_list)              # Mean accuracy over all 22 folds
cal_accuracy(y_pred_test, test_data[:, 0])    # Prints confusion matrix, test accuracy, report

Is it to be expected that my cross-validation performs poorly compared to the accuracy on the test set? Is the way I approached this feasible, or is it nonsense? Any input would be very much appreciated!
