Objective
I want to perform k-fold cross-validation, but instead of using k-1 folds for training and 1 fold for testing, I want to fix the number of training samples, exactly like train_test_split's train_size, and use the remainder as test data.
To be precise, I have a binary classification dataset, and I want 10 training instances of each class in every round of cross-validation.
Expected Function
Let's say I want to do 5-fold CV:
cross_val_score(estimator=my_model, X=X, y=y, cv=5, train_size=20)
And of course in this case my X, y should have >= 100 instances.
My Attempt
Well I just built them manually. The closest I can get is iterating:
for _ in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=20, stratify=y)
But this picks the data at random, so two training sets may overlap, and it doesn't plug into the cv parameter.
Note
Yes, this means some of the data will never be used for training, but that is what I want to achieve in my current work.
Is there any Python function that provides this functionality?
You can still use KFold, but with additional logic.
Determine the amount of test data: test_amount = total_amount * test_size.
Determine the number of splits: n_splits = total_amount // test_amount.
Use KFold:
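The steps above could be sketched roughly as follows. This is only one possible reading, not a definitive implementation: it assumes StratifiedKFold in place of plain KFold (to keep 10 instances per class in each small fold), and it swaps the roles of the two index arrays so the small fold is used for training and the rest for testing. The helper name small_train_cv_scores, the toy data, and the LogisticRegression model are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def small_train_cv_scores(model, X, y, train_size, n_rounds, seed=0):
    """Hypothetical helper: train on disjoint folds of about `train_size` rows."""
    # Each fold holds roughly total_amount / n_splits rows, i.e. about train_size.
    n_splits = len(y) // train_size
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    # split() yields (large part, small fold); the roles are swapped here
    # (an assumption) so that the small fold becomes the training set.
    for rest_idx, fold_idx in list(skf.split(X, y))[:n_rounds]:
        model.fit(X[fold_idx], y[fold_idx])
        scores.append(model.score(X[rest_idx], y[rest_idx]))
    return np.array(scores)


# Toy balanced binary dataset: 100 rows, 50 per class (illustrative only).
rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
X = rng.normal(size=(100, 5))
X[:, 0] += 2 * y  # give the first feature some signal

scores = small_train_cv_scores(LogisticRegression(), X, y, train_size=20, n_rounds=5)
print(scores)
```

Because the folds produced by StratifiedKFold are disjoint, no two training sets share a row. If the dataset has more than train_size * n_rounds rows, the extra folds are simply never used for training, which matches the note in the question.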