Cross-validation with a specified number of training samples?


Objective

I want to perform k-fold cross-validation, but instead of using k-1 folds for training and the remaining fold for testing, I want to fix the number of training samples myself, exactly like the train_size argument of train_test_split. The remainder would then be used as test data.

To be precise, I have a binary classification dataset, and I want each training set to contain 10 instances of each class.

Expected Function

Let's say I want to do 5-fold CV:

cross_val_score(my_model, X, y, cv=5, train_size=20)  # train_size is the parameter I wish existed

Of course, in this case X and y must contain at least 100 instances (5 folds of 20 training instances each).

My Attempt

So far I have just built the splits manually. The closest I can get is a loop:

for _ in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=20, stratify=y)

But this picks the data at random on every iteration, so two training sets may overlap, and it does not plug into the cv argument.

Note

Yes, this means some of the data may never be used for training, but that is what I want to achieve in my current work.

Is there any Python function that provides this functionality?

Answer by Danylo Baibak

You can still use KFold, but with additional logic.

Determine the number of test samples: test_amount = int(total_amount * test_size).

Determine the number of splits: n_splits = total_amount // test_amount. Note that KFold requires n_splits to be an integer of at least 2.

Use KFold:

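from sklearn.model_selection import KFold

kf = KFold(n_splits=n_splits)
for train_index, test_index in kf.split(X):
    # X and y are assumed to be NumPy arrays (use .iloc for pandas objects)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]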
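
Each KFold split returns the larger part (k-1 folds) as the training indices, so on its own the loop above fixes the test size rather than the training size. One possible way to get a fixed, small training set is to swap the two roles and train on the single held-out fold. The sketch below is only an illustration under a few assumptions that are not in the question: a synthetic, perfectly balanced dataset of 100 samples, LogisticRegression standing in for my_model, and StratifiedKFold instead of KFold so that each training fold holds 10 instances per class. It relies on cross_val_score accepting an iterable of (train, test) index pairs for cv.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic balanced binary dataset (illustrative assumption): 50 samples per class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 4)) + y[:, None]

train_size = 20                    # desired number of training samples per split
n_splits = len(y) // train_size    # 100 // 20 = 5 disjoint training sets

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

# StratifiedKFold yields (large part, small part); swap them so the
# small stratified fold is used for training and the rest for testing.
splits = [(small, large) for large, small in skf.split(X, y)]

# Stand-in estimator (assumption); any scikit-learn classifier could be used.
scores = cross_val_score(LogisticRegression(), X, y, cv=splits)
print(scores)
```

Because the folds of a k-fold partition are disjoint, no sample appears in more than one training set. If the dataset size is not an exact multiple of train_size, the fold sizes will differ slightly from the requested value.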