(Stratified) KFold vs. train_test_split - What training data is used?


I am a beginner in ML and am trying to understand what exactly the advantage of (Stratified) KFold over the classic train_test_split is.

The classic train_test_split uses exactly one part for training (in this case 75%) and one part for testing (in this case 25%). Here I know exactly which data points are used for training and for testing (see the code below).

When splitting with (Stratified) KFold we use 4 splits, with the result that we have 4 different train/test parts. It is not clear to me which of the 4 parts will be used for training/testing the Logistic Regression. Does it make any sense to set up the split this way? As far as I understand it, the advantage of (Stratified) KFold is that you can use all of the data for training. How would I have to change the code to achieve this?

Creating Data

import pandas as pd
import numpy as np
target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'col_a': np.random.random(25),
                   'target': target})
df

train_test_split


from sklearn.model_selection import train_test_split

X = df.col_a
y = df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
print("TRAIN:", X_train.index, "TEST:", X_test.index)

Output:
TRAIN: Int64Index([1, 13, 8, 9, 21, 12, 10, 4, 20, 19, 7, 5, 15, 22, 24, 17, 11, 23], dtype='int64')
TEST: Int64Index([2, 6, 16, 0, 14, 3, 18], dtype='int64')

Stratified KFold

from sklearn.model_selection import StratifiedKFold

X = df.col_a
y = df.target

skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    # split() yields positional indices, so .iloc is the right accessor
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)

Output: 
TRAIN: [ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 22 23 24] TEST: [ 0  1  2  3  4 20 21]
TRAIN: [ 0  1  2  3  4 10 11 12 13 14 15 16 17 18 19 20 21 23 24] TEST: [ 5  6  7  8  9 22]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 15 16 17 18 19 20 21 22 24] TEST: [10 11 12 13 14 23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 21 22 23] TEST: [15 16 17 18 19 24]

Using Logistic Regression

from sklearn.linear_model import LogisticRegression

# a pandas Series has no .reshape; convert to a 2-D NumPy array first
# (note: X_train/X_test here still hold only the last fold from the loop above)
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)

clf = LogisticRegression()

clf.fit(X_train, y_train)
clf.predict(X_test)

There are 2 answers

Answer by Prateek

To begin with, they both do the same thing, but how they do it is what makes the difference.

Test-train split:

Test-train split randomly splits the data into test and train sets. There are no rules except the percentage split.

You will only have one training set to train on and one test set to evaluate the model on.

K-fold:

The data is split into k folds; each fold is used once as the test set while the remaining k-1 folds form the training set, so the model is trained and evaluated k times (see the sketch below). The only rule here is the number of folds.
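To answer the asker's question directly: you do not pick one of the k parts yourself. A helper such as cross_val_score fits the model k separate times, once per split, so every sample is used for training in k-1 rounds and for testing in exactly one. A minimal sketch reusing the asker's df:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = df[['col_a']]  # 2-D feature matrix, so no reshaping is needed
y = df['target']

# one fit and one score per fold: each sample is tested exactly once
# and used for training in the other three folds
scores = cross_val_score(LogisticRegression(), X, y,
                         cv=StratifiedKFold(n_splits=4))
print(scores, scores.mean())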

Splitting the data randomly can cause class misrepresentation, i.e. one or more of the target classes is represented more heavily in the train or test split than the others. This can bias the training of the model.

To prevent this, the train and test splits must have the same proportions of the target classes. This can be achieved by using StratifiedKFold, as the sketch below illustrates.
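You can verify this on the asker's data by counting the classes in each test fold; np.bincount is just one way to do the counting:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = df['target'].to_numpy()

for name, cv in [('KFold', KFold(n_splits=4)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=4))]:
    print(name)
    for train_idx, test_idx in cv.split(df[['col_a']], y):
        # counts of class 0 and class 1 in each test fold; stratification
        # keeps the 20:5 ratio of the full data roughly intact per fold
        print('  test fold class counts:', np.bincount(y[test_idx].astype(int)))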

Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

If you like watching videos (watch from ~4:30): https://youtu.be/gJo0uNL-5Qw

Side note: If you are trying to get better training using k-fold, then combining StratifiedKFold with GridSearchCV could help; a sketch follows.
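A minimal sketch of that combination; the parameter grid here is purely illustrative (C is LogisticRegression's inverse regularization strength):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {'C': [0.1, 1.0, 10.0]}  # hypothetical grid

# every candidate is evaluated with the same stratified 4-fold scheme
search = GridSearchCV(LogisticRegression(), param_grid,
                      cv=StratifiedKFold(n_splits=4))
search.fit(df[['col_a']], df['target'])
print(search.best_params_, search.best_score_)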

Answer by Yana

train_test_split has a parameter called stratify. Setting it to the labels makes sure the classes are represented in the same proportions in both the train and the test data. As per the documentation:

If not None, data is split in a stratified fashion, using this as the class labels.
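Applied to the asker's data, that looks like this; random_state is only added to make the split reproducible:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=True, stratify=y, random_state=0)

# both splits now roughly preserve the 20:5 class ratio of the full data
print(y_train.value_counts())
print(y_test.value_counts())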