I am a beginner in ML and am trying to understand what exactly the advantage of (Stratified) KFold is over the classic train_test_split.
The classic train_test_split uses exactly one part for training (in this case 75%) and one part for testing (in this case 25%). Here I know exactly which data points are used for training and testing (see code).
When splitting with (Stratified) KFold we use 4 splits, so we end up with 4 different training/test parts. It is not clear to me which of the 4 parts will be used for training/testing the Logistic Regression. Does it make any sense to set up the split this way? As far as I understand it, the advantage of (Stratified) KFold is that you can use all data for training. How would I have to change the code to achieve this?
Creating Data
import pandas as pd
import numpy as np
target = np.ones(25)
target[-5:] = 0
df = pd.DataFrame({'col_a': np.random.random(25),
                   'target': target})
df
train_test_split
from sklearn.model_selection import train_test_split
X = df.col_a
y = df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=True)
print("TRAIN:", X_train.index, "TEST:", X_test.index)
Output:
TRAIN: Int64Index([1, 13, 8, 9, 21, 12, 10, 4, 20, 19, 7, 5, 15, 22, 24, 17, 11, 23], dtype='int64')
TEST: Int64Index([2, 6, 16, 0, 14, 3, 18], dtype='int64')
Stratified KFold
from sklearn.model_selection import StratifiedKFold
X = df.col_a
y = df.target
skf = StratifiedKFold(n_splits=4)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.loc[train_index], X.loc[test_index]
    y_train, y_test = y.loc[train_index], y.loc[test_index]
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
TRAIN: [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 22 23 24] TEST: [ 0 1 2 3 4 20 21]
TRAIN: [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 23 24] TEST: [ 5 6 7 8 9 22]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 24] TEST: [10 11 12 13 14 23]
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23] TEST: [15 16 17 18 19 24]
Using Logistic Regression
from sklearn.linear_model import LogisticRegression
# A pandas Series has no reshape(); convert to a NumPy array first.
X_train = X_train.to_numpy().reshape(-1, 1)
X_test = X_test.to_numpy().reshape(-1, 1)
clf = LogisticRegression()
clf.fit(X_train, y_train)
clf.predict(X_test)
To begin with, both produce train and test sets; the difference lies in how many splits they create and how those splits are built.
Test-train split:
train_test_split randomly splits the data into one training set and one test set. By default there are no rules except the percentage split.
You get only one training set to fit the model on and one test set to evaluate it on.
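To illustrate what "only one split" means in practice, here is a minimal sketch that repeats the single split with different random_state values and scores each one. The synthetic data mirrors the setup in the question; the seeds and the use of stratify=y (not in the original code, added here only so that both classes are guaranteed to appear in the training set) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data shaped like the question's: 20 ones, 5 zeros, one feature.
rng = np.random.default_rng(0)
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0

scores = []
for seed in range(5):  # hypothetical seeds, chosen only for the demo
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, shuffle=True,
        stratify=y,          # assumption: keeps both classes in each part
        random_state=seed,
    )
    clf = LogisticRegression().fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# Each entry comes from a different single split of the same data.
print(scores)
```

Any one of these numbers is what a single train_test_split would report; which one you see depends on how the rows happened to be split.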
K-fold:
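The data is split into k folds, and each fold is used once as the test set while the remaining folds form the training set, so you get k different train/test combinations. The main rule here is the number of folds (shuffling is optional and is off by default in scikit-learn's KFold).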
Splitting the data without regard to the target can cause class misrepresentation, i.e., one or more of the target classes are over-represented in a test or train split relative to the others. This can bias the training and evaluation of the model.
To prevent this, the test and train splits should have the same proportions of the target classes. This can be achieved by using StratifiedKFold.
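As for which of the 4 parts is used: all of them, one after another. A rough sketch, assuming the same kind of data as in the question (regenerated here with a fixed seed so the block is self-contained), fits a fresh model per fold and averages the scores. cross_val_score is shown as the shorter equivalent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data shaped like the question's: 20 ones, 5 zeros, one feature.
rng = np.random.default_rng(0)
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0

skf = StratifiedKFold(n_splits=4)

fold_scores = []
test_seen = []
for train_index, test_index in skf.split(X, y):
    # A new model is trained for every fold; nothing carries over.
    clf = LogisticRegression()
    clf.fit(X[train_index], y[train_index])
    fold_scores.append(clf.score(X[test_index], y[test_index]))
    test_seen.extend(int(i) for i in test_index)
    # Stratification: every test fold contains some of the rare class.
    print("zeros in test fold:", int((y[test_index] == 0).sum()))

# Every row was used for testing exactly once across the 4 folds.
print("rows tested:", len(test_seen))
print("mean accuracy over folds:", np.mean(fold_scores))

# The same procedure in one call.
cv_scores = cross_val_score(LogisticRegression(), X, y, cv=skf)
```

The result of cross-validation is an estimate of how well this kind of model performs, not a single trained model. Once you are satisfied with the estimate, fit the final model on all the data.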
Link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#:~:text=Stratified%20K-Folds%20cross-validator%20Provides%20train%2Ftest%20indices%20to%20split,preserving%20the%20percentage%20of%20samples%20for%20each%20class.
If you like watching videos (watch from ~4.30): https://youtu.be/gJo0uNL-5Qw
Side note: If your goal is a better-trained model rather than just an evaluation, combining StratifiedKFold with GridSearchCV (a cross-validated hyperparameter search) could help.
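A minimal sketch of that combination, assuming the same kind of data as in the question. The grid of C values is hypothetical and chosen only for illustration. With the default refit=True, GridSearchCV refits the best configuration on the full dataset at the end, which is how all the data ends up being used for training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data shaped like the question's: 20 ones, 5 zeros, one feature.
rng = np.random.default_rng(0)
X = rng.random((25, 1))
y = np.ones(25)
y[-5:] = 0

# Hypothetical grid: regularization strengths to compare.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(),
    param_grid,
    cv=StratifiedKFold(n_splits=4),  # each candidate is scored on 4 stratified folds
)
search.fit(X, y)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", search.best_score_)

# best_estimator_ was refit on all 25 rows and is ready for prediction.
predictions = search.best_estimator_.predict(X)
```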