Train/test split preserving class proportions in each split


Suppose I wanted to split my NER dataset that looks like this:

Data: "Jokowi is the president of Indonesia"
Label: ['B-Person', 'O', 'O', 'O', 'O', 'B-Country']

Is there a Python library or algorithm that ensures the class distribution is the same in the train and test datasets? Any suggestions would be appreciated.

2

There are 2 answers

0
Praks On

You can explore StratifiedShuffleSplit, available in the scikit-learn library.
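If a single train/test split is all that's needed, a related option is the stratify argument of scikit-learn's train_test_split. The following is only a minimal sketch: the toy arrays X and y and the 80/20 class imbalance are made up for illustration and are not taken from the question's NER data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 2 features, 80% class 0 and 20% class 1
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 2))
y = np.array([0] * 800 + [1] * 200)

# stratify=y asks for the class proportions of y to be kept in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of class 1 in each split (the mean of a 0/1 label array)
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"class 1 share: train={train_ratio:.3f}, test={test_ratio:.3f}")
```

Both shares should come out close to the 20% of the full label array, up to rounding.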

4
yatu On

You have sklearn's StratifiedShuffleSplit to do exactly that. From the docs:

The folds are made by preserving the percentage of samples for each class.

StratifiedShuffleSplit.split returns a generator that yields the train and test indices for each split of your data. Here's a sample use case, showing that the class proportions are indeed preserved in each split:

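import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 1200 samples, 2 features, imbalanced binary labels (about 80% / 20%)
X = np.random.randint(0, 5, (1200, 2))
y = np.random.choice([0, 1], size=(1200,), p=[0.8, 0.2])

# Take the first stratified split (80% train, 20% test)
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))

# Plot the class counts of each split side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for split, title, ax in zip([train_index, test_index],
                            ['Train split', 'Test split'],
                            axes.flatten()):
    # Pass the labels as x= (keyword), as recent seaborn versions expect
    sns.countplot(x=y[split], ax=ax).set_title(title)
plt.show()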

[Figure: count plots of the class labels in the train split and the test split]
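The same check can be done numerically instead of with a plot. This is a rough sketch under the same toy setup; class_fractions is a hypothetical helper introduced here for illustration and is not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data mirroring the example above: imbalanced binary labels
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1200, 2))
y = rng.choice([0, 1], size=1200, p=[0.8, 0.2])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))

def class_fractions(labels):
    """Hypothetical helper: share of each class (0 and 1) in a label array."""
    counts = np.bincount(labels, minlength=2)
    return counts / counts.sum()

full = class_fractions(y)
train = class_fractions(y[train_index])
test = class_fractions(y[test_index])

print("full :", full)
print("train:", train)
print("test :", test)
```

The three printed rows should agree up to rounding, since each class is split in the same 80/20 ratio as the whole dataset.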