Train/test split preserving class proportions in each split

Question

Asked by Rifo Genadi on 14 October 2020 at 12:46 (3k views)

Suppose I wanted to split my NER dataset that looks like this:

Data: "Jokowi is the president of Indonesia"
Label: ['B-Person', 'O', 'O', 'O', 'O', 'Country']

Is there a Python library or algorithm that ensures the class distribution is the same in the train and test sets? Any suggestions would be appreciated.

There are 2 answers

**Praks** · Answer 1 · 2020-10-14T12:49:24+00:00

You can explore StratifiedShuffleSplit, available in the scikit-learn library.

**yatu** · Answer 2 · 2020-10-14T12:55:05+00:00

You have sklearn's StratifiedShuffleSplit to do exactly that. From the docs:

The folds are made by preserving the percentage of samples for each class.
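For intuition about what that sentence means, here is a rough pure-Python sketch of the idea: group the sample indices by class, shuffle each group, and send the same fraction of every group to the test set. The helper name `stratified_split` and the toy labels are made up for illustration, and this is not how scikit-learn implements it internally.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Illustrative helper (hypothetical, not scikit-learn's code)."""
    rng = random.Random(seed)

    # Group sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the same fraction for test
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 80 samples of class 0 and 20 samples of class 1
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels, test_size=0.2)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

With these toy labels, both splits keep the original 80/20 class ratio, because each class contributes the same fraction of its samples to the test set.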

The split method of StratifiedShuffleSplit returns a generator that yields, for each split, the train and test indices to use on your data. Here's a sample use case, making it clear that the class proportions are indeed preserved in each split:

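import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 1200 samples, binary labels with an 80/20 class imbalance
X = np.random.randint(0, 5, (1200, 2))
y = np.random.choice([0, 1], size=(1200,), p=[0.8, 0.2])

# split() yields (train_index, test_index) pairs; take the first one
sss = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))

# Plot the class counts of each split side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for split, title, ax in zip([train_index, test_index],
                            ['Train split', 'Test split'],
                            axes.flatten()):
    sns.countplot(x=y[split], ax=ax).set_title(title)
plt.show()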
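
The plots show the proportions visually. The same check can also be done numerically. The sketch below assumes one integer label per sample, as in the toy data above (for NER data with a label sequence per sentence, you would first have to decide on a single stratification label per sentence). The helper `class_fractions` is a made-up name for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (assumed): 1200 samples, binary labels drawn roughly 80/20
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1200, 2))
y = rng.choice([0, 1], size=1200, p=[0.8, 0.2])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, y))


def class_fractions(labels):
    """Return the fraction of samples in each of the two classes."""
    counts = np.bincount(labels, minlength=2)
    return counts / counts.sum()


full_frac = class_fractions(y)
train_frac = class_fractions(y[train_index])
test_frac = class_fractions(y[test_index])

print("full :", full_frac)
print("train:", train_frac)
print("test :", test_frac)
```

The three printed rows should agree up to rounding, since each class can only be divided into whole samples.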
