Split Train Test Data sets keeping like values together


I have a data set of animal types with IDs, and I want to split it into train and test sets. I also want all rows for a given animal and ID to end up in either the train set or the test set, never both. An example of the data is below, with a random train/test split ratio of 80/20.

Animal  ID  Test/Train
CAT 1   TRAIN
CAT 1   TRAIN
CAT 2   TRAIN
CAT 2   TRAIN
CAT 3   TRAIN
CAT 3   TEST
CAT 4   TRAIN
CAT 4   TRAIN
CAT 5   TEST
CAT 5   TRAIN
DOG 1   TRAIN
DOG 1   TRAIN
DOG 2   TRAIN
DOG 2   TRAIN
DOG 3   TRAIN
DOG 3   TRAIN
DOG 4   TEST
DOG 4   TEST
DOG 5   TRAIN
DOG 5   TRAIN

Note how CAT with ID 3 and CAT with ID 5 exist in both the train and test sets. Is there a function in scikit-learn, or an option of train_test_split, that keeps all like values in a column within the same train or test set while maintaining the test ratio? So if CAT with ID 3 has one row flagged as train data, then every other row with CAT and ID 3 would also be flagged as train data.

2

There are 2 answers

1
Aditya Jha On

Did you pass the stratify parameter to train_test_split? If so, remove it and check again.

1
Davide Pietrasanta On

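GroupShuffleSplit from sklearn.model_selection does what you are asking: it keeps every row that shares a group label on the same side of the split. Note that test_size is applied to the number of groups, not the number of rows, so the row-level ratio is only approximately 80/20.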

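from sklearn.model_selection import GroupShuffleSplit

# One label per (Animal, ID) pair, so CAT 1 and DOG 1 are separate groups
groups = df.groupby(['Animal', 'ID']).ngroup()

# Only the first split is used, so n_splits=1 is enough
splitter = GroupShuffleSplit(test_size=0.2, n_splits=1, random_state=7)
train_inds, test_inds = next(splitter.split(df, groups=groups))

train = df.iloc[train_inds]
test = df.iloc[test_inds]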
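
To check the result against the sample data from the question, here is a minimal, self-contained sketch. It assumes the columns are named Animal and ID as in the table, and it groups on the (Animal, ID) pair rather than on ID alone so that CAT 1 and DOG 1 are not tied together. The Split column name is only illustrative.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Rebuild the question's table: 2 animals x 5 IDs x 2 rows each = 20 rows.
df = pd.DataFrame(
    [(animal, i) for animal in ("CAT", "DOG") for i in range(1, 6) for _ in range(2)],
    columns=["Animal", "ID"],
)

# One integer label per (Animal, ID) pair (10 groups in total).
groups = df.groupby(["Animal", "ID"]).ngroup()

# test_size is a proportion of groups: 0.2 of 10 groups puts 2 groups in test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=7)
train_idx, test_idx = next(splitter.split(df, groups=groups))

# Flag each row; "Split" is a hypothetical column name for this example.
df["Split"] = "TRAIN"
df.iloc[test_idx, df.columns.get_loc("Split")] = "TEST"

# Every (Animal, ID) pair should carry exactly one split label.
labels_per_group = df.groupby(["Animal", "ID"])["Split"].nunique()
leak_free = bool((labels_per_group == 1).all())

print(df["Split"].value_counts())
print("leak-free:", leak_free)
```

Because every group here has the same number of rows, the row-level split also comes out at 80/20. With groups of uneven size, the ratio of rows can drift from test_size, so it is worth checking the resulting counts.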