I have a data set of animal types with IDs, and I want to split it into Train/Test data sets. I also want all records for a given animal and ID to end up in either the Train or the Test data set, never both. An example of the data is below, with a random Train/Test split ratio of 80/20.
Animal ID Test/Train
CAT 1 TRAIN
CAT 1 TRAIN
CAT 2 TRAIN
CAT 2 TRAIN
CAT 3 TRAIN
CAT 3 TEST
CAT 4 TRAIN
CAT 4 TRAIN
CAT 5 TEST
CAT 5 TRAIN
DOG 1 TRAIN
DOG 1 TRAIN
DOG 2 TRAIN
DOG 2 TRAIN
DOG 3 TRAIN
DOG 3 TRAIN
DOG 4 TEST
DOG 4 TEST
DOG 5 TRAIN
DOG 5 TRAIN
Note how CAT with ID 3 and CAT with ID 5 each appear in both the Train and Test data sets. Is there a function within scikit-learn, or an option of train_test_split, that keeps all rows sharing the same values in a column within the same train/test data set while maintaining the test ratio? For example, if one CAT record with ID 3 is flagged as Train, then every other record with CAT and ID 3 should also be flagged as Train.
Did you set the stratify parameter? If so, remove it and check again.
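As far as I know, train_test_split has no argument for grouping rows, so one option is GroupShuffleSplit from sklearn.model_selection, which splits by group rather than by row. Below is a minimal sketch, not a definitive solution. The DataFrame rebuilds the example from the question, and the column names "Animal", "ID" and "Split", the combined group key, and random_state=42 are all assumptions made for illustration. Note that test_size applies to the number of groups, so the row-level ratio is only exactly 80/20 when every group has the same number of rows, as it does here.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Rebuild the example data: 2 animals x 5 IDs x 2 rows each (20 rows).
df = pd.DataFrame({
    "Animal": ["CAT"] * 10 + ["DOG"] * 10,
    "ID": [i for i in range(1, 6) for _ in range(2)] * 2,
})

# Assumption: a group is the (Animal, ID) pair, so build one key per row.
groups = df["Animal"] + "_" + df["ID"].astype(str)

# test_size=0.2 is applied to the groups (10 groups -> 2 test groups).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=groups))

# Flag each row; rows of the same group always get the same flag.
df["Split"] = "TRAIN"
df.loc[df.index[test_idx], "Split"] = "TEST"

print(df)
```

If the groups differ a lot in size, the row-level ratio can drift from 80/20, so it is worth checking df["Split"].value_counts() after the split.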