I want to split data into stratified train, test, and validation datasets, but sklearn only provides cross_validation.train_test_split, which can only divide the data into 2 pieces. What should I do if I want to do this?
How can I split data into 3 or more parts with sklearn?
Asked by loseryao
There are 2 answers
You can also use train_test_split more than once to achieve this. The second time, run it on the training output from the first call to train_test_split.
from sklearn.model_selection import train_test_split


def train_test_validate_stratified_split(features, targets, test_size=0.2, validate_size=0.1):
    # Get test sets
    features_train, features_test, targets_train, targets_test = train_test_split(
        features,
        targets,
        stratify=targets,
        test_size=test_size
    )
    # Run train_test_split again to get train and validate sets.
    # validate_size is a fraction of the full data, so rescale it to a
    # fraction of what remains after the test set has been removed.
    post_split_validate_size = validate_size / (1 - test_size)
    features_train, features_validate, targets_train, targets_validate = train_test_split(
        features_train,
        targets_train,
        stratify=targets_train,
        test_size=post_split_validate_size
    )
    return features_train, features_test, features_validate, targets_train, targets_test, targets_validate
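
As a quick sanity check of this two-step idea, here is a minimal sketch on made-up data. The toy arrays, the 70/20/10 class imbalance, the random_state values, and the class_fractions helper are all assumptions for illustration, not part of the answer above. It applies the same two calls and compares the class proportions in each piece:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, 3 imbalanced classes (70% / 20% / 10%).
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 4))
targets = np.repeat([0, 1, 2], [700, 200, 100])

test_size, validate_size = 0.2, 0.1

# First split: hold out the test set, stratified on the labels.
features_rest, features_test, targets_rest, targets_test = train_test_split(
    features, targets, stratify=targets, test_size=test_size, random_state=0
)

# Second split: carve the validation set out of the remainder.
# validate_size is relative to the full data, so rescale it.
features_train, features_validate, targets_train, targets_validate = train_test_split(
    features_rest,
    targets_rest,
    stratify=targets_rest,
    test_size=validate_size / (1 - test_size),
    random_state=0,
)


def class_fractions(labels):
    """Return the share of each of the 3 classes in `labels`."""
    return np.bincount(labels, minlength=3) / len(labels)


for name, part in [("train", targets_train),
                   ("validate", targets_validate),
                   ("test", targets_test)]:
    print(name, len(part), class_fractions(part))
```

With these settings the three pieces should come out at roughly 70%, 10%, and 20% of the data, each with about the same class mix as the full label array.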
If you want a stratified train/test split, you can use StratifiedKFold in sklearn. Suppose X is your features and y are your labels, based on the example here:

Update: To split the data into, say, 3 different percentages, you can use numpy.split() like this:
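
A minimal sketch of the numpy.split() approach, assuming a 60/20/20 split and toy arrays X and y (hypothetical data, chosen only for illustration). Note that this shuffles the rows and cuts by position, so unlike the stratify= option of train_test_split it does not preserve class proportions:

```python
import numpy as np

# Hypothetical toy data: 50 samples with 2 features and binary labels.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Shuffle the row indices so the three parts are random samples.
rng = np.random.default_rng(0)
indices = rng.permutation(len(X))

# Cut points at 60% and 80% of the data give a 60/20/20 split.
train_end = int(round(0.6 * len(X)))
validate_end = int(round(0.8 * len(X)))
train_idx, validate_idx, test_idx = np.split(indices, [train_end, validate_end])

# Index features and labels with the same index arrays to keep rows aligned.
X_train, X_validate, X_test = X[train_idx], X[validate_idx], X[test_idx]
y_train, y_validate, y_test = y[train_idx], y[validate_idx], y[test_idx]

print(len(X_train), len(X_validate), len(X_test))
```

Passing a list of cut points to numpy.split() returns one sub-array per interval, so adding more cut points gives more than three parts.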