Splitting data set into training and test data, keeping the ratio


I have the Iris data set (available here: https://www.kaggle.com/uciml/iris ), which I need to split into a training set and a test set. However, the split must keep the class distribution in both the training and the test set the same as in the complete data set.

I've seen the top answer to this question: how to split a dataset into training and validation set keeping ratio between classes? But since I'm new to both data science and Python, I am quite lost.

In the Iris data set, the first 50 rows are one kind of flower, the next 50 are a second kind, and the last 50 are a third kind. How do I write the code so that I get, e.g., 50% test data from each third? I can't really see where and how this was done in the question linked above. If you could explain it like you would to a child, I would really appreciate it.

Also, does x_train hold the 4 different features of each flower and y_train the kind of flower?

Thank you in advance!

EDIT: I tried this

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=65)

but is this a fair way to do it? I kept trying different values of random_state until I got exactly 25 of each flower type in both the test and the training set (each type was always around 1/3, but with 65 the split was exact). This feels a little bit like cheating, though.


There are 2 answers

1
andrewchauzov On BEST ANSWER
0
murat yalçın On

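sklearn.model_selection.train_test_split has shuffle and stratify parameters.

By default, shuffle=True and stratify=None.

If you are dealing with regression, the defaults are enough: train_test_split shuffles the data for you before splitting.

If you are dealing with classification and want both sets to keep the class proportions of the full data set, you need to specify stratify=<your response variable> (in the question's code, stratify=y). There is then no need to search for a lucky random_state.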
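As a minimal sketch of the stratified split described above: it assumes scikit-learn's bundled copy of Iris (load_iris) as a stand-in for the Kaggle CSV, and random_state=0 is an arbitrary choice used only to make the run repeatable. The names x and y follow the question: x holds the 4 measurements per flower, and y holds the kind of flower.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the Kaggle CSV.
# x: 150 rows with 4 feature columns, y: class label (0, 1 or 2) per row.
x, y = load_iris(return_X_y=True)

# stratify=y asks train_test_split to keep the class proportions of y
# in both the training half and the test half.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, stratify=y, random_state=0
)

# Count how many flowers of each kind ended up in each half.
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", train_counts)
print("test: ", test_counts)
```

Since each class has 50 rows and test_size=0.5, each kind of flower should appear 25 times in both halves whatever random_state is used; the seed only changes which rows are picked.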

For more info please check the documentation

Thanks