Splitting data set into training and test data, keeping the ratio

Question

3.6k views Asked by user10411263 At 02 October 2018 at 14:18

I have the Iris data set (available here: https://www.kaggle.com/uciml/iris), which I need to split into a training set and a test set. However, I need to split it so that the class distribution in the training and test sets is the same as in the complete data set.

I've seen the top answer to this question: "How to split a dataset into training and validation set keeping ratio between classes?", but since I'm new to both data science and Python, I am quite lost.

For the Iris data set, the first 50 rows are one kind of flower, the next 50 are a second kind, and the last 50 are a third kind. How do I write the code so that I get, e.g., 50% test data from each third? I can't really understand where and how they did this in the question linked above. If you could explain this like you would to a child, I would really appreciate it.

And does x_train represent the 4 different features of the flower and y_train the kind of flower we have?

Thank you in advance!

EDIT: I tried this

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=65)

but is this a fair way to do it? I kept trying different values for random_state until I got exactly 25 of each flower type in the test and training sets (each type was always around 1/3, but with 65 I got it exact). This feels a little bit like cheating, though...

There are 2 answers

murat yalçın On 08 October 2018 at 17:50

sklearn.model_selection.train_test_split has shuffle and stratify parameters.

The defaults are shuffle=True and stratify=None.

If you are dealing with regression, train_test_split will shuffle the data for you by default, and that is usually enough.

If you are dealing with classification and want the class ratios preserved in both sets, you need to specify stratify=<your response variable> (in the question, that is y).
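As a minimal sketch of the stratified split described above: this assumes scikit-learn's bundled copy of Iris (load_iris) instead of the Kaggle CSV from the question, and random_state=0 is an arbitrary choice made only for reproducibility. Here x holds the 4 features of each flower and y holds the kind of flower.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's built-in Iris data rather than the Kaggle CSV.
# x has shape (150, 4): the four measurements per flower.
# y has shape (150,): the class label (0, 1 or 2) of each flower.
x, y = load_iris(return_X_y=True)

# stratify=y tells train_test_split to keep the class proportions of y
# in both the training and the test set, whatever random_state is used.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, stratify=y, random_state=0
)

# Count how many flowers of each class ended up in each set.
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print("train:", sorted(train_counts.items()))
print("test: ", sorted(test_counts.items()))
```

With stratify=y there should be no need to hunt for a lucky random_state: any seed gives 25 flowers of each class in both halves, because each class has 50 rows and the split is 50/50.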

For more info, please check the scikit-learn documentation for train_test_split.

Thanks

**andrewchauzov** · Accepted Answer · 2018-10-07T05:24:13+00:00

You can use StratifiedKFold here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html

Also, train_test_split has a stratify parameter: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
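For the StratifiedKFold route, a rough sketch could look like the following. It assumes scikit-learn's built-in load_iris data rather than the Kaggle CSV, and n_splits=2 and random_state=0 are illustrative choices, not something taken from the question.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle file.
x, y = load_iris(return_X_y=True)

# Two folds means each fold uses half of the data as the test set.
# StratifiedKFold keeps the class proportions of y in every fold.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(x, y):
    # split() yields row indices; use them to slice features and labels.
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    counts = (Counter(y_train.tolist()), Counter(y_test.tolist()))
    fold_counts.append(counts)
    print("train:", sorted(counts[0].items()), "test:", sorted(counts[1].items()))
```

Unlike train_test_split, this produces several train/test pairs (one per fold), which is mostly useful for cross-validation. For a single split, the stratify parameter of train_test_split is the simpler option.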

Ping me if you need me to describe it with an example.
