I have the Iris data set (available here: https://www.kaggle.com/uciml/iris), which I need to split into a training set and a test set. However, I need to split it so that the class distribution in both the training and the test set is the same as in the complete data set.
I've seen the top answer to this question: "How to split a dataset into training and validation set keeping ratio between classes?", but since I'm new to both data science and Python, I am quite lost.
For the Iris data set, the first 50 rows are one kind of flower, the next 50 are a second kind, and the last 50 are a third kind. How do I write the split so that I get, e.g., 50% test data from each third? I can't really understand where and how they did this in the question linked above. If you could explain this like you would to a child, I would really appreciate it.
And does x_train represent the 4 different features of the flower and y_train the kind of flower we have?
Thank you in advance!
EDIT: I tried this
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=65)
but is this a fair way to do it? I kept trying different values of random_state until I got exactly 25 of each flower type in both the test and the training set (each type was always around 1/3, but with 65 the split was exact). This feels a little bit like cheating, though...
You can use StratifiedKFold here: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
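A rough sketch of how StratifiedKFold could be used for this, assuming the data is loaded with scikit-learn's bundled load_iris rather than the Kaggle CSV (the variable names x, y and fold_test_counts are just for illustration). With n_splits=2, each fold holds out half of the data while keeping the class proportions:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

# Assumption: use scikit-learn's bundled copy of Iris instead of the Kaggle CSV.
iris = load_iris()
x = iris.data    # shape (150, 4): the four measurements of each flower
y = iris.target  # shape (150,): the kind of flower, encoded as 0, 1 or 2

# Two folds: each fold uses one half for testing and the other half for training.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

fold_test_counts = []
for train_idx, test_idx in skf.split(x, y):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Count how many flowers of each kind ended up in this fold's test set.
    fold_test_counts.append(np.bincount(y_test).tolist())

print(fold_test_counts)
```

StratifiedKFold is mainly meant for cross-validation, so if you only need one train/test split it is probably more than you need.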
Also, train_test_split has a stratify parameter: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
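As a minimal sketch of the stratify approach (again assuming the data comes from scikit-learn's bundled load_iris rather than the Kaggle CSV, and with random_state=0 chosen arbitrarily), passing stratify=y tells train_test_split to keep the class proportions of y in both halves, so there is no need to hunt for a lucky random_state:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled copy of Iris instead of the Kaggle CSV.
iris = load_iris()
x = iris.data    # the 4 features of each flower (sepal/petal length and width)
y = iris.target  # the kind of flower, encoded as 0, 1 or 2

# stratify=y keeps the same share of each flower kind in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, stratify=y, random_state=0
)

# Count the flowers of each kind in the two splits.
train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts, test_counts)
```

And yes: x_train holds the four features of the training flowers, and y_train holds the matching kinds of flower. Here random_state only makes the shuffle reproducible; the 25/25/25 balance comes from stratify, not from the seed.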
Ping me if you would like me to walk through it in more detail.