Using Cross-Validation on a Scikit-Learn Classifier


I have a working classifier with a dataset split in a train set (70%) and a test set (30%).

However, I'd like to implement a validation set as well (so that the data is split 70% train, 20% validation, and 10% test). The sets should be chosen randomly, and the results should be averaged over 10 different assignments.

Any ideas how to do this? Below is my implementation using a train and test set only:

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def classifier(samples):
    # load the dataset
    dataset = samples

    # hold out 30% of the data for testing
    data_train, data_test, target_train, target_test = train_test_split(
        dataset["data"], dataset["target"], test_size=0.30, random_state=42)

    # fit a k-nearest neighbors model to the training data
    model = KNeighborsClassifier()
    model.fit(data_train, target_train)
    print(model)

    # make predictions on the test set
    expected = target_test
    predicted = model.predict(data_test)

    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))

There are 2 answers

Ami Tavory (accepted answer)

For what you're describing, you just need to use train_test_split followed by a second split on its results.

Adapting the scikit-learn tutorial, start with something like this:

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)
# (150, 4) (150,)

Then, as in the tutorial, make the initial train/test partition:

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.1, random_state=0)

Now you just need to split the remaining 90% (the train part) into two more parts:

# 0.2 / 0.9 of the train part, i.e. 20% of the full dataset, goes to validation
X_train_final, X_val, y_train_final, y_val = train_test_split(
    X_train, y_train, test_size=0.2/0.9)

If you want 10 random train/validation splits, repeat the last line 10 times (the resulting sets will overlap across repetitions), as in the sketch below.
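A minimal sketch of that loop, continuing from the X_train/y_train split above. The KNeighborsClassifier and the score averaging are assumptions carried over from the question, not part of the original answer:

from sklearn.neighbors import KNeighborsClassifier
import numpy as np

scores = []
for i in range(10):
    # a fresh random train/validation assignment on each iteration
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2/0.9, random_state=i)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(model.score(X_val, y_val))

# average validation accuracy over the 10 random assignments
print(np.mean(scores))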

Alternatively, you could replace the last line with 10-fold cross-validation (see KFold in sklearn.model_selection):
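A sketch of that alternative, again run on the train part only (the classifier choice is an assumption carried over from the question):

from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X_train):
    # train on 9 folds, validate on the held-out fold
    model = KNeighborsClassifier().fit(X_train[train_idx], y_train[train_idx])
    scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(np.mean(scores))  # mean accuracy across the 10 folds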

The main point is to build the CV sets from the train part of the initial train/test partition.

Lenwood

For k-fold cross-validation (note that this k is not the same as the k in your kNN classifier), divide your training set into k sections; let's say 5 as a starting point. You'll create 5 models on your training data, each one tested against the held-out portion. This means your model will have been both trained and tested against every data point in your training set. Wikipedia has a much more detailed description of cross-validation than I've given here.
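As a concrete sketch, scikit-learn's cross_val_score handles that fold bookkeeping for you. The KNeighborsClassifier and the 70/30 split here are assumptions carried over from the question:

from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# 5-fold cross-validation on the training data only
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged across folds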

You can then evaluate against your validation set, adjust as necessary, and finally check against your held-out test set.

scikit-learn has well-documented utilities for this (see cross_val_score and KFold in sklearn.model_selection).