I have a working classifier with a dataset split into a train set (70%) and a test set (30%).
However, I'd like to implement a validation set as well (so that: 70% train, 20% validation and 10% test). The sets should be randomly chosen and the results should be averaged over 10 different assignments.
Any ideas how to do this? Below is my implementation using a train and test set only:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

def classifier(samples):
    # load the dataset
    dataset = samples
    # hold out 30% of the data for testing
    data_train, data_test, target_train, target_test = train_test_split(
        dataset["data"], dataset["target"], test_size=0.30, random_state=42)
    # fit a k-nearest neighbor model to the training data
    model = KNeighborsClassifier()
    model.fit(data_train, target_train)
    print(model)
    # make predictions on the test set
    expected = target_test
    predicted = model.predict(data_test)
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print(metrics.confusion_matrix(expected, predicted))
For what you're describing, you just need to use train_test_split, followed by a second split on its results. Adapting the tutorial there, start with something like this:
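A minimal sketch of that setup (assuming the iris dataset from sklearn.datasets as a stand-in for your own samples dict):

from sklearn.model_selection import train_test_split
from sklearn import datasets

# any dataset with a data/target pair will do; iris is just for illustration
iris = datasets.load_iris()
X, y = iris.data, iris.target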
Then, just like there, make the initial train/test partition:
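Something along these lines, holding out 10% for the final test set (the random_state is only there to make the example reproducible):

# keep 10% aside as the final test set; the remaining 90% is train + validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)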
Now you just need to split that 0.9 train portion into two more parts:
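For a 70/20/10 overall split, the validation part has to take 2/9 of the remaining 90% (a sketch; the variable names are just mine):

# carve a 20%-of-total validation set out of the 90% train portion (2/9 of it)
X_train2, X_val, y_train2, y_val = train_test_split(
    X_train, y_train, test_size=2/9, random_state=0)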
If you want 10 random train/validation CV sets, repeat the last line 10 times (the resulting sets will overlap).
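For example, something like the loop below (the KNN model and accuracy score are only placeholders for whatever you actually want to average):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

scores = []
for i in range(10):
    # a different random train/validation assignment on each pass
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=2/9, random_state=i)
    model = KNeighborsClassifier().fit(X_tr, y_tr)
    scores.append(accuracy_score(y_val, model.predict(X_val)))
print(np.mean(scores))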
Alternatively, you could replace the last line with 10-fold cross-validation (see the relevant scikit-learn classes).
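For instance, with KFold (again just a sketch; cross_val_score could also wrap the fit/score loop for you):

from sklearn.model_selection import KFold

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_train):
    # each fold uses 9/10 of the train data for fitting, 1/10 for validation
    X_tr, X_val = X_train[train_idx], X_train[val_idx]
    y_tr, y_val = y_train[train_idx], y_train[val_idx]
    # fit your model on (X_tr, y_tr) and score it on (X_val, y_val) here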
The main point is to build the CV sets from the train part of the initial train/test partition.