I want to implement Apache Spark's ALS machine learning algorithm. I found that best model should be chosen to get best results. I have split the training data into three sets Training, Validation and Test
as suggest on forums.
I've found following code sample to train model on these sets.
val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
val model = ALS.train(training, rank, numIter, lambda)
val validationRmse = computeRmse(model, validation, numValidation)
if (validationRmse < bestValidationRmse) {
bestModel = Some(model)
bestValidationRmse = validationRmse
bestRank = rank
bestLambda = lambda
bestNumIter = numIter
}
}
val testRmse = computeRmse(bestModel.get, test, numTest)
This code trains model for each combination of rank
and lambda
and compares rmse (root mean squared error) with validation set
. These iterations gives a better model which we can say is represented by (rank,lambda)
pair. But it doesn't do much after that on test
set.
It just computes the rmse with `test' set.
My question is how it can be further tuned with test
set data.
No, one would never fine tune the model using test data. If you do that, it stops being your test data. I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6
Depending on your observation of the error values with validation data set, you might want to add/remove features, get more data or make changes in the model, or maybe even try a different algorithm altogether. If the cross-validation and the test rmse look reasonable, then you are done with the model and you could use it for the purpose (some prediction, I would assume) that made you build it in the first place.