How to train Matrix Factorization Model in Apache Spark MLlib's ALS Using Training, Test and Validation datasets

1.2k views Asked by At

I want to implement Apache Spark's ALS machine learning algorithm. I found that best model should be chosen to get best results. I have split the training data into three sets Training, Validation and Test as suggest on forums.

I've found following code sample to train model on these sets.

val ranks = List(8, 12)
val lambdas = List(1.0, 10.0)
val numIters = List(10, 20)
var bestModel: Option[MatrixFactorizationModel] = None
var bestValidationRmse = Double.MaxValue
var bestRank = 0
var bestLambda = -1.0
var bestNumIter = -1
for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
  val model = ALS.train(training, rank, numIter, lambda)
  val validationRmse = computeRmse(model, validation, numValidation)
  if (validationRmse < bestValidationRmse) {
    bestModel = Some(model)
    bestValidationRmse = validationRmse
    bestRank = rank
    bestLambda = lambda
    bestNumIter = numIter
  }
}

val testRmse = computeRmse(bestModel.get, test, numTest)

This code trains model for each combination of rank and lambda and compares rmse (root mean squared error) with validation set. These iterations gives a better model which we can say is represented by (rank,lambda) pair. But it doesn't do much after that on test set. It just computes the rmse with `test' set.

My question is how it can be further tuned with test set data.

1

There are 1 answers

0
soorajmr On

No, one would never fine tune the model using test data. If you do that, it stops being your test data. I'd recommend this section of Prof. Andrew Ng's famous course that discusses the model training process: https://www.coursera.org/learn/machine-learning/home/week/6

Depending on your observation of the error values with validation data set, you might want to add/remove features, get more data or make changes in the model, or maybe even try a different algorithm altogether. If the cross-validation and the test rmse look reasonable, then you are done with the model and you could use it for the purpose (some prediction, I would assume) that made you build it in the first place.