Too small RMSE. Recommender systems

Sorry, I'm a newbie at recommender systems, but I wrote a few lines of code using the Apache Mahout library. My dataset is pretty small: 500x100, with 8102 known cells.

My dataset is actually a subset of the Yelp dataset from the "Yelp business rating prediction" competition. I took the 100 most-commented restaurants and then the 500 most active customers.

I created an SVDRecommender and then evaluated its RMSE. The result is about 0.4. Why is it so small? Maybe I just don't understand something and my dataset is not that sparse, but then I tried a larger, sparser dataset and the RMSE became even smaller (about 0.18)! Could anyone explain this behaviour?

DataModel model = new FileDataModel(new File("datamf.csv"));
final RatingSGDFactorizer factorizer = new RatingSGDFactorizer(model, 20, 200);
final Factorization f = factorizer.factorize();

RecommenderBuilder builder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        // build here whatever existing or customized recommendation algorithm
        return new SVDRecommender(model, factorizer);
    }
};

RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
double score = evaluator.evaluate(builder,


There is 1 answer

Dan Jarratt

RMSE is calculated by comparing predicted ratings against their hidden ground-truth values. A sparse dataset may have very few hidden ratings to predict, or your algorithm may be unable to produce a prediction for many of the hidden ratings because they have no correlation with other ratings. So even though your RMSE is low ("better"), your coverage will also be low, because you aren't predicting very many items.
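To make the RMSE-versus-coverage point concrete, here is a minimal sketch in plain Java (no Mahout calls). The class name `RmseWithCoverage` and the toy numbers are made up for illustration, and it assumes that a hidden rating the recommender could not predict is marked as `NaN` and simply left out of the error:

```java
import java.util.Arrays;

final class RmseWithCoverage {

    /** RMSE over the hidden ratings that actually received a prediction (NaN = no prediction). */
    static double rmse(double[] actual, double[] predicted) {
        double sumSquaredError = 0.0;
        int predictedCount = 0;
        for (int i = 0; i < actual.length; i++) {
            if (Double.isNaN(predicted[i])) {
                continue; // unpredictable ratings do not contribute to the error
            }
            double diff = predicted[i] - actual[i];
            sumSquaredError += diff * diff;
            predictedCount++;
        }
        return predictedCount == 0 ? Double.NaN : Math.sqrt(sumSquaredError / predictedCount);
    }

    /** Fraction of hidden ratings for which a prediction was produced. */
    static double coverage(double[] predicted) {
        if (predicted.length == 0) {
            return 0.0;
        }
        long predictedCount = Arrays.stream(predicted).filter(p -> !Double.isNaN(p)).count();
        return (double) predictedCount / predicted.length;
    }

    public static void main(String[] args) {
        // Hypothetical hidden ratings and predictions; three of five could not be predicted.
        double[] actual = {4.0, 2.0, 5.0, 1.0, 3.0};
        double[] predicted = {3.5, Double.NaN, 4.5, Double.NaN, Double.NaN};
        System.out.println("RMSE     = " + rmse(actual, predicted));
        System.out.println("coverage = " + coverage(predicted));
    }
}
```

With these toy numbers the RMSE only reflects the two ratings that were predicted, while coverage shows that most hidden ratings were skipped. Reporting both side by side makes a suspiciously low RMSE easier to interpret.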

There's another issue: RMSE is completely dataset-dependent. On the MovieLens ratings dataset, which has star ratings from 0.5 to 5.0, an RMSE of roughly 0.9 is common. But on another dataset with ratings from 0.0 to 1.0 points, I've observed an RMSE of around 0.2. Look at the properties of your dataset and see if 0.4 makes sense.
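One rough way to compare RMSE values across rating scales is to divide by the width of the rating range. This is just a sketch of that idea, not something Mahout does for you; the class and method names are hypothetical, and the inputs are the approximate figures quoted above:

```java
final class NormalizedRmse {

    /** Divides an RMSE by the width of the rating scale so different scales can be compared. */
    static double normalize(double rmse, double minRating, double maxRating) {
        if (maxRating <= minRating) {
            throw new IllegalArgumentException("maxRating must be greater than minRating");
        }
        return rmse / (maxRating - minRating);
    }

    public static void main(String[] args) {
        // MovieLens-style scale: 0.5 to 5.0 stars, RMSE of roughly 0.9
        System.out.println(normalize(0.9, 0.5, 5.0));
        // A 0.0 to 1.0 point scale, RMSE of around 0.2
        System.out.println(normalize(0.2, 0.0, 1.0));
    }
}
```

Both calls come out at about 0.2 of the scale width, so the two raw RMSE values are less different than they look. Running the same calculation with your own rating range gives a quick sanity check on whether 0.4 is plausible.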