MapReduce way to calculate a user similarity matrix


I have a list of many users (over 10 million), each of whom is represented by a userid followed by 10 floating-point numbers indicating their preferences. I would like to efficiently calculate the user similarity matrix using cosine similarity with MapReduce. However, since the values are floating-point numbers, it is hard to decide what to use as a key in the MapReduce framework. Any suggestions?
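For reference, here is a minimal sketch of the cosine similarity I mean for two preference vectors. The function names are only illustrative, and returning 0.0 for an all-zero vector is an assumption rather than a standard convention.

```python
import math


def norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        # Assumption: treat an all-zero preference vector as similar to nothing.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)
```

Scaling a vector by a positive constant does not change the result, so only the direction of the 10 preference values matters.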


There is 1 answer

Astronaut7 On

I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.

The first is Mahout's ItemSimilarityJob, which is part of its recommender system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which in your case would be SIMILARITY_COSINE), along with any additional optimizations. The job expects userID,itemID,value triples and computes item-item similarity, so the trick is to swap the roles: since you want user-user similarity based on a preference vector of ten floating-point values, number the positions of the vector 1 to 10 and generate a simple .csv file of vectorIndex, userID, decimalValue as input (the userID being a numeric Int or Long value). Your userIDs then sit in the item column, and the item-similarity output is the user-user similarity you are after. The resulting output should be a tab-separated text file of userID, userID, similarity.
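A minimal sketch of producing that input file, assuming the users are already available as (userID, list of floats) pairs. The helper names `to_item_similarity_rows` and `write_csv` are hypothetical and not part of Mahout.

```python
import csv
import io


def to_item_similarity_rows(users):
    """Yield (vectorIndex, userID, value) triples, with indices starting at 1."""
    for user_id, prefs in users:
        for index, value in enumerate(prefs, start=1):
            yield (index, user_id, value)


def write_csv(users, out):
    """Write one 'vectorIndex,userID,value' line per vector component."""
    writer = csv.writer(out, lineterminator="\n")
    for row in to_item_similarity_rows(users):
        writer.writerow(row)


# Small in-memory example; a real run would write to a file for HDFS upload.
buf = io.StringIO()
write_csv([(42, [0.5, 1.5]), (43, [2.0, 0.25])], buf)
print(buf.getvalue())
```

With ten values per user, each user contributes ten lines, so the input for 10 million users would be roughly 100 million lines.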

A second solution might be Mahout's RowSimilarityJob, included in its math library. I've never used it myself, but some info can be found here and in this previous Stack Overflow thread. Rather than a .csv as input, you would need to convert your input data into a DistributedRowMatrix, with the userIDs as the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
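On the question of what to use as a key: both jobs avoid keying on the floating-point values by keying on the position in the vector instead. The sketch below shows that general idea in plain Python. It is an illustration under my own assumptions, not Mahout's actual implementation, and `map_phase` and `reduce_phase` are hypothetical names.

```python
import math
from collections import defaultdict
from itertools import combinations


def map_phase(users):
    """Emit (vector index, (userID, normalized value)) records."""
    for user_id, prefs in users:
        length = math.sqrt(sum(x * x for x in prefs))
        if length == 0.0:
            continue  # assumption: skip users with an all-zero vector
        for index, value in enumerate(prefs):
            yield index, (user_id, value / length)


def reduce_phase(records):
    """Group by vector index, then sum per-index products for each user pair."""
    by_index = defaultdict(list)
    for index, entry in records:
        by_index[index].append(entry)

    similarity = defaultdict(float)
    for entries in by_index.values():
        # Sorting by userID keeps each pair in a single (smaller, larger) order.
        for (u, x), (v, y) in combinations(sorted(entries), 2):
            similarity[(u, v)] += x * y
    return dict(similarity)


users = [(1, [1.0, 0.0]), (2, [2.0, 0.0]), (3, [0.0, 1.0])]
print(reduce_phase(map_phase(users)))
```

Because every vector is scaled to unit length in the map step, the summed products are the cosine similarities. Note that with dense vectors every pair of users shares every index, so the number of pairs grows quadratically with the number of users.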

I suppose which solution is better depends on what input/output format you prefer. All the best.