Collaborative Filtering / Recommendation System performance and approaches

3.4k views Asked by At

I'm really interested to find out how people approach collaborative filtering and recommendation engines etc. I mean this more in terms of performance of the script than anything. I have stated reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.

I currently only have 2k users, but my current system is proving to be totally not future proof and very taxing on the server already. The entire system is based on making recommendations of posts to users. My application is PHP/MySQL but I use some MongoDB for my collaborative filtering stuff - I'm on a large Amazon EC2 instance. My setup is really a 2 step process. First I calculate similarities between items, then I use this information to make recommendations. Here's how it works:

First my system calculates similarities between users posts. The script runs an algorithm which returns a similarity score for each pair. The algorithm examines information such as - common tags, common commenters and common likers and is able to return a similarity score. The process goes like:

  • Each time a post is added, has a tag added, commented on or liked I add it to a queue.
  • I process this queue via cron (once a day), finding out the relevant information for each post, e.g. user_id's of the commenters and likers and tag_id's. I save this information to MongoDB in this kind of structure: {"post_id":1,"tag_ids":[12,44,67],"commenter_user_ids":[6,18,22],"liker_user_ids":[87,6]}. This allows me to eventually build up a MongoDB collection which gives me easy and quick access to all of the relevant information for when I try to calculate similarities
  • I then run another cron script (once a day also, but after the previous) which goes through the queue again. This time, for each post in the queue, I grab their entry from the MongoDB collection and compare it to all of the other entries. When 2 entries have some matching information, I give them +1 in terms of similarity. In the end I have an overall score for each pair of posts. I save the scores to a different MongoDB collection with the following structure: {"post_id":1,"similar":{"23":2,"2":5,"7":2}} ('similar' is a key=>value array with the post_id as key and the similarity score as the value. I don't save a score if it is 0.

I have 5k posts. So all of the above is quite hard on the server. There's a huge amount of reads and writes to be performed. Now, this is only half the issue. I then use this information to work out what posts would be interesting to a particular user. So, once an hour I run a cron script which runs a script that calculates 1 recommended post for each user on the site. The process goes like so:

  • The script first decides, which type of recommendation the user will get. It's a 50-50 change of - 1. A post similar to one of your posts or 2. A post similar to a post you have interacted with.
  • If 1, then the script grabs the users post_ids from MySQL, then uses them to grab their similar posts from MongoDB. The script takes the post that is most similar and has not yet been recommended to the user.
  • If 2, the script grabs all of the posts the user has commented on or liked from MySQL and uses their ids to do the same in 1 above.

Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.

I'm just wondering if anyone feels I could be approaching this any better?

2

There are 2 answers

0
JasonG On BEST ANSWER

I'm starting to plan how to do this. First thing is to possibly get rid of your database technology or supplement it with either triplestore or graph technologies. That should provide some better performance for analyzing similar likes or topics.

Next yes get a subset. Take a few interests that the user has and get a small pool of users that have similar interests.

Then build indexes of likes in some sort of meaningful order and count the inversions (divide and conquer - this is pretty similar to merge sort and you'll want to sort on your way out to count split inversions anyways).

I hope that helps - you don't want to compare everything to everything else or it's definately n2. You should be able to replace that with something somwhere between constant and linear if you take sets of people who have similar likes and use that.

For example, from a graph perspective, take something that they recently liked, and look at the in edges and then go trace them out and just analyze those users. Maybe do this on a few recently liked articles and then find a common set of users from that and use that for the collaborative filtering to find articles the user would likely enjoy. then you're at a workable problem size - especially in graph where there is no index growth (although maybe more in edges to traverse on the article - that just gives you more change of finding usable data though)

Even better would be to key the articles themselves so that if an article was liked by someone you can see articles that they may like based on other users (ie Amazon's 'users that bought this also bought').

Hope that gives a few ideas. For graph analysis there are some frameworks that may help like faunus for stats and derivitions.

1
symcbean On

With 5000 posts, that's 25,000,000 relationships, increasing O(n^2).

Your first problem is how you can avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching - and you could use date ranges to limit common 'likes'. Beyond that....we'd to know a lot more about the methodology for establishing relationships.

Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.

Indeed depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.

And if I were writing a platform to measure relationships with data then (the clue is in the name) I'd definitely be leaning towards a relational database where joins are easy and much of the logic can be implemented on the database tier.

It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address - but the benefits of this mainly come from being to run the algorithm in prallel across lots of machines - at the end of the day it takes just as many clock ticks.