Suppose we are duplicating Twitter's follow function. As far as I can tell, everyone now agrees on the following design using Redis.
All tweets from the users joe follows are stored in a sorted set "ss:joe" with member=tweet_id, score=tweet_timestamp.
So when joe follows ladygaga, ladygaga's tweets are added to "ss:joe". So far so good.
The question is: how do I remove ladygaga's tweets from "ss:joe" when joe unfollows ladygaga?
Iterating through every single tweet in "ss:joe" and removing those that belong to ladygaga is out.
The best I can think of is to maintain another sorted set for every user storing her own tweets, so ladygaga will have her sorted set "tweets:ladygaga" with member=tweet_id, score=tweet_timestamp. Then we can pick out ladygaga's tweets by running ZINTERSTORE on "ss:joe" and "tweets:ladygaga", and remove the resulting tweet_ids from "ss:joe".
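To make the idea concrete, here is a minimal sketch of that unfollow step. It assumes `r` is a redis-py client created with `decode_responses=True`; the function name and the scratch key `tmp:unfollow:...` are made up for illustration, and nothing here is atomic (a MULTI/EXEC transaction or a Lua script would be needed for that).

```python
def unfollow_by_intersection(r, follower, followee):
    """Remove the followee's tweets from the follower's timeline set.

    `r` is assumed to be a redis-py client (decode_responses=True).
    Returns the number of tweet IDs that were selected for removal.
    """
    timeline = f"ss:{follower}"        # key layout from the question
    own_tweets = f"tweets:{followee}"  # key layout from the question
    scratch = f"tmp:unfollow:{follower}:{followee}"  # hypothetical scratch key

    # ZINTERSTORE keeps only the tweet IDs present in both sorted sets.
    r.zinterstore(scratch, [timeline, own_tweets])
    doomed = r.zrange(scratch, 0, -1)
    if doomed:
        # ZREM accepts several members in a single call.
        r.zrem(timeline, *doomed)
    r.delete(scratch)
    return len(doomed)
```

Called as `unfollow_by_intersection(r, "joe", "ladygaga")`, this costs one intersection plus one bulk removal per unfollow, instead of a scan over all of "ss:joe" in application code.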
Is there a better solution?
There is an even bigger problem with this design. Storing the tweet_ids in "ss:joe" means that the system cannot account for gaga creating a new tweet (or deleting one, if that is supported) without also modifying "ss:joe". Now imagine having a few hundred celebs with 50,000 followers each, and each writing a dozen tweets per day. That's a lot of inserts into a lot of sets, which you cannot easily distribute either. And it's a lot of duplicate data (remember Redis is a RAM-only database, and although RAM gets cheaper it's still nowhere near "unlimited").
EDIT: And for updating follower records, you need to know the followers too (since iterating over every user on every newly written tweet is hardly an option). So you need to maintain a list of backlinks as well.
An alternative design would be to store the user IDs of the people a user follows in a set (or sorted set, if you will, so the user can shuffle the order). Each person further has a sorted set with all their tweet IDs (sorted by date).
This will require an additional query per followed person to get the tweet IDs, but it will reduce unfollowing to removing one value from a set, and it will keep everyone updated automatically as new tweets are created.
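A rough sketch of this alternative layout, again assuming `r` is a redis-py client created with `decode_responses=True`. The key name `following:<user>` and all function names are invented for the example; `tweets:<user>` is the per-author sorted set from the question.

```python
def follow(r, follower, followee):
    # One SADD per follow; no tweets are copied anywhere.
    r.sadd(f"following:{follower}", followee)

def unfollow(r, follower, followee):
    # Unfollowing is a single SREM.
    r.srem(f"following:{follower}", followee)

def post_tweet(r, author, tweet_id, timestamp):
    # A new tweet touches only the author's own sorted set.
    r.zadd(f"tweets:{author}", {tweet_id: timestamp})

def timeline(r, user, limit=20):
    """Return up to `limit` tweet IDs, newest first, built at read time."""
    candidates = []
    for followee in r.smembers(f"following:{user}"):
        # One extra query per followed person: that person's newest tweets.
        candidates.extend(
            r.zrevrange(f"tweets:{followee}", 0, limit - 1, withscores=True)
        )
    # Merge the per-author results by timestamp, newest first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [tweet_id for tweet_id, _score in candidates[:limit]]
```

Fetching only the newest `limit` tweets per followed person is enough, because no single author can contribute more than `limit` entries to the merged result.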
Lookups are less costly than inserts/removes (which may require rebalancing or rehashing), so even if you follow a dozen people, those extra queries probably aren't as much of an issue as would be more frequent updates.
Plus, lookups can actually happen on a network of read replicas (a second or two may pass before a new tweet is visible to everyone, but who cares), so read capacity grows as you add replicas.