Optimizing for Social Leaderboards

118 views Asked by At

I'm using Google App Engine (python) for the backend of a mobile social game. The game uses Twitter integration to allow people to follow relative leaderboards and play against their friends or followers.

By far the most expensive piece of the puzzle is the background (push) task that hits the Twitter API to query for the friends and followers of a given user, and then stores that data within our datastore. I'm trying to optimize that to reduce costs as much as possible.

The Data Model:

There are three main models related to this portion of the app:

User
'''General user info, like scores and stats'''
# key id => randomly generated string that uniquely identifies a user
#   along the lines of user_kdsgj326
#   (I realize I probably should have just used the integer ID that GAE
#   creates, but its too late for that)

AuthAccount
'''Authentication mechanism.
     A user may have multiple auth accounts- one for each provider'''
# key id => concatenation of the auth provider and the auth provider's unique
#   ID for that user, ie, "tw:555555", where '555555' is their twitter ID
auth_id = ndb.StringProperty(indexed=True) # ie, '555555'
user = ndb.KeyProperty(kind=User, indexed=True)
extra_data = ndb.JsonProperty(indexed=False) # twitter picture url, name, etc.

RelativeUserScore
'''Denormalization for quickly generated relative leaderboards'''
# key id => same as their User id, ie, user_kdsgj326, so that we can quickly
#     retrieve the object for each user
follower_ids = ndb.StringProperty(indexed=True, repeated=True)
# misc properties for the user's score, name, etc. needed for leaderboard

I don't think its necessary for this question, but just in case, here is a more detailed discussion that led to this design.

The Task

The background thread receives the twitter authentication data and requests a chunk of friend IDs from the Twitter API, via tweepy. Twitter sends up to 5000 friend IDs by default, and I'd rather not arbitrarily limit that more if I can avoid it (you can only make so many requests to their API per minute).

Once I get the list of the friend IDs, I can easily translate that into "tw:" AuthAccount key IDs, and use get_multi to retrieve the AuthAccounts. Then I remove all of the Null accounts for twitter users not in our system, and get all the user IDs for the twitter friends that are in our system. Those ids are also the keys of the RelativeUserScores, so I use a bunch of transactional_tasklets to add this user's ID to the RelativeUserScore's followers list.

The Optimization Questions

  1. The first thing that happens is a call to Twitter's API. Given that this is required for everything else in the task, I'm assuming I would not get any gains in making this asynchronous, correct? (GAE is already smart enough to use the server for handling other tasks while this one blocks?)
  2. When determining if a twitter friend is playing our game, I currently convert all twitter friend ids to auth account IDs, and retrieve by get_multi. Given that this data is sparse (most twitter friends will most likely not be playing our game), would I be better off with a projection query that just retrieves the user ID directly? Something like...

    twitter_friend_ids = twitter_api.friend_ids() # potentially 5000 values
    friend_system_ids = AuthAccount\
        .query(AuthAccount.auth_id.IN(twitter_friend_ids))\
        .fetch(projection=[AuthAccount.user_id])
    

    (I can't remember or find where, but I read this is better because you don't waste time attempting to read model objects that don't exist

  3. Whether I end up using get_multi or a projection query, is there any benefit to breaking up the request into multiple async queries, instead of trying to get / query for potentially 5000 objects at once?
1

There are 1 answers

6
Brent Washburne On BEST ANSWER

I would organize the task like this:

  1. Make an asynchronous fetch call to the Twitter feed
  2. Use memcache to hold all the AuthAccount->User data:
    • Request the data from memcache, if it doesn't exist then make a fetch_async() call to the AuthAccount to populate memcache and a local dict
  3. Run each of the twitter IDs through the dict

Here is some sample code:

future = twitter_api.friend_ids()    # make this asynchronous

auth_users = memcache.get('auth_users')
if auth_users is None:
    auth_accounts = AuthAccount.query()
                               .fetch(projection=[AuthAccount.auth_id,
                                                  AuthAccount.user_id])
    auth_users = dict([(a.auth_id, a.user_id) for a in auth_accounts])
    memcache.add('auth_users', auth_users, 60)

twitter_friend_ids = future.get_result()  # get async twitter results

friend_system_ids = []
for id in twitter_friend_ids:
    friend_id = auth_users.get("tw:%s" % id)
    if friend_id:
        friend_system_ids.append(friend_id)

This is optimized for a relatively smaller number of users and a high rate of requests. Your comments above indicate a higher number of users and a lower rate of requests, so I would only make this change to your code:

twitter_friend_ids = twitter_api.friend_ids() # potentially 5000 values
auth_account_keys = [ndb.Key("AuthAccount", "tw:%s" % id) for id in twitter_friend_ids]
friend_system_ids = filter(None, ndb.get_multi(auth_account_keys))

This will use ndb's built-in memcache to hold data when using get_multi() with keys.