Optimizing Push Task Queues


I use Google App Engine (Python) as the backend of a mobile game that includes social network integration (Twitter) and global and relative leaderboards. My application makes use of two push task queues: one for building out the relationships between players, and one for updating those objects when a player's score changes.

Model

class RelativeUserScore(ndb.Model):
    ID_FORMAT = "%s:%s" # "friend_id:follower_id"
    #--- NDB Properties
    follower_id = ndb.StringProperty(indexed=True)          # the follower
    user_id = ndb.StringProperty(indexed=True)              # the followed (AKA friend)
    points = ndb.IntegerProperty(indexed=True)              # user data denormalization
    screen_name = ndb.StringProperty(indexed=False)         # user data denormalization
    profile_image_url = ndb.StringProperty(indexed=False)   # user data denormalization

This allows me to build the relative leaderboards by querying for objects where the requesting user is the follower.
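For reference, the pair key and the leaderboard lookup fit together roughly as below. The key helper is plain Python; the ndb query only runs inside the App Engine runtime, so it's shown as a comment, and the page size of 50 is just an illustrative assumption:

```python
ID_FORMAT = "%s:%s"  # "friend_id:follower_id", same format as in the model

def relative_score_key_id(friend_id, follower_id):
    """Deterministic string id for a RelativeUserScore entity."""
    return ID_FORMAT % (friend_id, follower_id)

# Relative leaderboard for a viewer (hypothetical page size of 50):
# RelativeUserScore.query(RelativeUserScore.follower_id == viewer_id) \
#     .order(-RelativeUserScore.points) \
#     .fetch(50)
```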

Push Task Queues

I basically have two major tasks to be performed:

sync-twitter tasks fetch the friends/followers from Twitter's API and build out the relative user score models. Friends are checked on user sign-up, and again if the user's Twitter friend count changes. Followers are only checked on sign-up. This runs in its own module with F4 instances and min_idle_instances set to 20 (I'd like to reduce both settings if possible, though instance memory usage requires at least F2 instances).

- name: sync-twitter
  target: sync-twitter                          # target version / module
  bucket_size: 100                              # default is 5, max is 100?
  max_concurrent_requests: 1000                 # default is 1000. what is the max?
  rate: 100/s                                   # default is 5/s. what is the max?
  retry_parameters:                            
    min_backoff_seconds: 30
    max_backoff_seconds: 900

update-leaderboard tasks update all of the user's RelativeUserScore objects after they play a game (a game only takes about two minutes to play). This runs in its own module with F2 instances and min_idle_instances set to 10 (I'd like to reduce both settings if possible).

- name: update-leaderboard
  target: update-leaderboard                    # target version / module
  bucket_size: 100                              # default is 5, max is 100?
  max_concurrent_requests: 1000                 # default is 1000. what is the max?
  rate: 100/s                                   # default is 5/s. what is the max?

I've already optimized these tasks to run asynchronously, which reduced their run time significantly; most of the time a task takes between 0.5 and 5 seconds. I've also put each task queue on its own dedicated module, with automatic scaling turned up pretty high (using F4 and F2 server types, respectively). However, I'm still running into a few issues.

As you can see, I've also tried to max out bucket_size and max_concurrent_requests so that these tasks run as fast as possible.

Problems

  1. Every once in a while I get a DeadlineExceededError on the request handler that initiates the call: "The API call taskqueue.BulkAdd() took too long to respond and was cancelled."
  2. Every once in a while I get a chunk of similar errors within the tasks themselves (for both task types): "Process terminated because the request deadline was exceeded during a loading request". (Note that this isn't reported as a DeadlineExceededError.) The logs show these tasks consumed the entire 600 seconds allowed. They end up getting rescheduled, and when they re-run they only take the expected 0.5 to 5 seconds. I've tried using AppStats to gain more insight into what's going on, but these calls never get recorded, as they are killed before AppStats is able to save.
  3. With users updating their score as frequently as every two minutes, the update-leaderboard queue starts to fall behind somewhere around 10K CCU. I'd ideally like to be prepared for at least 100K CCU. (By CCU I mean actual users playing our game, not the number of concurrent requests, which is only about 500 front-end API requests/second at 25K users; I use locust.io to load test.)
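One mitigation I'm considering for problem #1 (an assumption on my part, not something I've verified fixes it) is to enqueue in smaller batches rather than handing the task queue one huge list, so that no single BulkAdd RPC has to beat the API deadline. The batch size of 25 is arbitrary:

```python
def batches(tasks, batch_size=25):
    """Yield successive slices of a task list so each Queue.add() call stays small."""
    for i in range(0, len(tasks), batch_size):
        yield tasks[i:i + batch_size]

# usage inside a handler (App Engine only):
# q = taskqueue.Queue('update-leaderboard')
# for batch in batches(all_tasks):
#     q.add_async(batch)  # add_async also keeps the handler from blocking on the RPC
```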

Potential Optimizations / Questions

My first thought is that the first two issues may stem from having only a single task queue for each task type. Maybe this is happening because the underlying Bigtable is splitting during these calls? (See this article, specifically "Queue Sharding for Stable Performance".)

So, maybe I should shard each queue into, say, 10 queues; I'd think problem #3 would also benefit from this queue sharding. My questions:

  1. Any idea as to the underlying causes of problems #1 and #2? Would sharding actually help eliminate these errors?
  2. If I do shard the queues, could I keep all the shards on their same respective module and rely on its autoscaling to meet demand, or would I be better off with a module per shard?
  3. Is there any way to dynamically scale the sharding?
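A sketch of what the enqueue side of sharding could look like (the shard count of 10 and the naming scheme are my assumptions; the matching queues would still need to be declared in queue.yaml):

```python
import hashlib

NUM_SHARDS = 10  # hypothetical; must match the number of queues declared in queue.yaml

def sharded_queue_name(base_name, user_id, num_shards=NUM_SHARDS):
    """Deterministically map a user to one shard so their tasks always land on
    the same queue (a random choice would spread load just as well)."""
    shard = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % num_shards
    return "%s-%d" % (base_name, shard)

# usage (App Engine only):
# taskqueue.add(queue_name=sharded_queue_name("update-leaderboard", user_id),
#               url="/tasks/update-leaderboard", params={"user_id": user_id})
```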

My next thought is to try to reduce the number of update-leaderboard tasks, so that not every completed game translates directly into a leaderboard update. But I'd need a scheme where, even if a user plays only one game, it's guaranteed to update the objects eventually. Any suggestions on implementing this reduction?
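One way to get that guarantee is to coalesce updates using named tasks: task names are tombstoned after use, so adding a second task with the same name raises TaskAlreadyExistsError and is effectively a no-op. If the name encodes a time bucket, each user gets at most one leaderboard update per window, and a lone game still triggers an update when its window's task fires. A sketch (the 120-second window is an assumption):

```python
import time

def coalescing_task_name(user_id, now=None, window_secs=120):
    """One task name per user per time window; duplicate adds within the
    same window fail with TaskAlreadyExistsError and can simply be ignored."""
    bucket = int(now if now is not None else time.time()) // window_secs
    return "leaderboard-%s-%d" % (user_id, bucket)

# usage (App Engine only):
# try:
#     taskqueue.add(name=coalescing_task_name(user_id),
#                   queue_name="update-leaderboard",
#                   countdown=120,  # run at the end of the window
#                   url="/tasks/update-leaderboard", params={"user_id": user_id})
# except taskqueue.TaskAlreadyExistsError:
#     pass  # an update for this user/window is already scheduled
```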

Finally, all of the modules' auto-scaling parameters and the queues' parameters were set somewhat arbitrarily, erring on the side of maxing them out. Any advice on setting these appropriately so that I'm not spending more resources than I need?
