Explanation of performance considerations of read/write on Google Datastore (GAE)?

598 views Asked by At

I'm having a very difficult time understanding the mechanics of the Google App Engine Datastore.
I want to understand the mechanics so I can build my database in an optimal way for the database.

Given my example below, can someone help me in:

  1. structure the database optimally
  2. understand the performance of both read and write given that structure

Example:
Let's say I have N baseball players and each has a unique ID.
I'd like to keep a daily tally of the homeruns hit by each player (storing a "total daily homeruns" property) and basically increment it when a homerun is hit.
So, as time increases, I'd like to show a graph of homeruns each day for each baseball player over X years.

Player 1
1/21/2011 - 2 homeruns
1/22/2011 - 0 homeruns
1/23/2011 - 1 homeruns

Read Requirement: Read the last 5 years of daily "homerun" data for a certain player?

Write Requirement: Increment a daily homerun count for a certain baseball player.

I would love to understand how to structure the data and also the mechanics of both read and write? Will this simple storage task scale? Thanks all.

2

There are 2 answers

0
systempuntoout On BEST ANSWER

I would model your requirements with a one-to-many relationship like this:

class Player(db.Model):
  name = db.StringProperty()

class DailyHomeruns(db.Model):
  date = db.DateProperty()
  counter = db.IntegerProperty()
  player = db.ReferenceProperty(Player)

To retrieve all the DailyHomeruns of a given Player you can do it in this way:

daily_homeruns = DailyHomeruns.all().filter('player =', player)
                                    .filter('date >', date_start)
                                    .filter('date <=', date_end)
                                    .order('date')

Read Requirement:

Google App Engine performance queries scales with the size of the result set and not with the size of the data set.

This means that if your last 5 years homeruns query set contains in average 800 entities *, this query performs the same whether it searches over a thousand entities or a million entities.

Write Requirement:
Writes are slow in Google App Engine but your scenario seems quite trivial and I don't see any possible contention/timeout problem; after all you just need to serially update the DailyHomeruns incrementing the counter for a small number of times per day.

Other thoughts:
If you need to calculate some stats, for example the total number of Homeruns of a given Player ,don't even think to use GQL for this purpose because it does not provide any aggregate function à la SQL.
Instead, you have to design your database upfront, defining a model to store the total count of Homeruns per player.
Using the transactions API, each time you increment the DailyHomeruns you will need to increment the TotalHomeruns entity for that Player.

* I have estimated 3 matches per week for 52 weeks multiplied per 5 years

3
guigouz On

There's no single answer to this question. The datastore is really low level and it's up to you to create the right indexes and preprocess data so it can be retrieved faster. Also, depending on concurrent access to the same entity, you'd have to use pretty creative stuff like http://code.google.com/appengine/articles/sharding_counters.html

I can recommend you watching two Google I/O Sessions to get you started http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore gives you a low-level overview of how everything works and why they were done this way (down to how the sectors are written to disk)

Then http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine will show you how to use that low level stuff in real world applications.

There's another one which presents other solutions to common problems http://www.google.com/events/io/2009/sessions/BuildingScalableComplexApps.html - nice to open your mind to new kinds of solutions to the datastore's limitations.