App engine datastore denormalization: index properties in the main entity or the denormalized entity?

241 views Asked by At

Consider the classic example of blog data modelling, where we have a Blog entity with many properties, and we want to list the latest blogs in a page.

It makes sense to denormalize the BlogPost entity into a BlogPostSummary entity which will be shown in the list view, avoiding fetching and deserializing many unwanted properties.

class BlogPost(db.Model):
  title = db.StringProperty()
  content = db.TextProperty()
  created = db.DateProperty()
  ...

class BlogPostSummary(db.Model):
  title = db.StringProperty()
  content_excerpt = db.TextProperty()

The question is: which entity should hold the indexed properties? There are 3 options:

1. Index properties in both

  • Pros:
    • Easy query on both entities.
  • Cons:
    • Maintaining denormalized indexes is expensive.

2. Index properties in main entity only

  • Pros:
    • Indexing properties in the main entity is more safe, as the denormalized entity is treated as redundancy.
  • Cons:
    • Querying the list view will need a double roundtrip to datastore: One to key-only query for BlogPost entities, followed by a batch get for BlogPostSummary.

3. Index in denormalized entity only

  • Pros:
    • The list view can be easily built by a single query.
  • Cons:
    • The main entity cannot be queried by those properties anymore.
    • The indexes occupy more space when the denormalized entity is a child of the main entity.

Which option would work better? Are there other options?

Would the double round trip to datastore in option 2 be a problem?

2

There are 2 answers

4
Andrei Volgin On

This is an abstract question that does not have a "correct" answer. The choice of a data model depends on specific requirements of a project, including:

  • usage patterns (how often you need access to different data)
  • update patterns (e.g. separating frequently updated properties from stable properties in order to reduce write costs)
  • average performance and extreme-case performance requirements (e.g an average blog may have 10 posts, a very popular blog may have 10,000 posts)
  • ability to use memcache to reduce datastore trips and improve performance
  • data complexity (i.e. how many different kids of entities depend on this particular entity kind)
  • transaction requirements
  • security and access roles considerations (e.g. not exposing private data by mistake)

By the way, there is another way to model data in the Datastore - using child entities. For example, blog posts may be child entities of a blog entity. This way you can retrieve all blog posts with a single query by providing a parent key - without storing post IDs or keys in the blog entity or blog ID/key in the post entities.

1
Brent Washburne On

If you want to avoid fetching and deserializing unwanted properties, then put all your entities into a single model and use Projection queries:

class BlogPost(db.Model):
  title = db.StringProperty(indexed=True)
  content = db.TextProperty(indexed=False)
  content_excerpt = db.TextProperty(indexed=False)
  created = db.DateProperty(indexed=True)
  ...

posts = BlogPost.query().fetch(projection=["title", "created"])

(Note that the projection fields must be indexed.)