parent->child relationships in appengine python (bigtable)

2.8k views Asked by At

I'm still learning my lessons about data modeling in bigtable/nosql and would appreciate some feedback. Would it be fair to say that I should avoid parent->child relationships in my data modeling if I frequently need to deal with the children in aggregate across parents?

As an example, let's say I'm building a blog that will be contributed to by a number of authors, and each other has posts, and each post has tags. So I could potentially set up something like this:

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.StringListProperty() 

As I understand this will create an entity group based on the Author parent. Does this cause inefficiency if I mostly need to query for Posts by tags which I expect to cut across multiple Authors?

I understand doing a query on list properties can be inefficient. Let's say each post has about 3 tags on average, but could go all the way up to 7. And I expect my collection of possible tags to be in the low hundreds. Is there any benefit to altering that model to something like this?

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.ListProperty(db.Key)

class Tag(db.Model): 
  name = db.StringProperty() 

Or would I be better off doing something like this?

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  owner = db.ReferenceProperty(Author, 
    collection_name='posts')

class Tag(db.Model): 
  name = db.StringProperty() 

class PostTag(db.Model): 
  post = db.ReferenceProperty(Post, 
    collection_name='posts') 
  tag = db.ReferenceProperty(Tag, 
    collection_name='tags') 

And last question... what if my most common use case will be querying for posts by multiple tags. E.g., "find all posts with tags in {'apples', 'oranges', 'cucumbers', 'bicycles'}" Is one of these approaches more appropriate for a query that looks for posts that have any of a collection of tags?

Thanks, I know that was a mouthful. :-)

2

There are 2 answers

2
Robert Kluin On BEST ANSWER

Something like the first or second approach are well suited for App Engine. Consider the following setup:

class Author(db.Model): 
  owner = db.UserProperty()

class Post(db.Model): 
  author = db.ReferenceProperty(Author, 
    collection_name='posts') 
  tags = db.StringListProperty()

class Tag(db.Model): 
  post_count = db.IntegerProperty()

If you use the string tag (case-normalized) as the Tag entity key_name, you can efficiently query for posts with a specific tag, or list the tags of a post, or fetch tag statistics:

post = Post(author=some_author, tags=['app-engine', 'google', 'python'])
post_key = post.put()
# call some method to increment post counts...
increment_tag_post_counts(post_key)

# get posts with a given tag:
matching_posts = Post.all().filter('tags =', 'google').fetch(100)
# or, two tags:
matching_posts = Post.all().filter('tags =', 'google').filter('tags =', 'python').fetch(100)

# get tag list from a post:
tag_stats = Tag.get_by_key_name(post.tags)

The third approach requires additional queries or fetches for most basic operations, and it is more difficult if you want to query for multiple tags.

3
Kenan Banks On

I would choose the last approach, because it allows for retrieving a list of posts directly given a tag.

The first approach basically makes it impossible to keep a canonical set of tags. In other words, the question "what tags are currently present in the system" is very expensive to answer.

The second approach fixes that problem, but as I mentioned doesn't help you to retrieve posts given a tag.

Entity groups are a bit of a mysterious beast, but suffice it to say the first approach does NOT create an entity group, and that they are only necessary for transactional database operations, and sometimes useful for optimized data reads, but are probably unneeded in a smallish application.

It should be mentioned that any approach you take will only work well in conjunction with a smart caching strategy. GAE apps LOVE caching. Get intimate with the memcache api, and learn the bulk read/write operations on memcache and the datastore.