RDBMS to NoSQL data denormalization


I'm looking for the most appropriate approach to data denormalization.

I have 5 tables in MySQL: posts, groups, posts_to_groups (a many-to-many relationship between posts and groups), posts_comments, and posts_votes.
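Roughly, the schema looks like this (column names are simplified and only illustrative):

```sql
-- Sketch of the assumed MySQL schema; real column names and types may differ.
CREATE TABLE posts (
  id         BIGINT PRIMARY KEY,
  created_at DATETIME NOT NULL
);

CREATE TABLE groups (
  id BIGINT PRIMARY KEY
);

-- Many-to-many link between posts and groups.
CREATE TABLE posts_to_groups (
  post_id  BIGINT NOT NULL,
  group_id BIGINT NOT NULL,
  PRIMARY KEY (post_id, group_id)
);

CREATE TABLE posts_comments (
  id         BIGINT PRIMARY KEY,
  post_id    BIGINT NOT NULL,
  created_at DATETIME NOT NULL
);

CREATE TABLE posts_votes (
  id         BIGINT PRIMARY KEY,
  post_id    BIGINT NOT NULL,
  created_at DATETIME NOT NULL
);
```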

I need to quickly load a number of posts from specific groups that have the largest count of comments + votes within a given time range, i.e. the last N hours. Performance is my number-one criterion.
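In plain MySQL, the query I need would look something like this (against the simplified schema above; the 6-hour window, group IDs, and LIMIT are only examples):

```sql
-- Top 20 posts from the chosen groups, ranked by comments + votes in the last 6 hours.
SELECT ptg.group_id,
       p.id AS post_id,
       COALESCE(c.cnt, 0) + COALESCE(v.cnt, 0) AS activity
FROM posts p
JOIN posts_to_groups ptg ON ptg.post_id = p.id
LEFT JOIN (
    SELECT post_id, COUNT(*) AS cnt
    FROM posts_comments
    WHERE created_at >= NOW() - INTERVAL 6 HOUR
    GROUP BY post_id
) c ON c.post_id = p.id
LEFT JOIN (
    SELECT post_id, COUNT(*) AS cnt
    FROM posts_votes
    WHERE created_at >= NOW() - INTERVAL 6 HOUR
    GROUP BY post_id
) v ON v.post_id = p.id
WHERE ptg.group_id IN (1, 2, 3)
ORDER BY activity DESC
LIMIT 20;
```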

Right now I'm considering a solution based on HBase or Cassandra.

For example, if I have 1 post and this post is associated with 3 groups, I can store 3 new records (1 post × 3 groups) in HBase for each comment or vote on that post, using a row key of group_id.post_id.timestamp. However, I'm not sure how to count the largest number of comments + votes in HBase for a given time range on the fly.

I'm sure there are better approaches, so I would appreciate your advice and thoughts on this task.


1 Answer

Answer by Ramzy:

You said performance is key, but there is more to it: you want aggregation to be performed, not just lookups by key (which is what HBase is designed for). Also, HBase is a good fit when your rows don't all share the same format. I don't see either of these characteristics in your requirements, so I wouldn't go for HBase.

Also, you are trying to combine 5 tables into one and expecting HBase to give you aggregation on top of it, which may be too much. Hive is designed for that kind of work. Denormalization works, but it is meant to make lookups by keys and values easy; your requirement stretches beyond denormalization into performing aggregations on top of it. I too am waiting for a better answer on this aspect. Happy learning.
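For instance, if you loaded the denormalized activity data (one row per comment or vote, repeated once per group the post belongs to) into a Hive table, the aggregation you describe becomes a plain GROUP BY. The table and column names below are only illustrative:

```sql
-- Hypothetical denormalized activity table in Hive.
CREATE TABLE post_activity (
  group_id   BIGINT,
  post_id    BIGINT,
  event_type STRING,     -- 'comment' or 'vote'
  event_ts   TIMESTAMP
);

-- Rank posts in the chosen groups by comment + vote activity in the last 6 hours.
SELECT group_id,
       post_id,
       COUNT(*) AS activity
FROM post_activity
WHERE group_id IN (1, 2, 3)
  AND unix_timestamp(event_ts) >= unix_timestamp() - 6 * 3600
GROUP BY group_id, post_id
ORDER BY activity DESC
LIMIT 20;
```

Note that this is a batch-oriented query, not a low-latency lookup, which is exactly the trade-off between Hive and HBase discussed above.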