A nosql question:
Let's say I have this scenario:
A user as a status that change often (let's say every seconds), it also has different characteristics such as country (up to 10k characteristics per user)... A user can post messages which have different types.
The issue:
The scenario is in my opinion very RDS oriented where JOIN
would be used a lot for querying. However, it is not an option (for the sake of the exercise). Therefore, I am not looking for a solution with pseudo RDS like HIVE or other solution where you have pseudo join. I am looking for something like mongodb
which can use mapreduce
or aggregation
.
My solution (using mongodb):
Let's say you have 3 collections:
- user => user Characteristics (a large number of differents characteristics such as age/sex..)
- message => message specific field
- status => status specific field
The possible way to tackle this problem (as far as I know) are:
- Denormalize the data by duplicating the user field and
embed
it in message and status (or putting everything in one collection) => Does not seem optimized as you have a lot of characteristics per user and you will reach the 2MB limit of the documents (you could use GridFS but I am worried about the perf of something like that and you also duplicate tons of not useful data storage). - Use a SQL like solution by adding user_id
reference
to message and status => Seems like a reasonable solution. However You are then trapped (in terms of query performance) if you actually want to make specific queries such ascount the number of message of type X for users that have their last status equal to Z and have characteristic Y equal to E and group them by characteristic W
.In SQL it would beSELECT tmp.count(*), user.characteristicW FROM message INNER JOIN status on status.user_id=message.user_id INNER JOIN user ON user.id=message.user_id WHERE status.type=Z and user.characteristicZ=E group by user.characteristicW
(this query is actually not totally exact as you want to know if the last status is equal to Z and not if it ever had a status equal to Z, it would require a select within a select but that is not the point of the exercise). It becomes quickly very demanding where you have to make several queries (in this example one for getting the user id that have the last status equal to Z, then another one to filter this user id list to the one who have characteristic Y equal to E, then get all messages from these users and then group them by characteristic W with a map reduce job. - go with double reference user that ref to message and message to user. Also status that ref to user and user that ref to status. => Might seem fine but the user document already being big you come back to potentially issue with solution 1 assuming there will be tons of messages and status.
I went with option 2 but I am unhappy about it as the processing time to query it might not seem optimized.
Question:
In a scenario as above, what are the best practices to implement a scalable solution that allow complex querying as the example I gave above.