I'm developing an HBase storage layer for data generated by different sources. Columns from the same source are usually retrieved together. The expected write/read ratio ranges roughly from 1/10 to 1/100, depending on the source.
So I have two choices:
- Multiple column families: create just one table with multiple column families, where the data from each source forms its own column family.
- Multiple tables: create one table (with one column family) for each source.
Here is my current understanding; please correct me if anything is wrong.
- The multiple-tables solution works fine for dynamically adding new sources, whereas the multiple-column-families solution may require downtime.
- If the rowkeys of different sources have different distributions (for example, int user_id vs. image GUID) or cardinalities, maybe it's better to split them into different tables?
- We may need to retrieve columns from different sources for the same rowkey at the same time. In that case, multiple column families may be faster (not sure)?
Any suggestions, or are there other factors I need to consider before making the decision? Are there typical cases where one of the two approaches outperforms the other?
Thanks
Your points are correct. Just follow this simple rule:
If data from different sources is related and shares the same keys, or its keys can be transformed to the same key, put it in the same table in different column families. You will get better scans and a better data layout.
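To make the shared-key case concrete, here is a toy in-memory sketch in Python. It is not the HBase client API: `ToyTable`, `user_key`, and the `profile`/`clicks` family names are all hypothetical, made up only to show how two sources that agree on one rowkey can be read back with a single row lookup.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one table: rowkey -> family -> qualifier -> value."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, rowkey, family, qualifier, value):
        # Families are fixed when the table is created, as in the question's design.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[rowkey][family][qualifier] = value

    def get(self, rowkey, families=None):
        # One lookup by rowkey returns the requested families for that row.
        wanted = self.families if families is None else set(families)
        row = self.rows.get(rowkey, {})
        return {fam: dict(cols) for fam, cols in row.items() if fam in wanted}


def user_key(user_id):
    # Hypothetical key transformation: every source maps its id to one rowkey form.
    return f"user:{int(user_id):010d}"


table = ToyTable(families=["profile", "clicks"])
table.put(user_key(42), "profile", "name", "alice")       # source A uses an int id
table.put(user_key("42"), "clicks", "last_page", "/home")  # source B uses a string id

row = table.get(user_key(42))                 # both sources in one row lookup
clicks_only = table.get(user_key(42), ["clicks"])
```

With separate tables, the same read would need one lookup per table, so this layout mainly pays off when you often need several sources for the same key.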
If the data can't be kept together, put it in separate tables. One big table will only cause problems: you'll have longer scans, and most of the column families will be empty for any given row.
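As a rough way to check which case you are in, you could estimate how many row/family combinations would be empty if everything lived in one table. The sketch below is only a back-of-the-envelope heuristic in plain Python, not something HBase reports; the function name `empty_family_fraction` and the sample key sets are assumptions for illustration.

```python
def empty_family_fraction(keys_by_source):
    """Fraction of (row, family) slots with no data if all sources share one table.

    keys_by_source maps a source name (one column family each) to the set of
    rowkeys that source writes.
    """
    all_keys = set().union(*keys_by_source.values())
    total_slots = len(all_keys) * len(keys_by_source)
    filled_slots = sum(len(set(keys)) for keys in keys_by_source.values())
    return 1 - filled_slots / total_slots if total_slots else 0.0


# Sources that mostly share keys: few empty families per row.
shared = {"profile": {"u1", "u2", "u3"}, "clicks": {"u1", "u2"}}

# Sources with unrelated keys (user ids vs. image GUIDs): half the slots are empty.
disjoint = {"users": {"u1", "u2"}, "images": {"img-a", "img-b"}}

shared_fraction = empty_family_fraction(shared)
disjoint_fraction = empty_family_fraction(disjoint)
```

A fraction near zero suggests the sources belong in one table with several column families; a high fraction suggests separate tables. Where to draw the line is a judgment call for your workload.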