I am running an AWS EMR cluster with HBase installed. I followed these instructions to set up the cluster using S3 as the HBase datastore. The cluster is up and running, and I am able to SSH in and use the HBase shell with no problems.
The data we are trying to store is genomic data and very wide. For each row-key, there can be up to 250,000 column keys. We have experimented with different numbers of column families, from grouping all the keys in 1 column family, to using 42 different column families with the column keys spread out amongst them.
To interact with HBase, we are using happybase in Python, which uses Thrift to communicate with the primary node. Retrieving a single row-key takes around 2.7 s to return the result; I was expecting retrieval times in the millisecond range for this type of operation. Our configuration is very simple, with no additional optimizations done. We are trying to decide whether HBase is the right fit for our database needs, but given the slow data retrieval times, we are leaning away from it.
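For reference, here is a minimal sketch of the kind of single-row retrieval being timed. The host name, table name, row key, and column family below are placeholders rather than our real values, and the helper names `time_row_fetch` and `run_against_cluster` are made up for illustration. It assumes the Thrift server is listening on its default port (9090).

```python
import time


def time_row_fetch(table, row_key, columns=None):
    """Fetch one row and return (row_dict, elapsed_seconds).

    `table` is expected to be a happybase Table (anything with a
    compatible .row() method works). `columns` may be a list of
    column families or fully qualified columns to restrict the get.
    """
    start = time.perf_counter()
    row = table.row(row_key, columns=columns)
    elapsed = time.perf_counter() - start
    return row, elapsed


def run_against_cluster(host, table_name, row_key, family=None):
    """Time a full-row get and, optionally, a single-family get.

    All arguments are placeholders supplied by the caller.
    """
    import happybase  # third-party: pip install happybase

    connection = happybase.Connection(host, port=9090)
    try:
        table = connection.table(table_name)

        # Full row: every column in every column family.
        row, elapsed = time_row_fetch(table, row_key)
        print(f"full row: {len(row)} cells in {elapsed:.3f}s")

        # Same row restricted to one column family, for comparison.
        if family is not None:
            row, elapsed = time_row_fetch(table, row_key, columns=[family])
            print(f"family {family!r}: {len(row)} cells in {elapsed:.3f}s")
    finally:
        connection.close()
```

Calling something like `run_against_cluster("primary-node-hostname", "genomic_data", b"sample-0001", family=b"cf1")` against a live cluster prints both timings. Comparing the two should help show whether the time is dominated by the sheer number of cells returned through Thrift or by the lookup itself.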
I am aware that other large industry players use HBase, and I was wondering what we can try to optimize performance. While these times aren't terrible for a single row, the application will eventually need to put thousands of row-keys and retrieve thousands of row-keys with all of their columns. Given the scaling we have seen so far, that would be untenable for our needs.
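To make the target workload concrete, the bulk access pattern we expect to need looks roughly like the sketch below. This is not code we have benchmarked. The helper names and the chunk and batch sizes are arbitrary placeholders, and `table` is assumed to be a happybase Table.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_rows_in_chunks(table, row_keys, chunk_size=100, columns=None):
    """Retrieve many rows using one multi-row get per chunk.

    Uses Table.rows(), which returns a list of (row_key, row_dict)
    tuples. Returns a dict mapping row key -> row dict.
    """
    results = {}
    for chunk in chunked(row_keys, chunk_size):
        for key, data in table.rows(chunk, columns=columns):
            results[key] = data
    return results


def put_rows_batched(table, rows, batch_size=100):
    """Write many rows through a happybase batch.

    `rows` maps row key -> {b"family:qualifier": value}. The batch
    sends its queued mutations every `batch_size` puts and once more
    when the `with` block exits.
    """
    with table.batch(batch_size=batch_size) as batch:
        for key, data in rows.items():
            batch.put(key, data)
```

With rows this wide, each chunk still moves a lot of data through the single Thrift server, so the chunk size that works in practice may need to be much smaller than the placeholder value here.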
I have minimal experience with distributed NoSQL technologies like HBase, so any suggestions or help would be appreciated.
Cluster setup:
1 Master node, 3 Core nodes
m4.large instances
Things we have tried:
- Adjusting number of column families
- Using HDFS instead of S3 as the datastore