Use case HBase on EMR


I read the documentation on AWS, but a point is still unclear.

Is S3 the primary storage of an EMR cluster, or does the data live on the EC2 instances with S3 holding just a copy?

From the docs:

  • "HBase on Amazon EMR provides the ability to back up your HBase data directly to Amazon Simple Storage Service (Amazon S3)"

  • "Hadoop clusters running on Amazon EMR use EC2 instances as virtual Linux servers for the master and slave nodes, Amazon S3 for bulk storage of input..."

  • "provides the ability to launch a new cluster and populate it with data from a previous HBase backup"

My use case: use HBase to store terabytes of data, and update my tables only two or three times a month by starting an EMR cluster. The tables would be stored on S3.


There are 2 answers

3
ChristopherB (BEST ANSWER)

The key question in your use case is how the data should be available between updates.

If your goal is to have the data accessible through an HBase interface all the time, then an HBase cluster (for example on EMR) needs to be up and running continuously. HBase currently supports only HDFS as live storage for HFiles. S3 storage is external to the cluster, so it can be used as a destination for backups or for other ingress/egress of data.

0
Sergei Rodionov

As of EMR 5.2.0 you can run HBase 1.3.0 and higher directly on AWS S3.

The setting replaces the hdfs:// protocol in the hbase-site.xml file:

"hbase.rootdir": "s3://my-bucket/hbase"

No changes to HBase clients are required. The configuration simplifies operations by eliminating the need to manage HDFS NameNode and DataNodes.
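For illustration, here is a minimal sketch of how that setting might be expressed as EMR configuration classifications when creating the cluster. It assumes the classification format described in the EMR documentation, where an "hbase" classification with "hbase.emr.storageMode" set to "s3" accompanies the "hbase-site" root directory; that second property comes from the AWS docs, not from the setting quoted above. The helper name hbase_on_s3_configurations is hypothetical, and the bucket path is the placeholder used above.

```python
import json


def hbase_on_s3_configurations(root_dir: str) -> list:
    """Build EMR configuration classifications for HBase with an S3 root directory.

    Hypothetical helper: it only assembles the JSON structure and does not
    call any AWS API.
    """
    if not root_dir.startswith("s3://"):
        raise ValueError("root_dir must be an s3:// URI")
    return [
        # Assumption from the EMR docs: switches HBase storage mode to S3.
        {"Classification": "hbase", "Properties": {"hbase.emr.storageMode": "s3"}},
        # The setting quoted above: the HBase root directory on S3.
        {"Classification": "hbase-site", "Properties": {"hbase.rootdir": root_dir}},
    ]


if __name__ == "__main__":
    # Print JSON that could be saved to a file and supplied at cluster creation.
    print(json.dumps(hbase_on_s3_configurations("s3://my-bucket/hbase"), indent=2))
```

The resulting JSON would typically be passed as the cluster's configurations when it is launched, so each of the occasional update runs can start a fresh cluster pointing at the same S3 root directory.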