HBase or Cassandra?


In my lambda architecture, I am debating whether to use HDFS or Cassandra to store my immutable data. I need Cassandra to serve online requests, so it is a mandatory part of the tech stack. I do not want to introduce a new tool (HDFS) into the stack if I don't have to. So my question is: what will I be missing if I skip HDFS and use Cassandra to host my immutable data as well?

EDIT:

I understand that HDFS is a distributed filesystem and Cassandra is a NoSQL database. Still, both support data replication and both support high-throughput writes. In addition, Cassandra supports low-latency data retrieval. So am I right in saying that HDFS isn't going to give me much of a lift?


There are 2 answers

0
onrdncl On BEST ANSWER

As I understand it, you are trying to settle the serving layer of your Lambda architecture. If so, you want to store your batch views and real-time views in a database. I also gather that you do not have a Hadoop cluster in your batch layer, so your batch views are not computed in HDFS; at this point your architecture sits outside of HDFS.

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is open source and horizontally scalable. If you don't want a Hadoop cluster, leave HBase out. Cassandra is a distributed, column-oriented NoSQL database that runs independently of a Hadoop cluster and HDFS. If I understand your architecture and needs correctly, I think Cassandra is the best fit for you.
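To make the batch view / real-time view split concrete, here is a minimal sketch of a serving-layer read that merges the two views at query time. It is only an illustration under stated assumptions: the `Event` schema, the function names, and the `batch_cutoff` value are all hypothetical, and plain in-memory counters stand in for what would be Cassandra tables in a real deployment.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable fact in the master dataset (hypothetical schema)."""
    user: str
    timestamp: int


def build_batch_view(master_dataset, up_to):
    """Recompute the batch view from scratch over all events before `up_to`."""
    return Counter(e.user for e in master_dataset if e.timestamp < up_to)


def build_realtime_view(master_dataset, since):
    """Cover only the events the last batch run has not seen yet."""
    return Counter(e.user for e in master_dataset if e.timestamp >= since)


def query(batch_view, realtime_view, user):
    """Serving-layer read: merge both views at query time."""
    return batch_view[user] + realtime_view[user]


# The master dataset is append-only; events are never updated in place.
master_dataset = [
    Event("alice", 1),
    Event("bob", 2),
    Event("alice", 3),
    Event("alice", 5),
]

batch_cutoff = 4  # hypothetical: the last batch run covered timestamps < 4
batch_view = build_batch_view(master_dataset, batch_cutoff)
realtime_view = build_realtime_view(master_dataset, batch_cutoff)

print(query(batch_view, realtime_view, "alice"))
print(query(batch_view, realtime_view, "bob"))
```

Whichever store holds the immutable events, the read path stays the same: the batch view answers for everything up to the last batch run, and the real-time view fills in the rest.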

Additionally, you can get a quick overview of the Lambda architecture from this link: http://artofbigdata.blogspot.com.tr/2016/01/lambda-architecture.html

0
sras On

HDFS supports several file formats for storage, for example sequence files, Avro, and Parquet, so you can choose the format that suits your application's needs.

Also note that you can read the data efficiently using SQL-like queries, for example through an engine such as Hive.

So HDFS offers a wider choice of data models than Cassandra for hosting the data.