Are a column family's values stored next to each other on disk in HBase? In other words, is HBase column-oriented?

I'm trying to understand whether HBase is a column-oriented database. I understand the structure of a single HBase row: it is divided into column families (which are static and don't change), and each column family can hold a dynamic number of columns:

row: row-key1, familyA:a1 familyA:a2... familyB:b1,familyB:b2,familyB:b3

Now, it is stated that a column family is stored together on disk, so the familyA:a1 and familyA:a2 columns of row-key1 will be stored together on disk.

But what about the familyA:a1 and familyA:a2 values in two different rows? Are they also stored one after the other? That would mean that HBase is column-oriented.

Everywhere I look, I see HBase described as a wide-column store. Is that the same as column-oriented?

There is 1 answer

Answer by Ashu Pachauri

Before answering the question, I want to point out one thing about the HBase use case that will make the HFile layout easier to understand. From a read-workload perspective, HBase is optimized for random key-value lookups in very long and wide tables (trillions of rows and millions of columns). It also works reasonably well for scans based on a rowkey prefix, but it is not built for large single-column scans.

That said, HBase is not a truly columnar database; being a wide-column store does not make it one. HBase stores all columns for the same row key and the same column family together. However, different column families are stored in different files, which gives HBase a partly columnar nature: you can control the configuration of each column family independently, and you can scan a single column family without paying read costs for columns in other families. This is what a single HFile looks like (note that a column is called a qualifier in HBase, and Type can be a Put or a Delete):

RowKey1:Family1:Qualifier1:Timestamp1:Type:Value
RowKey1:Family1:Qualifier1:Timestamp2:Type:Value
RowKey1:Family1:Qualifier2:Timestamp0:Type:Value
RowKey1:Family1:Qualifier3:Timestamp2:Type:Value
RowKey2:Family1:Qualifier1:Timestamp0:Type:Value
RowKey2:Family1:Qualifier2:Timestamp2:Type:Value

Notice that Qualifier1 of RowKey1 and Qualifier1 of RowKey2 are not adjacent. Instead, all columns for the same row key (for example, RowKey1) are adjacent.
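To make the layout concrete, here is a minimal Python sketch that groups cells into one sorted list per column family and then checks where a given qualifier ends up. It is only an in-memory illustration: `build_files`, `positions`, the sample cells and the extra `Family2` are made up for this example, and the plain lists merely stand in for HFiles.

```python
from collections import defaultdict

# Hypothetical cells: (row_key, family, qualifier, timestamp, type, value)
cells = [
    ("RowKey2", "Family1", "Qualifier1", 0, "Put", "v5"),
    ("RowKey1", "Family2", "Qualifier9", 1, "Put", "v7"),
    ("RowKey1", "Family1", "Qualifier2", 0, "Put", "v3"),
    ("RowKey1", "Family1", "Qualifier1", 1, "Put", "v1"),
    ("RowKey2", "Family1", "Qualifier2", 2, "Put", "v6"),
    ("RowKey1", "Family1", "Qualifier3", 2, "Put", "v4"),
]


def build_files(cells):
    """Group cells into one sorted list per column family (a stand-in for HFiles)."""
    files = defaultdict(list)
    for cell in cells:
        files[cell[1]].append(cell)
    for family_cells in files.values():
        # Row key comes first, then qualifier, so a row's columns stay together.
        family_cells.sort(key=lambda c: (c[0], c[2]))
    return dict(files)


def positions(file_cells, qualifier):
    """Return the indexes at which a qualifier appears inside one file."""
    return [i for i, c in enumerate(file_cells) if c[2] == qualifier]


files = build_files(cells)
for family, file_cells in files.items():
    print(family)
    for c in file_cells:
        print("  " + ":".join(str(x) for x in c))

# Qualifier1 appears at indexes 0 and 3 of the Family1 file: not adjacent.
print(positions(files["Family1"], "Qualifier1"))
```

In a column-oriented layout the two Qualifier1 cells would sit side by side; here the other columns of RowKey1 sit between them, while cells of a different family never share the file at all.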

If you stored every column in its own column family, HBase would become a truly columnar store, but then it could not support millions of columns, because of the single-row, across-columns ACID semantics it offers and the locking strategy it uses to implement them.

Edit

Given the above structure, the data in an HFile is stored sorted by the following key (note that one file holds only one family, so storing the family name in the data itself is somewhat redundant, but it has other uses that are outside the scope of this question):

RowKey:Family:Qualifier:Timestamp:Type

This sort order, combined with block-level indexes and bloom filters on HFiles, makes HBase very fast at locating any random RowKey, a (RowKey, Family:Qualifier) tuple, or a (RowKey, Family:Qualifier, Timestamp) tuple.
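Why a sorted key makes all three lookups cheap can be sketched in a few lines of Python. This is a simplification, not how HBase is implemented: a binary search over an in-memory list stands in for the block index, bloom filters are left out, and the Type component is omitted from the key. `sort_key` and `scan_prefix` are hypothetical names, and the timestamp is negated on the assumption that newer versions should sort first, which is how HBase orders versions of a cell.

```python
import bisect


def sort_key(row, family, qualifier, timestamp):
    # Negating the timestamp places newer versions before older ones.
    return (row, family, qualifier, -timestamp)


# Hypothetical cells: (row_key, family, qualifier, timestamp, type, value)
cells = [
    ("RowKey1", "Family1", "Qualifier1", 1, "Put", "a"),
    ("RowKey1", "Family1", "Qualifier1", 2, "Put", "b"),
    ("RowKey1", "Family1", "Qualifier2", 0, "Put", "c"),
    ("RowKey2", "Family1", "Qualifier1", 0, "Put", "d"),
]
cells.sort(key=lambda c: sort_key(*c[:4]))
keys = [sort_key(*c[:4]) for c in cells]


def scan_prefix(prefix):
    """Return every cell whose key starts with `prefix`, found by binary search."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for i in range(start, len(keys)):
        if keys[i][:len(prefix)] != prefix:
            break  # keys are sorted, so no later key can match
        matches.append(cells[i])
    return matches


# Lookup by row key only.
print(scan_prefix(("RowKey1",)))
# Lookup by (row key, family, qualifier): all versions, newest first.
print(scan_prefix(("RowKey1", "Family1", "Qualifier1")))
# Lookup by (row key, family, qualifier, timestamp): one exact version.
print(scan_prefix(("RowKey1", "Family1", "Qualifier1", -1)))
```

Each lookup is a prefix of the same sort key, so one ordering serves all three access patterns. A scan of Qualifier1 across every row, by contrast, has no usable prefix and would have to walk the whole file.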