I've been trying to pick the "right" technology for a 360-degree customer application. The requirements are:
- A wide-column table: each customer is one row, with many columns (say, > 1000).
- We run ~20 batch analytics jobs daily. Each job queries and updates a small set of columns across all rows; this includes aggregating the data for reporting and loading/saving the data for machine learning algorithms.
- We update customers' info in several columns, touching <= 1 million rows per day; the update workload is spread out across working hours. The table has more than 200 million rows.
I have tried HBase; points 1 and 3 are met. But I found that doing analytics (load/save/aggregate) on HBase is painfully slow: it can be 10x slower than doing the same with Parquet. I don't understand why, since both Parquet and HBase are columnar DBs, and we have spread the workload across the HBase cluster quite well (the "requests per region" metric says so).
Any advice? Am I using the wrong tool for the job?
This assumption is wrong: `HFile` is not column-oriented (Parquet is). An HBase full scan is generally much slower than the equivalent raw HDFS file scan, because HBase is optimized for random-access patterns. You didn't specify exactly how you scanned the table: `TableSnapshotInputFormat` is much faster than the naive `TableInputFormat`, yet still slower than a raw HDFS file scan.
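For reference, here is a minimal sketch of what a snapshot-based scan can look like with the MapReduce API, assuming a snapshot of the customer table already exists (called `customers_snap` here), a scratch HDFS directory `/tmp/snap_restore`, and a column family named `profile`; all three names are placeholders, not something from your setup. The key call is `TableMapReduceUtil.initTableSnapshotMapperJob`, which wires the job to read the snapshot's HFiles directly from HDFS instead of going through the RegionServers via `TableInputFormat`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SnapshotScanJob {

    // Trivial mapper: emits one count per row. Rows arrive straight from the
    // snapshot's HFiles on HDFS, bypassing the RegionServers entirely.
    public static class CountMapper extends TableMapper<NullWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(NullWritable.get(), new LongWritable(1L));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-customers-snapshot");
        job.setJarByClass(SnapshotScanJob.class);

        // Restrict the scan to the one column family this analytics job needs;
        // "profile" is a placeholder family name.
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("profile"));

        // Read the snapshot files directly from HDFS instead of going through
        // TableInputFormat and the live cluster.
        TableMapReduceUtil.initTableSnapshotMapperJob(
                "customers_snap",               // existing snapshot name (example)
                scan,
                CountMapper.class,
                NullWritable.class,
                LongWritable.class,
                job,
                true,                           // ship HBase jars with the job
                new Path("/tmp/snap_restore")); // scratch dir for restored file links

        // Map-only job; write mapper output directly.
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Creating the snapshot itself is a one-liner in the HBase shell (`snapshot 'customers', 'customers_snap'`), and the restore directory only receives links to the existing HFiles, so the setup cost is small compared to the scan.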