I've been trying to pick the "right" technology for a 360-degree customer application. The requirements are:
- A wide-column table: each customer is one row, with many columns (say, > 1000).
- We run ~20 batch analytics jobs daily. Each job queries and updates a small set of columns across all rows; this includes aggregating the data for reporting and loading/saving the data for machine learning algorithms.
- We update customers' info in several columns, touching <= 1 million rows per day; the update workload is spread out across working hours. The table has more than 200 million rows.
I have tried HBase; points 1 and 3 are met. But I found that doing analytics (load/save/aggregate) on HBase is painfully slow: it can be 10x slower than doing the same with Parquet. I don't understand why, since both Parquet and HBase are columnar DBs, and we have spread the workload across the HBase cluster quite well (the "requests per region" metric says so).
Any advice? Am I using the wrong tool for the job?
This assumption is wrong: `HFile` is not column-oriented (Parquet is). An HBase full scan is generally much slower than the equivalent raw HDFS file scan, because HBase is optimized for random-access patterns. You didn't specify exactly how you scanned the table: `TableSnapshotInputFormat` is much faster than the naive `TableInputFormat`, yet still slower than a raw HDFS file scan.
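For reference, here is a minimal sketch of what a snapshot-based scan can look like with the MapReduce API, assuming a snapshot of the customer table already exists (called `customers_snap` here), a scratch HDFS directory `/tmp/snap_restore`, and a column family named `profile`; all three names are placeholders, not something from your setup. The key call is `TableMapReduceUtil.initTableSnapshotMapperJob`, which wires the job to read the snapshot's HFiles directly from HDFS instead of going through the RegionServers via `TableInputFormat`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SnapshotScanJob {

    // Trivial mapper: emits one count per row. Rows arrive straight from the
    // snapshot's HFiles on HDFS, bypassing the RegionServers entirely.
    public static class CountMapper extends TableMapper<NullWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(NullWritable.get(), new LongWritable(1L));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-customers-snapshot");
        job.setJarByClass(SnapshotScanJob.class);

        // Restrict the scan to the one column family this analytics job needs;
        // "profile" is a placeholder family name.
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("profile"));

        // Read the snapshot files directly from HDFS instead of going through
        // TableInputFormat and the live cluster.
        TableMapReduceUtil.initTableSnapshotMapperJob(
                "customers_snap",               // existing snapshot name (example)
                scan,
                CountMapper.class,
                NullWritable.class,
                LongWritable.class,
                job,
                true,                           // ship HBase jars with the job
                new Path("/tmp/snap_restore")); // scratch dir for restored file links

        // Map-only job; write mapper output directly.
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Creating the snapshot itself is a one-liner in the HBase shell (`snapshot 'customers', 'customers_snap'`), and the restore directory only receives links to the existing HFiles, so the setup cost is small compared to the scan.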