I have text files with 16 columns (user_id, date, other attributes, ...). The whole dataset is about 200 GB and has 1.8+ billion rows. I have created an external table (ex_raw_data) to read this data, and I have an aggregation script that generates 200+ derived variables. When I run my query on the external table (ex_raw_data) it takes a long time to compute, since the files are plain text and both the row count and the number of computed variables are huge. Now I am planning to create a new Hive table that stores the data in ORC format, but the problem is how to decide the right number of buckets when creating the table. Below is my cluster configuration (on AWS) and the ORC table structure:
Namenode: 32 cores, 60 GB RAM, 180 GB SSD
4 Data Nodes: 8 cores each, 30 GB RAM each
Total HDFS size: 1.5 TB SSD
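For reference, the external table over the raw text files is essentially the one below (the attribute column names and the location are placeholders, not the real 16 columns):

create external table ex_raw_data
(
user_id string,
`date` int,
attrib1 string,
attrib2 string,
attrib3 int
-- ...plus the remaining attribute columns, 16 in total
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/raw/ex_raw_data/';   -- placeholder path

And the ORC table I am planning to load the data into: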
create table raw_data_orc
(
user_id string,
`date` int,
attrib1 string,
attrib2 string,
attrib3 int,
.
.
.
)
CLUSTERED BY(user_id) INTO 32 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="ZLIB");
set hive.enforce.bucketing = true;  -- needed on older Hive (pre-2.0) so the insert actually writes 32 bucket files
INSERT OVERWRITE TABLE raw_data_orc select * from ex_raw_data;
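For what it is worth, the only hard numbers I have to base the bucket count on are the raw data size (~200 GB of text) and the figures below; the query runs against the external table, and the HDFS path in the comment is a placeholder:

-- Row count (1.8+ billion) and user_id cardinality, since the table is bucketed by user_id:
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT user_id) AS distinct_users
FROM ex_raw_data;

-- Raw size on disk, before ORC/ZLIB compression:
-- hdfs dfs -du -s -h /data/raw/ex_raw_data/   (placeholder path, reports ~200 GB)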
What are the best practices for deciding the number of buckets for my ORC table, given the cluster configuration above? Appreciate your help!