Choosing the right number of buckets in a Hive table

I have text files with 16 columns (user_id, date, other attributes, ...); the files total about 200 GB and contain more than 1.8 billion rows. I have created an external table (ex_raw_data) to read this data, and I have an aggregation script that generates more than 200 derived variables. When I run my query against the external table (ex_raw_data) it takes a very long time, since the files are plain text and both the number of rows and the number of computed variables are large. I am now planning to create a new Hive table that stores the data in ORC format, but the problem is how to decide the right number of buckets when creating the table. Below are my cluster configuration (on AWS) and the ORC table structure:

    Namenode: 32 cores, 60 GB RAM, 180 GB SSD
    4 Data Nodes: 8 cores and 30 GB RAM on each data node
    Total HDFS size: 1.5 TB SSD


    create table raw_data_orc
    (
    user_id string,
    `date` int,        -- backticked: date is a reserved keyword in newer Hive versions
    attrib1 string,
    attrib2 string,
    attrib3 int,
    .
    .
    .
    )
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC tblproperties ("orc.compress"="ZLIB");

    INSERT OVERWRITE TABLE raw_data_orc SELECT * FROM ex_raw_data;
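
One thing I am not sure about: as far as I know, on Hive versions before 2.0 hive.enforce.bucketing defaults to false, so I assume the insert needs something like this to actually produce the 32 bucket files (please correct me if that is wrong):

    -- assuming Hive < 2.0; on 2.0+ bucketing is always enforced
    set hive.enforce.bucketing = true;
    INSERT OVERWRITE TABLE raw_data_orc SELECT * FROM ex_raw_data;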

What are the best practices for deciding the number of buckets for my ORC table, given the cluster configuration above? Appreciate your help!
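
For reference, this is my rough reasoning so far; the ORC/ZLIB compression ratio is only my guess and I have not measured the real table size yet, so please point out anything that is off:

    -- my own back-of-the-envelope sizing (assumed numbers, not measured):
    --   raw text input:          ~200 GB
    --   guessed ORC+ZLIB ratio:  ~3-4x smaller  =>  roughly 50-65 GB stored
    --   target per bucket file:  ~1-2 HDFS blocks (128-256 MB)
    --   => ~50 GB / 256 MB is about 200 buckets, or a multiple of the
    --      32 concurrent tasks my 4 data nodes x 8 cores can run (e.g. 64 or 128)
    -- after a test load, check the real size before fixing the bucket count,
    -- since changing it later means re-loading the data:
    ANALYZE TABLE raw_data_orc COMPUTE STATISTICS;
    DESCRIBE FORMATTED raw_data_orc;   -- look at totalSize and numFiles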
