How partitioning and clustered by works in Hive table?

1k views Asked by At

I'm trying to understand below query by using that how data is going to be placed.

CREATE TABLE mytable ( 
         name string,
         city string,
         employee_id int ) 
PARTITIONED BY (year STRING, month STRING, day STRING) 
CLUSTERED BY (employee_id) INTO 256 BUCKETS

The keyword PARTITIONED BY will distribute the data in below like dir structure.

/user/hive/warehouse/mytable/y=2015/m=12/d=02

But am not able to understand, how employee_id will be distributed among these directories ? 256 buckets (files) will be created, and all those files will be having all employee_id but which file will sit under which dir, how that will be decided ?

Can anyone help me to understand this ?

1

There are 1 answers

0
leftjoin On BEST ANSWER

Directories are partitions or table location. Buckets are files inside these directories.

Complex partitions are hierarchical directories. In your case:

`/user/hive/warehouse/mytable/` - Table location, contains partition directories:
  `y=2015/` - year partition directory, contains months directories: 
    `m=12/` - month partition, contains days partitions directories:
      `d=02/` - day partition, contains 256 files(buckets)
        00000
        ...
        00255
     `d=03/` -Each day partition will contain 256 files (if you have enough data)
        00000
        ...
        00255     

Each file will contain not all employee_id. Which record will sit in which file is decided using this formula:

 bucket_number=hash_function(employee_id) MOD 256 

Where hash_function is integer, in case of Int employee_id it is equal to employee_id.

256 - is the number of buckets

MOD 256 will produce integer values in the range [0..255], corresponding to bucket numbers.

The same id's will always be in the same buckets. Each daily partition will contain it's own files(buckets), up to 256 buckets in each.

Say, employee_id = 1024 goes into bucket 0, if the same employee_id exists in many days, it will be in file 00000 in each day directory.

employee_id=1050 goes into file 000026 because 1050 MOD 256 = 26.

So, first the data is partitioned by partition key, inside partitions it is bucketed (distributed between files).