I'm trying to understand below query by using that how data is going to be placed.
CREATE TABLE mytable (
name string,
city string,
employee_id int )
PARTITIONED BY (year STRING, month STRING, day STRING)
CLUSTERED BY (employee_id) INTO 256 BUCKETS
The keyword PARTITIONED BY will distribute the data in below like dir structure.
/user/hive/warehouse/mytable/y=2015/m=12/d=02
But am not able to understand, how employee_id will be distributed among these directories ? 256 buckets (files) will be created, and all those files will be having all employee_id but which file will sit under which dir, how that will be decided ?
Can anyone help me to understand this ?
Directories are partitions or table location. Buckets are files inside these directories.
Complex partitions are hierarchical directories. In your case:
Each file will contain not all employee_id. Which record will sit in which file is decided using this formula:
Where
hash_functionis integer, in case of Intemployee_idit is equal to employee_id.256- is the number of bucketsMOD 256will produce integer values in the range [0..255], corresponding to bucket numbers.The same id's will always be in the same buckets. Each daily partition will contain it's own files(buckets), up to 256 buckets in each.
Say, employee_id = 1024 goes into bucket 0, if the same employee_id exists in many days, it will be in file 00000 in each day directory.
employee_id=1050goes into file 000026 because1050 MOD 256 = 26.So, first the data is partitioned by partition key, inside partitions it is bucketed (distributed between files).