Hive clustered by on more than one column

Question

Hive clustered by on more than one column

10.6k views Asked by Manikandan Kannan At 16 June 2015 at 15:02

I understand that when the hive table has clustered by on one column, then it performs a hash function of that bucketed column and then puts that row of data into one of the buckets. And there is a file for each bucket i.e. if there are 32 buckets then there are 32 files in hdfs.

What does it mean to have the clustered by on more than one column? For example, lets say that the table has CLUSTERED BY (continent, country) INTO 32 BUCKETS.

How would the hash function be performed if there are more than one column?

How many files would be generated? Is this still 32?

Original Q&A

There are 2 answers

krishna.kadigari On 16 June 2015 at 18:20

In general, the bucket number is determined by the expression hash_function(bucketing_column) mod num_buckets. (There's a '0x7FFFFFFF in there too, but that's not that important). The hash_function depends on the type of the bucketing column. For an int, it's easy, hash_int(i) == i. For example, if user_id were an int, and there were 10 buckets, we would expect all user_id's that end in 0 to be in bucket 1, all user_id's that end in a 1 to be in bucket 2, etc. For other datatypes, it's a little tricky. In particular, the hash of a BIGINT is not the same as the BIGINT. And the hash of a string or a complex datatype will be some number that's derived from the value, but not anything humanly-recognizable. For example, if user_id were a STRING, then the user_id's in bucket 1 would probably not end in 0. In general, distributing rows based on the hash will give you a even distribution in the buckets.

ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables

**Maddy RS** · Accepted Answer · 2015-06-17T15:58:39+00:00

Maddy RS On 17 June 2015 at 15:58 BEST ANSWER

Yes the number of files will still be 32.
Hash function will operate by considering "continent,country" as a single string and then will use this as input.

Hope it helps!!

TechQA.

Hive clustered by on more than one column

There are 2 answers

Related Questions in HADOOP

Related Questions in HIVE

Related Questions in BUCKETS

Popular Questions

Trending Questions