Hive distribute by vs without distribute by

Question

Asked by Manikandan Kannan on 21 June 2015 at 17:00

This may sound basic, but the question has been haunting me for a while.

Let's say I have the following query:

SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC;

In this case, if the data is well spread across the symbol column, it makes sense to distribute on the symbol column so that every reducer gets a fair share of the data. Changing the query to the following should give better performance:

SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;

What is the effect if I don't specify the DISTRIBUTE BY clause? What is the default map output key chosen in the first query, i.e. which column is the data distributed on?

There are 2 answers

jigarshah On 22 June 2015 at 15:52

The details are in the Hive language manual. I think this is the answer you are looking for: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

**Manikandan Kannan** · Accepted Answer · 2015-06-25T17:41:43+00:00

I found the answer myself. With SORT BY alone, the mapper output is not distributed to the reducers by the SORT BY column; the key that decides the reducer could be something else, such as the file offset of the record. The output of each reducer is sorted, but the same SORT BY column value can appear in the output of more than one reducer, so the reducers' outputs overlap. DISTRIBUTE BY splits the data among the reducers based on the DISTRIBUTE BY column, which ensures that all rows with the same column value go to the same reducer and therefore end up in the same output file.
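
To make the overlap concrete, here is a small Python simulation of the idea. It is only a sketch and not how Hive is implemented: the sample rows, the three reducers, the use of the row offset as the "unrelated" key, and the helper names `run_reducers` and `symbols_in_many_reducers` are all assumptions made for illustration.

```python
from zlib import crc32


def run_reducers(rows, num_reducers, partition):
    """Send each row to a reducer, then sort inside each reducer (the SORT BY step)."""
    buckets = [[] for _ in range(num_reducers)]
    for offset, row in enumerate(rows):
        buckets[partition(offset, row) % num_reducers].append(row)
    return [sorted(bucket) for bucket in buckets]


def symbols_in_many_reducers(outputs):
    """Return the symbols that show up in the output of more than one reducer."""
    seen = {}
    for reducer_id, output in enumerate(outputs):
        for symbol, _ymd in output:
            seen.setdefault(symbol, set()).add(reducer_id)
    return {symbol for symbol, reducers in seen.items() if len(reducers) > 1}


# Hypothetical sample data: (symbol, ymd) pairs standing in for the stocks table.
rows = [
    (symbol, f"2015-06-{day:02d}")
    for symbol in ("AAPL", "GOOG", "IBM", "MSFT")
    for day in range(1, 11)
]

# SORT BY only: the reducer is picked by something unrelated to the symbol
# (modelled here as the row offset, which is an assumption).
sort_only = run_reducers(rows, 3, lambda offset, row: offset)

# DISTRIBUTE BY symbol + SORT BY: the reducer is picked by a hash of the symbol.
distributed = run_reducers(rows, 3, lambda offset, row: crc32(row[0].encode()))

print(sorted(symbols_in_many_reducers(sort_only)))    # ['AAPL', 'GOOG', 'IBM', 'MSFT']
print(sorted(symbols_in_many_reducers(distributed)))  # []
```

In both runs every reducer's output is sorted, but only in the second run does each symbol live in exactly one reducer's output. Note that hashing on the symbol does not guarantee an even load: with few distinct symbols, some reducers may get more data than others or none at all.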
