This may sound basic but the question haunts me for a while.
Lets say i have the following query
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
SORT BY s.symbol ASC;
In this case, if the data has good spread on the symbol column then it makes sense to distribute based on the symbol column so that all reducers get good share of the data; Changing the query to the following would give a better performance
SELECT s.ymd, s.symbol, s.price_close FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol ASC, s.ymd ASC;
What is the effect if i don't specify the distribute by clause? What is the default map output key column chosen in the first query i.e. what is the column that its distributed on?
I found the answer myself. With sort by, the output key from the mapper is not the column on which sort by is applied. The key could be the file offset of the record. The output from reducers is sorted per reducer but the same sort by column value can appear in the output of more than one reducers. This means that there is an overlap among the output of the reducers. Distribute by ensures that the data is split among the reducers based on the distribute by column and so by ensuring that the same column value go to the same reducer and so the same out file.