Configuring the number of reducers for a particular DoFn in Apache Crunch

I understand that there are properties like CRUNCH_BYTES_PER_REDUCE_TASK or mapred.reduce.tasks to set the number of reducers.

Can anyone suggest how to configure or override the default number of reducers for a particular DoFn that is taking a long time to execute?

There are 2 answers

Shirish Kulkarni On BEST ANSWER

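You can configure reducers for a particular DoFn by building a ParallelDoOptions object and passing it as the last argument to parallelDo (the fourth argument in the named form, parallelDo(name, doFn, type, options)), like this:

ParallelDoOptions opts = ParallelDoOptions.builder().conf("mapred.reduce.tasks", "64").build();
PCollection<String> out = input.parallelDo("slow-step", new MyDoFn(), Writables.strings(), opts); // input, MyDoFn and the PType are placeholders for your own pipeline

As far as I know, Crunch adds these extra configuration settings to the job that runs the DoFn.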

Mzzzzzz On

Crunch's DoFn class (which MapFn extends) includes the scaleFactor method. From the user guide:

You can override the scaleFactor method in your custom DoFns in order to provide a hint to the Crunch planner about how much larger (or smaller) an input data set will become after passing through the process method. If the groupByKey method is called without an explicit number of reducers provided, the planner will try to guess how many reduce tasks should be used for the job based on the size of the input data, which is determined in part by using the result of calling the scaleFactor method on the DoFns in the processing path.

Source: http://crunch.apache.org/user-guide.html#doplan

Javadocs link: http://crunch.apache.org/apidocs/0.15.0/org/apache/crunch/DoFn.html#scaleFactor--
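
To make the quoted passage concrete, here is a rough sketch in plain Java of how a scale factor could feed into a reducer estimate. It is a simplified model, not the Crunch planner's code: the class and method names are hypothetical, and the sizing rule (one reducer per fixed number of bytes, plus one, with a cap) and all the numbers are assumptions chosen for illustration.

```java
// Simplified model only; this is not the Crunch planner's source code.
class ReducerEstimate {

    // Estimated size after the data passes through a chain of DoFns:
    // the input size multiplied by each DoFn's scale factor in turn.
    static long estimateOutputBytes(long inputBytes, double... scaleFactors) {
        double size = inputBytes;
        for (double factor : scaleFactors) {
            size *= factor;
        }
        return (long) size;
    }

    // Assumed sizing rule: one reducer per bytesPerReduceTask of estimated
    // data, plus one, capped at maxReducers.
    static int estimateReducers(long estimatedBytes, long bytesPerReduceTask, int maxReducers) {
        long wanted = 1 + estimatedBytes / bytesPerReduceTask;
        return (int) Math.min(wanted, (long) maxReducers);
    }

    public static void main(String[] args) {
        long input = 10_000_000_000L;   // 10 GB of input (made-up figure)
        long perTask = 1_000_000_000L;  // assumed bytes per reduce task
        int cap = 500;                  // assumed upper bound on reducers

        // A DoFn that leaves the size unchanged vs. one that expands it 4x.
        System.out.println(estimateReducers(estimateOutputBytes(input, 1.0), perTask, cap));
        System.out.println(estimateReducers(estimateOutputBytes(input, 4.0), perTask, cap));
    }
}
```

With these made-up numbers the estimate goes from 11 reducers to 41. So if your slow DoFn emits much more data than it reads, returning a larger value from scaleFactor should nudge the planner toward more reducers, as long as groupByKey is called without an explicit reducer count.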