mapreduce job not setting compression codec correctly


Hi, I have an MR2 job that takes Avro data compressed with Snappy as input, processes it, and writes the result as Avro data to an output directory. The expectation is that this output Avro data should also be Snappy-compressed, but it is not. The MR job is map-only.

I have set the following properties in my code

conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");

But the output is still not Snappy-compressed.


There are 3 answers

Vikas Saxena (BEST ANSWER)

The following did the trick:

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.SnappyCodec.class);

Please note that this has to be done before setting the output path, and in the same order as shown above.
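To show how the two calls above fit into a full driver, here is a minimal sketch of a map-only job applying the fix. The class name, job name, and argument-based paths are placeholders for illustration, not from the original post; mapper and input/output format classes would be set as appropriate for your Avro data.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SnappyOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "snappy-output-example");
        job.setNumReduceTasks(0);  // map-only job, as in the question

        // Enable Snappy compression of the final job output first...
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // ...and only then set the input/output paths, per the note above.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```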

gwgyk

If you want to use Snappy, setting the parameter to org.apache.hadoop.io.compress.SnappyCodec is not enough. You also need to download Snappy from Google, build it, and copy the built native libraries into the Hadoop lib directory.

You can search Google for "how to use snappy on hadoop"; there is a post about it, but it was written in Chinese. link
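As a quick way to verify the point made above, recent Hadoop versions ship a built-in check that reports whether the native Snappy library can actually be loaded; this is an environment-dependent command, so the exact output will vary per installation.

```shell
# Lists the native libraries Hadoop can load; look for a line like
# "snappy: true /usr/lib/hadoop/lib/native/libsnappy.so" in the output.
# If it reports "snappy: false", the native library is missing.
hadoop checknative -a
```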

vefthym

What you have now is compression of the intermediate output of the map phase. Instead, you should use the following settings (see this presentation and especially slide 9 for more details):

job.setOutputFormatClass(SequenceFileOutputFormat.class);
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");

or any alternatives you wish, but do not include the word "map" in these configuration names; otherwise they will apply to the intermediate map output instead.
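To make the distinction above concrete, here is a side-by-side sketch of the two property families; `conf` is assumed to be the job's Hadoop Configuration, as in the question.

```java
// Intermediate (map-phase) output compression -- this is what the
// question's original settings control; it does NOT affect final output:
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");

// Final job output compression -- note there is no "map" in these names:
conf.set("mapreduce.output.fileoutputformat.compress", "true");
conf.set("mapreduce.output.fileoutputformat.compress.codec",
         "org.apache.hadoop.io.compress.SnappyCodec");
```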