writing/reading key/value pairs in sequence file format in Hadoop.


I have a MapReduce program whose output is currently all in text files. A sample of the program is below. What I do not understand is how to output the key/value pairs from the reducer in sequence file format. No, I can't just use a `SequenceFileFormat` specifier, because I'm using the Hadoop 0.20 library.

So what do I do? Below is a sample. The word count program is just one small part of my larger program; if I know how to do it with one, I can do it with the rest. Please help.

Word Count Reducer

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    System.out.println("reducer.output: " + key.toString() + " " + sum);

    context.write(key, new IntWritable(sum)); // RIGHT HERE!! OUTPUTS TO TEXT
}

Now here is the main program that runs this (I left out the mapper and other irrelevant details):

Configuration conf = new Configuration();

Job job = new Job(conf, "Terms");
job.setJarByClass(wordCount.class);

// Output key/value pairs, like a dictionary (remember Python dicts)
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Set the mapper and reducer classes
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

// Set the input and output formats. In this case, plain TEXT
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

I know how to convert a text file to a sequence file, and how to do the opposite; that isn't the issue here. I couldn't find any example of actually doing this inside a Hadoop program, which is why I'm stuck.

So what I want is for this program to write its key/value pairs to a sequence file instead of a text file.

I also want to know how to read IN a sequence file with the Mapper
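For reference, once a job is configured to read sequence-file input, the mapper just declares input types matching the key/value classes stored in the file; the framework hands over deserialized writables, so there is no text parsing. A minimal sketch, assuming the file holds `Text`/`IntWritable` pairs (the class name `SequenceReadMapper` is made up for illustration):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Reads Text/IntWritable records straight out of a sequence file.
// The first two generic parameters must match the types the file was written with.
public class SequenceReadMapper extends Mapper<Text, IntWritable, Text, IntWritable> {
    @Override
    public void map(Text key, IntWritable value, Context context)
            throws IOException, InterruptedException {
        // No parsing needed: key and value arrive already deserialized.
        context.write(key, value);
    }
}
```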

Any help would be greatly appreciated.

1 Answer

Daniel Langdon:

I believe it suffices to change the input and output formats; the key/value pairs should stay the same once things are encoded/decoded correctly. So use:

import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

&

job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
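In the asker's driver, that change amounts to swapping just the two format lines. A sketch of the wired-up driver, reusing the class names from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Same driver as in the question, with the two format lines swapped.
Configuration conf = new Configuration();
Job job = new Job(conf, "Terms");
job.setJarByClass(wordCount.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

// Read and write sequence files instead of plain text.
job.setInputFormatClass(SequenceFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
```

Note that the reducer's `context.write(key, new IntWritable(sum))` call does not change at all; only the output format decides how those pairs land on disk.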

Give it a try, as I have not done this in a while...
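To sanity-check the result outside MapReduce, a sequence file can also be read directly with `SequenceFile.Reader` (this constructor is the Hadoop 0.20-era API; the output path below is a hypothetical example, substitute the job's real output directory):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSeqFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical output path; point this at one of the job's part files.
        Path path = new Path("output/part-r-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        IntWritable value = new IntWritable();
        // next() fills the writables in place and returns false at end of file.
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value.get());
        }
        reader.close();
    }
}
```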