How to output multiple values with the same key in reducer?


I have a bunch of text files which are categorized and I would like to create a sequence file for each category in which the key is the category name and the value consists of all the textual content of all the files for the category.

I have a NoSQL database which has only two columns. Each row represents a file: the first column is the category name and the second is the absolute address of the text file stored on HDFS. My mapper reads the database and outputs pairs in which the key is the category and the value is the absolute address. On the reducer side, I have the addresses of all the files for each category, and I would like to create one sequence file for each category in which the key is the category name and the value consists of all the textual content of all the files belonging to that category.

A simple solution is to iterate through the pairs in the reducer, open the files one by one, append their content to a String variable, and at the end create a sequence file using MultipleOutputs. However, as the files may be large, appending all the content to a single String may not be possible. Is there any way to do this without using a String variable?


There is 1 answer

Ramzy

Since all the file addresses for a category arrive at the same reducer, you can read the content of each file there and append it to a StringBuilder instead of a String. This avoids the intermediate copies that repeated String concatenation creates, and you can discard the StringBuilder reference once the record is written. If avoiding String is your question, StringBuilder is a quick way. The I/O involved in opening and reading the files is resource intensive. The data itself, however, should be OK given the architecture of reducers in Hadoop.
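To make the StringBuilder suggestion concrete, here is a minimal sketch in plain Java. It is only an illustration under some assumptions: local files stand in for the HDFS addresses the reducer receives, the files are assumed to be UTF-8 text, and the class name CategoryConcat and method name concatenate are made up for this example. In a real reducer the files would be opened through Hadoop's FileSystem API (for example FileSystem.open) rather than java.nio.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: local paths stand in for the HDFS addresses of one category.
class CategoryConcat {

    // Reads every file in fixed-size chunks and appends the text to one StringBuilder,
    // in the order the paths are given. Assumes UTF-8 encoded text files.
    static String concatenate(List<Path> files) throws IOException {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[8192];
        for (Path file : files) {
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                int read;
                while ((read = reader.read(buffer)) != -1) {
                    content.append(buffer, 0, read);
                }
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two temporary files play the role of the files of a single category.
        Path first = Files.createTempFile("category-a-", ".txt");
        Path second = Files.createTempFile("category-a-", ".txt");
        Files.write(first, "first file\n".getBytes(StandardCharsets.UTF_8));
        Files.write(second, "second file\n".getBytes(StandardCharsets.UTF_8));

        System.out.print(concatenate(Arrays.asList(first, second)));

        Files.delete(first);
        Files.delete(second);
    }
}
```

Note that the StringBuilder still holds the whole category's text in memory at once. It removes the repeated copying that String concatenation causes, but it does not lift the size limit the question is worried about.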

You can also think about using a combiner, although its main purpose is to reduce the traffic between map and reduce. You could prepare part of the sequence file in the combiner and the rest in the reducer. Of course, this is valid only if the content can be appended as it arrives and does not depend on a specific order.