Combine output files of MapReduce job

Question

Asked by Arun Kumar on 14 December 2013 at 08:21 (3.6k views)

I have written a Mapper and a Reducer in Python and have run them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming.

The final result folder contains the output in three separate files: part-00000, part-00001 and part-00002. But I need the output as one single file. Is there a way I can do that?

Here is my code for the Mapper:

#!/usr/bin/env python

import sys

# Emit "<word>\t1" for every whitespace-separated word read from stdin.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))

And here is my code for the Reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
max_count=0

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        continue

    # Input arrives sorted by key, so equal words are adjacent.
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT, skipping words that start with '@'
            if current_word[0] != '@':
                print('%s\t%d' % (current_word, current_count))
                if count > max_count:
                    max_count = count
        current_count = count
        current_word = word

# Emit the last word, which the loop above never flushes.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

I need the output of this as one single file.

There are 4 answers

**AudioBubble** · Answer 1 · 2014-11-23T14:55:06+00:00

My solution to the above problem was to run the following HDFS command:

hadoop fs -getmerge /hdfs/path local_file

where /hdfs/path is the directory containing all the parts (part-*****) of the job output. The -getmerge option of hadoop fs merges all of the job output into a single file on the local file system.
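If the merge needs to be triggered from a Python script rather than typed by hand, a minimal sketch could wrap the same command with subprocess. The function names here are hypothetical, and the sketch assumes the hadoop executable is on the PATH of the machine running it.

```python
import subprocess


def build_getmerge_command(hdfs_dir, local_file):
    """Return the argument list for: hadoop fs -getmerge <hdfs_dir> <local_file>."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]


def merge_job_output(hdfs_dir, local_file):
    """Run getmerge; raises CalledProcessError if the command exits non-zero."""
    subprocess.check_call(build_getmerge_command(hdfs_dir, local_file))


# Example call (placeholder paths from the command above):
# merge_job_output("/hdfs/path", "local_file")
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.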

**James Mills** · Answer 2 · 2013-12-14T09:08:37+00:00

A really simple way of doing this (assuming a Linux/UNIX system and that the part files have already been copied to the local file system):

$ cat part-00000 part-00001 part-00002 > output
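The same concatenation can be sketched in Python, which may be convenient when the number of part files is not known in advance. The function name concat_parts is hypothetical, and the sketch assumes the part files are already in a local directory and that the output file name does not itself start with "part-".

```python
import glob
import os
import shutil


def concat_parts(parts_dir, output_path):
    """Concatenate every part-* file in parts_dir into output_path, in name order."""
    part_files = sorted(glob.glob(os.path.join(parts_dir, "part-*")))
    with open(output_path, "wb") as out:
        for path in part_files:
            with open(path, "rb") as part:
                # Stream in chunks so large part files are not loaded into memory.
                shutil.copyfileobj(part, out)
    return part_files


# Example call (hypothetical local directory):
# concat_parts("/tmp/output", "/tmp/output/all")
```

Sorting the file names keeps the parts in the same order that the shell would list them for cat.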

**Praveen Sripati** · Answer 3 · 2013-12-14T10:20:45+00:00

Use a single reducer for small datasets/processing, or use the getmerge option on the output files of the job.
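To see why the reducer count determines the number of output files, here is a toy simulation, not Hadoop code. Each reduce task writes its own part file, and every key is routed to exactly one reduce task. The crc32 hash is only a stand-in for Hadoop's own key hashing, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key (crc32 stands in for Hadoop's key hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def simulate_output_files(keys, num_reducers):
    """Group keys by the part file their reducer would write."""
    files = defaultdict(list)
    for key in keys:
        files["part-%05d" % partition(key, num_reducers)].append(key)
    return dict(files)


words = ["hadoop", "emr", "python", "streaming", "reduce", "map"]
three = simulate_output_files(words, 3)  # keys spread over up to three part files
one = simulate_output_files(words, 1)    # every key lands in part-00000
```

With one reducer all keys go through a single task, which is why this option only suits small datasets: the reduce phase loses its parallelism.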

**code_ada** · Answer 4 · 2015-04-01T16:46:02+00:00

I had the same problem recently. I thought a combiner should handle this task, but I couldn't get one working. What I did instead was this:

step1: mapper1.py reducer1.py
input: s3://../data/
output: s3://..../small_output/

step2: mapper2.py reducer2.py
input: s3://../data/
output: s3://..../output2/

step3: mapper3.py reducer3.py
input: s3://../output2/
output: s3://..../final_output/

Assume that step3 needs the output of step1 as a single file.

At the top of mapper2.py, there is this code:

import os

# Run the download only once per node: the flag file marks that it has started.
if not os.path.isfile('/tmp/s3_sync_flag'):
    os.system('touch /tmp/s3_sync_flag')
    # [download files to /tmp/output/]
    os.system('cat /tmp/output/part* > /tmp/output/all')

The if block guards against the download and merge running more than once when several mappers execute on the same node.
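One caveat with a check-then-touch flag is that two mappers starting at the same moment could both see the flag as missing. A possible variation, not part of the original answer, is to create the flag atomically so that exactly one process wins. The function name claim_flag is hypothetical.

```python
import errno
import os


def claim_flag(flag_path):
    """Atomically create flag_path; return True only for the caller that created it."""
    try:
        # O_CREAT | O_EXCL fails with EEXIST if the file already exists.
        fd = os.open(flag_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as exc:
        if exc.errno == errno.EEXIST:
            return False  # another process already claimed the flag
        raise
    os.close(fd)
    return True


# Example use at the top of a mapper (flag path taken from the answer above):
# if claim_flag('/tmp/s3_sync_flag'):
#     ... download the part files and concatenate them ...
```

This only decides which mapper does the work. Mappers that lose the race would still need some way to wait until the merged file is complete before reading it.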
