How to get the average number of words in a text in mrjob?


I'm stuck on a simple problem in the mrjob MapReduce framework: I want to get the average number of words per line in a given paragraph, and this is what I have so far:

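from mrjob.job import MRJob


class LineAverage(MRJob):

    def mapper(self, _, line):
        # Emit the word count of this line, plus a 1 so lines can be counted.
        numwords = len(line.split())
        yield "words", numwords
        yield "lines", 1

    def reducer(self, key, values):
        # Sum per key: total words under "words", total lines under "lines".
        yield key, sum(values)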

With this code I get, after the reduce step, the total number of lines and words in the text, but I don't know how to get the average by computing:

words/TotalOfLines

I am new to this programming model; if anyone can illustrate this example, it would be much appreciated.

In the meantime, thank you so much for your attention and participation.


There are 2 answers

0
Dade On BEST ANSWER

In the end, the answer was simple: I was actually sending the reducer a number of values equal to the number of lines. So in the reducer I just had to count the number of values for the key.

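from mrjob.job import MRJob


class LineAverage(MRJob):

    def mapper(self, _, line):
        # One ("words", count) pair per input line.
        numwords = len(line.split())
        yield "words", numwords

    def reducer(self, key, values):
        # The number of values for the key equals the number of lines.
        totalL, totalW = 0, 0
        for numwords in values:
            totalL += 1
            totalW += numwords
        yield "avg", totalW / float(totalL)


if __name__ == '__main__':
    LineAverage.run()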

So for each line the mapper sends a pair ("words", x). The shuffle step groups these into a single entry, ("words": x1, x2, x3, ..., xnumberOfLines), which is the input for the reducer. I then just have to count the number of values for the key, and that count is the number of lines.
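To make the map, shuffle, and reduce flow described above easier to follow, here is a minimal sketch that imitates it in plain Python, without mrjob. The helper names `shuffle` and `run` are made up for this illustration and are not part of mrjob; the real framework does the grouping for you.

```python
from collections import defaultdict


def mapper(line):
    # Same idea as the mrjob mapper: one ("words", count) pair per line.
    yield "words", len(line.split())


def shuffle(pairs):
    # Hypothetical stand-in for the framework's shuffle step:
    # collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    # Counting the values gives the number of lines;
    # summing them gives the number of words.
    total_lines, total_words = 0, 0
    for count in values:
        total_lines += 1
        total_words += count
    yield "avg", total_words / float(total_lines)


def run(lines):
    # Hypothetical driver: map every line, shuffle, then reduce each key.
    pairs = [pair for line in lines for pair in mapper(line)]
    results = []
    for key, values in shuffle(pairs).items():
        results.extend(reducer(key, values))
    return results


sample = ["one two three", "four five", "six"]
print(run(sample))
```

The three sample lines have 3, 2, and 1 words, so the single "words" key reaches the reducer with three values, and the average is 6 / 3.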

Hope it will be helpful to someone.

2
Cheng Chen On

In your reducer, you already write key, sum(values) to the output files. You just need to read the output files into a Java/Scala program and calculate the average.
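The post-processing step suggested here can be done in any language. As a rough sketch, here it is in Python to match the rest of the thread. It assumes the job was run with mrjob's default output format, where each line is a JSON-encoded key, a tab, and a JSON-encoded value; the function name `average_from_output` is made up for this example.

```python
import json


def average_from_output(lines):
    # Parse lines such as '"words"\t6' into a {key: value} dict.
    # Assumes mrjob's default output: JSON key, tab, JSON value.
    totals = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        key_json, value_json = raw.split("\t", 1)
        totals[json.loads(key_json)] = json.loads(value_json)
    return totals["words"] / float(totals["lines"])


# Stand-in for the contents of the job's output file.
sample_output = ['"words"\t6', '"lines"\t3']
print(average_from_output(sample_output))
```

With a real run you would pass the lines of the output file instead of `sample_output`, for example by iterating over an open file object.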