pySpark convert a list or RDD element to value (int)


I'm using pySpark to count elements in a tokenized RDD. This is one of the elements:

('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case', 'ages', '3', '8', 'victory', 'multimedia'])

I have to count the total number of tokens across the full RDD. My code returns only one value, but as a single-element list.

I wrote this function to do that (it can be changed, of course, but the return statement must remain a single line):

def countTokens(RDD):
    return RDD.map(lambda x: (1, len(x[1]))).reduceByKey(lambda x, y: x + y).map(lambda x: int(x[1])).collect()

print(countTokens(aRecToToken))

print(countTokens(bRecToToken))

totalTokens = countTokens(aRecToToken) + countTokens(bRecToToken)

The result is:

[167]
[58]
There are [167, 58] tokens.

At this point I don't know how to use the result as a value (integer) and not as a list. My goal is to get

167
58    
There are 225 tokens.

I hope someone can help me.

Thank you in advance.

There is 1 answer

navarq
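def countTokens(RDD):
    # Parentheses let the single return expression span several lines.
    return (RDD.map(lambda x: (1, len(x[1])))
               .reduceByKey(lambda x, y: x + y)
               .map(lambda x: int(x[1]))
               .collect()[0])  # [0] takes the only item out of the list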

collect() returns a list, so your function returns [167] rather than the integer 167 you need. Appending [0] takes the first (zeroth) item of that list, so countTokens returns a plain integer, and adding the two results gives your total of 225.
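To see why the original total came out as a list, here is a minimal sketch in plain Python (no Spark needed). The variable names are made up for illustration; the values 167 and 58 are the ones from the question.

```python
# Stand-ins for what collect() returned in the question: one-element lists.
a_result = [167]
b_result = [58]

# "+" on two lists concatenates them, which is what produced [167, 58].
as_lists = a_result + b_result

# Indexing with [0] pulls out the integers, so "+" is ordinary addition.
as_ints = a_result[0] + b_result[0]

print(as_lists)
print("There are %d tokens." % as_ints)
```

The same reasoning applies to whatever collect() hands back: index into the list first, then add.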

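But you really do not need the constant key in

x: (1, len(x[1]))

If all you're doing is totaling, you just need to map each record to len(x[1]) and then reduce with the same addition function. Without keys this means reduce rather than reduceByKey, and reduce returns the value directly, so neither collect() nor [0] is needed.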
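The simplification above can be sketched without a cluster by using a plain Python list as a stand-in for the RDD. The first record is the one from the question; the second record and the helper name `count_tokens_local` are made up for illustration and are not part of PySpark. On a real RDD the equivalent would presumably be `RDD.map(lambda x: len(x[1])).reduce(lambda x, y: x + y)`, which yields an integer rather than a list.

```python
from functools import reduce

records = [
    # Record taken from the question: (id, list of tokens).
    ('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case',
                    'ages', '3', '8', 'victory', 'multimedia']),
    # Hypothetical second record, added only so the reduce has two inputs.
    ('example_id', ['two', 'tokens']),
]

def count_tokens_local(records):
    # Map each record to its token count, then add the counts together.
    # No (1, ...) key is needed because nothing is being grouped.
    counts = map(lambda x: len(x[1]), records)
    return reduce(lambda x, y: x + y, counts)

total = count_tokens_local(records)
print(total)
```

Because the result is already an integer, two such totals can be added directly to get the combined count.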