pySpark convert a list or RDD element to value (int)

Question

9.8k views Asked by umb60 At 23 June 2015 at 22:43

I'm using pySpark to count elements in a tokenized RDD. This is one of the elements:

('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case', 'ages', '3', '8', 'victory', 'multimedia'])

I have to count the total number of tokens across the full RDD. The function should return a single value, but mine returns it as a single-element list.

I wrote the function below to do that (of course it can be changed, but the body must remain a single line, the return statement):

def countTokens(RDD):
    return RDD.map(lambda x :(1,len(x[1]))).reduceByKey(lambda x,y:x+y).map(lambda x: int(x[1])).collect()

print countTokens(aRecToToken)

print countTokens(bRecToToken)

totalTokens = countTokens(aRecToToken) + countTokens(bRecToToken)

the result is:

[167]
[58]
There are [167, 58] tokens.

At this point I don't know how to use the result as a value (integer) rather than as a list. My goal is to get

167
58    
There are 225 tokens.

I hope someone can help me.

Thank you in advance.

There is 1 answer

**navarq** · Answer 1 · 2015-06-27T16:04:24+00:00

def countTokens(RDD):
    # Parentheses let the single return expression span several lines.
    return (RDD.map(lambda x: (1, len(x[1])))
               .reduceByKey(lambda x, y: x + y)
               .map(lambda x: int(x[1]))
               .collect()[0])

collect() returns a list, but you need the value inside it. Adding [0] gives you the zeroth item of that list, an integer, so the two counts add up to your total of 225.
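To see why the original total printed as [167, 58], here is a minimal sketch in plain Python, no Spark required. The names a_count and b_count are hypothetical stand-ins for the two values returned by collect() in the question.

```python
# Stand-ins for what collect() returned in the question: one-element lists.
a_count = [167]
b_count = [58]

# "+" on two lists concatenates them, which is what the question observed.
print(a_count + b_count)        # [167, 58]

# Indexing with [0] pulls out the integers, so "+" becomes arithmetic.
print(a_count[0] + b_count[0])  # 225
```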

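But you do not really need the key in

x:(1,

if all you're doing is totaling. You just need len(x[1]) in the map, and then a reduce like the one you already have (plain reduce rather than reduceByKey, since there is no key left). reduce returns the value directly, so collect() and [0] are no longer needed.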
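The keyless version might look like the sketch below. So that it runs without a Spark installation, it uses LocalRDD, a hypothetical list-backed stand-in that only mimics map and reduce. With real PySpark you would pass an actual RDD instead, since RDD.map and RDD.reduce are called the same way. The second record and its id are made up for illustration.

```python
from functools import reduce


class LocalRDD:
    """Hypothetical stand-in for an RDD, backed by a plain list."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return LocalRDD(f(x) for x in self.items)

    def reduce(self, f):
        return reduce(f, self.items)


def countTokens(RDD):
    # No (1, ...) key and no collect(): reduce hands back the total itself.
    return RDD.map(lambda x: len(x[1])).reduce(lambda x, y: x + y)


rec = LocalRDD([
    ('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case',
                    'ages', '3', '8', 'victory', 'multimedia']),
    ('hypothetical-id', ['two', 'tokens']),  # made-up second record
])

print(countTokens(rec))  # 13
```

Because the function now returns an integer, countTokens(aRecToToken) + countTokens(bRecToToken) is ordinary addition. One thing to keep in mind is that reduce has no starting value, so it fails on an empty input.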
