I'm using PySpark to count the tokens in a tokenized RDD. This is one of its elements:
('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case', 'ages', '3', '8', 'victory', 'multimedia'])
I have to count the total number of tokens in the full RDD. My code returns only one value, but as a single-element list.
There is a function to do that. I used this code (of course it can be changed, but the return statement must remain on a single line):
def countTokens(RDD):
    return RDD.map(lambda x: (1, len(x[1]))).reduceByKey(lambda x, y: x + y).map(lambda x: int(x[1])).collect()
print countTokens(aRecToToken)
print countTokens(bRecToToken)
totalTokens = countTokens(aRecToToken) + countTokens(bRecToToken)
The result is:
[167]
[58]
There are [167, 58] tokens.
At this point I don't know how to use the result as a value (an integer) and not as a list. My goal is to get:
167
58
There are 225 tokens.
I hope someone can help me.
Thank you in advance.
The function is returning a list when you need the value itself to get 225. Adding [0] to the end of the return expression will give you the zeroth item in the list, an integer, from which you can compute your total.
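To see why the total came out as [167, 58], here is a minimal plain-Python sketch that needs no Spark. The two lists are copied from the output in the question; the variable names are made up for illustration.

```python
# Values copied from the question's output: each call returned a one-element list.
a_result = [167]
b_result = [58]

# '+' on two lists concatenates them, which is what the question observed.
concatenated = a_result + b_result

# Indexing with [0] extracts the integer, so '+' becomes ordinary addition.
total = a_result[0] + b_result[0]

print(concatenated)                      # [167, 58]
print("There are %d tokens." % total)    # There are 225 tokens.
```

In the question's function, the same effect should come from ending the return line with .collect()[0].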
But you really do not need the (1, ...) key: if all you're doing is totaling, you just need len(x[1]) and then to reduce, like you have done.
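A rough sketch of that simpler version, assuming each record is an (id, tokens) pair as in the question. LocalRDD is a hypothetical stand-in class, written only so the sketch runs without a Spark installation; it mimics just the map and reduce methods of a real RDD. The second record is invented sample data.

```python
from functools import reduce


class LocalRDD(object):
    """Hypothetical stand-in for a Spark RDD (map and reduce only)."""

    def __init__(self, items):
        self.items = list(items)

    def map(self, f):
        return LocalRDD(f(x) for x in self.items)

    def reduce(self, f):
        return reduce(f, self.items)


def countTokens(rdd):
    # Map each (id, tokens) record to its token count, then add the counts up.
    # reduce returns a plain value rather than a list, so no [0] is needed.
    return rdd.map(lambda x: len(x[1])).reduce(lambda x, y: x + y)


records = LocalRDD([
    ('b00004tkvy', ['noah', 'ark', 'activity', 'center', 'jewel', 'case',
                    'ages', '3', '8', 'victory', 'multimedia']),
    ('hypothetical-id', ['some', 'made', 'up']),  # invented record
])

total = countTokens(records)
print("There are %d tokens." % total)
```

With a real RDD, the body of countTokens should work unchanged, since RDD.map and RDD.reduce take the same kind of function arguments, and the return statement still fits on a single line.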