zipWithIndex fails in PySpark

Question

466 views Asked by Hardik Gupta At 22 December 2016 at 15:27

I have an RDD like this

>>> termCounts.collect()
[(2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'), (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi')]

When I zip this to create a dictionary, it gives me seemingly random output:

>>> vocabulary = termCounts.map(lambda x: x[1]).zipWithIndex().collectAsMap()
>>> vocabulary
{'formulas': 5, 'good': 0, 'love': 2, 'modi': 9, 'big': 1, 'batsman': 6, 'sucks': 3, 'time': 7, 'virat': 8, 'sachin': 4}

Is this the expected output? I wanted to create a dictionary with each word as the key and its count as the value.

Original Q&A

There is 1 answer

**mrsrinivas** · Accepted Answer · 2016-12-22T17:07:53+00:00

You need to write it like this to get each word with its occurrence count:

vocabulary = termCounts.map(lambda x: (x[1], x[0])).collectAsMap()

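BTW, the code you have written is working as designed: zipWithIndex pairs each word with its index (position) in the RDD, not with its count, so the values 0 to 9 are positions rather than random numbers.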
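To see the difference without a Spark session, here is a minimal plain-Python sketch. It assumes a local list named term_counts (a hypothetical stand-in that mirrors termCounts.collect() from the question), uses enumerate as a rough analogue of zipWithIndex, and uses a dict of swapped pairs as a rough analogue of the map plus collectAsMap call above. It is only an illustration of the two results, not Spark code.

```python
# Local stand-in for termCounts.collect() (hypothetical name, no Spark needed).
term_counts = [
    (2, 'good'), (2, 'big'), (1, 'love'), (1, 'sucks'), (1, 'sachin'),
    (1, 'formulas'), (1, 'batsman'), (1, 'time'), (1, 'virat'), (1, 'modi'),
]

# Rough analogue of .map(lambda x: x[1]).zipWithIndex().collectAsMap():
# every word is paired with its position in the list, and the count is dropped.
word_to_index = {word: index for index, (_, word) in enumerate(term_counts)}

# Rough analogue of .map(lambda x: (x[1], x[0])).collectAsMap():
# every (count, word) pair is swapped so the word becomes the key.
word_to_count = {word: count for count, word in term_counts}

print(word_to_index['formulas'])  # position of 'formulas' in the list
print(word_to_count['good'])      # count stored alongside 'good'
```

The first dictionary matches the output shown in the question (for example 'formulas' maps to 5), while the second is the word-to-count mapping that was actually wanted.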
