Python function to convert a list RDD into a pair RDD with the unique words and their count?


How would I write a function to convert an RDD that is a list of words, such as ['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha'], into a pair RDD of the unique words and the number of times each appears? In this case the result would be [('Alpha', 2), ('Beta', 2), ('Gamma', 1)].

Answer by Padraic Cunningham:

Use a collections.Counter dict:

from collections import Counter
print(list(Counter(['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha']).items()))
[('Alpha', 2), ('Beta', 2), ('Gamma', 1)]
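
Since the question asks for a function, the Counter approach above can be wrapped in one. This is only a minimal sketch for a plain Python list: the name count_words is made up for illustration and is not part of any library.

```python
from collections import Counter


def count_words(words):
    """Return (word, count) pairs for an iterable of words.

    count_words is a hypothetical helper name used only for this sketch.
    """
    # Counter tallies each distinct word; items() yields (word, count) pairs.
    return list(Counter(words).items())


pairs = count_words(['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha'])
print(pairs)
```

The pairs come back in first-seen order on Python 3.7 and later; on older versions the order is arbitrary.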

If you want the pairs ordered from lowest to highest frequency, use reversed with .most_common:

from collections import Counter
l = ['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha']
print(list(reversed(Counter(l).most_common())))
[('Gamma', 1), ('Beta', 2), ('Alpha', 2)]
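
Note that reversing most_common() also reverses the order of words with equal counts, which is why 'Beta' comes before 'Alpha' above. If that matters, one possible alternative (a sketch, not the only way) is to sort the items by count directly:

```python
from collections import Counter

l = ['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha']
# sorted() is stable, so words with equal counts keep the order in which
# Counter stored them (first-seen order on Python 3.7 and later).
ascending = sorted(Counter(l).items(), key=lambda pair: pair[1])
print(ascending)
```

Here 'Gamma' still comes first, but the tied words stay in the order they first appeared in the list.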

If you want them in the order they first appear (plain dicts and Counter only guarantee this from Python 3.7 onward), use an OrderedDict:

from collections import OrderedDict
l = ['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha']
od = OrderedDict.fromkeys(l,0)

for ele in l:
    od[ele] += 1
print(list(od.items()))
[('Alpha', 2), ('Beta', 2), ('Gamma', 1)]
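
The snippets above work on a plain Python list. If the words really live in a Spark RDD and the result should itself be a pair RDD, the usual pattern is to map each word to (word, 1) and sum per key. The sketch below assumes PySpark and uses only the RDD methods map and reduceByKey; the function name to_word_counts is hypothetical.

```python
from operator import add


def to_word_counts(words_rdd):
    """Turn an RDD of words into a pair RDD of (word, count).

    to_word_counts is a hypothetical name; words_rdd is assumed to be a
    PySpark RDD whose elements are strings.
    """
    # Pair every word with 1, then add up the 1s for each distinct word.
    return words_rdd.map(lambda word: (word, 1)).reduceByKey(add)
```

Given an existing SparkContext sc, to_word_counts(sc.parallelize(['Alpha', 'Beta', 'Gamma', 'Beta', 'Alpha'])).collect() should return the three pairs, though reduceByKey makes no promise about their order, so sort the collected list if the order matters.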