Remove the tuple and create a new sorted list


I have an RDD, created with PySpark, that is around 600 GB after joining by key. It looks exactly like this:

[('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
 ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
 ('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
 ('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872'))]

I want something like this, sorted by the first element:

['0744632', '18090865', '2.4',
'0744632', '18090865', '2.4',
'0750567', '18042872', '7.2',
'0753877', '18050868', '7.2']

Is there a way I can get the data out of the tuples and produce the output in the required format?

Note: this is a 600 GB RDD with more than a million distinct values in the first column and approximately 15 billion rows, so I would really appreciate an optimized approach if possible.
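
For local testing, a minimal sample RDD with the same structure can be built like this (a sketch assuming an existing SparkContext named sc; the real data is of course far larger):

# Toy sample with the same nesting as the 600 GB RDD, for testing only
data = [('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
        ('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865')),
        ('43.25_-67.58', (('0753877', -67.58, 43.25, '7.2'), '18050868')),
        ('43.01_-75.24', (('0750567', -75.24, 43.01, '7.2'), '18042872'))]
rdd = sc.parallelize(data)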


There are 3 answers

AChampion (accepted answer)

Do this in your Spark cluster, e.g.:

In []:
(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))  # index 3 is the '2.4'/'7.2' field
 .sortBy(lambda x: x[0])
 .flatMap(lambda x: x)
 .collect())

Out[]:
['0744632', '18090865', '2.4', '0744632', '18090865', '2.4', '0750567',
 '18042872', '7.2', '0753877', '18050868', '7.2']

Alternatively

In []:
import operator as op

(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .reduce(op.add))

Out[]:
('0744632', '18090865', '2.4', '0744632', '18090865', '2.4', '0750567',
 '18042872', '7.2', '0753877', '18050868', '7.2')

This seems like a rather unwieldy structure; if you meant a list of tuples, simply eliminate the flatMap():

In []:
(rdd.map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]))
 .sortBy(lambda x: x[0])
 .collect())

Out[]:
[('0744632', '18090865', '2.4'),
 ('0744632', '18090865', '2.4'),
 ('0750567', '18042872', '7.2'),
 ('0753877', '18050868', '7.2')]
Kenan

This is a simple one-line solution (it operates on a plain Python list rather than an RDD):

sorted([(x[1][0][0], x[1][1], x[1][0][3]) for x in your_list]) 

I think it's slightly faster than a lambda-based solution, per this post: What is the difference between these two solutions - lambda or loop - Python.
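
A quick way to check that claim on a small sample (a minimal timeit sketch; absolute numbers will vary by machine):

import timeit

sample = [('43.72_-70.08', (('0744632', -70.08, 43.72, '2.4'), '18090865'))] * 1000

# Comprehension version (this answer)
t_comp = timeit.timeit(
    lambda: sorted([(x[1][0][0], x[1][1], x[1][0][3]) for x in sample]),
    number=100)

# Equivalent map/lambda version
t_lambda = timeit.timeit(
    lambda: sorted(map(lambda x: (x[1][0][0], x[1][1], x[1][0][3]), sample)),
    number=100)

print(t_comp, t_lambda)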

ags29

Similar to the other Spark answer (indexing explicitly, since tuple unpacking in lambdas was removed in Python 3):

rdd = rdd.map(lambda x: [x[1][0][0], x[1][1], x[1][0][3]])\
         .sortBy(lambda row: row[0])

You can also use reduce instead of flatMap:

rdd.reduce(lambda x, y: x + y)
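
Putting the two steps together on the sample data (a sketch; like collect(), reduce() materializes the whole result on the driver, so this only suits small outputs):

result = (rdd.map(lambda x: [x[1][0][0], x[1][1], x[1][0][3]])
             .sortBy(lambda row: row[0])
             .reduce(lambda x, y: x + y))  # concatenates the per-row lists
# result == ['0744632', '18090865', '2.4', '0744632', '18090865', '2.4',
#            '0750567', '18042872', '7.2', '0753877', '18050868', '7.2']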