PySpark Array Key/Value


I currently have an RDD containing key-value pairs where the key is the 2D index into an array and the value is the number at that position, for example [((0,0),1),((0,1),2),((1,0),3),((1,1),4)]. I want to replace each value with the sum of itself and the values at the surrounding positions. In my example, the values 1, 2 and 3 would be added up and placed at key (0,0). How would I do this?


1 Answer

Answer by nikkitousen:

I would suggest you do the following:

  1. Define a function that, given a pair (i,j), returns a list of the pairs corresponding to the positions surrounding (i,j), plus the input pair (i,j) itself. For instance, let's say the function is called surrounding_pairs(pair). Then:

    surrounding_pairs((0,0)) = [ (0,0), (0,1), (1,0) ]
    surrounding_pairs((2,3)) = [ (2,3), (2,2), (2,4), (1,3), (3,3) ]
    

    Of course, you need to be careful and return only valid positions.
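A minimal sketch of such a function, assuming the grid dimensions are known up front; the `n_rows`/`n_cols` parameters and the 4-neighbourhood (up/down/left/right, no diagonals, matching the question's 1+2+3 example) are assumptions for illustration:

```python
def surrounding_pairs(pair, n_rows=2, n_cols=2):
    """Return (i, j) plus its valid up/down/left/right neighbours.

    n_rows and n_cols are assumed grid dimensions; adjust for your data.
    """
    i, j = pair
    candidates = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    # Keep only positions that fall inside the grid.
    return [(r, c) for (r, c) in candidates
            if 0 <= r < n_rows and 0 <= c < n_cols]
```

For example, `surrounding_pairs((0, 0))` returns the three valid positions `(0, 0)`, `(1, 0)` and `(0, 1)`, dropping the out-of-bounds candidates.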

  2. Use a flatMap on your RDD as follows (note that Python 3 lambdas cannot unpack tuple arguments, so the pair is indexed instead):

    MyRDD = MyRDD.flatMap(lambda pv: [(p, pv[1]) for p in surrounding_pairs(pv[0])])
    

    This will map your RDD from [((0,0),1),((0,1),2),((1,0),3),((1,1),4)] to

    [((0,0),1),((0,1),1),((1,0),1),
     ((0,1),2),((0,0),2),((1,1),2),
     ((1,0),3),((0,0),3),((1,1),3),
     ((1,1),4),((1,0),4),((0,1),4)]
    

    This way, the value at each position will be "copied" to the neighbour positions.

  3. Finally, just use a reduceByKey to add the corresponding values at each position:

    from operator import add
    MyRDD = MyRDD.reduceByKey(add)
    

I hope this makes sense.
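Since running the above needs a live SparkContext, here is a plain-Python simulation of the same flatMap + reduceByKey pipeline on the question's data; the `surrounding_pairs` implementation is the sketch from step 1 and is an assumption, not code from the original answer:

```python
from collections import defaultdict

def surrounding_pairs(pair, n_rows=2, n_cols=2):
    # Valid 4-neighbourhood positions including (i, j) itself.
    i, j = pair
    candidates = [(i, j), (i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(r, c) for (r, c) in candidates
            if 0 <= r < n_rows and 0 <= c < n_cols]

data = [((0, 0), 1), ((0, 1), 2), ((1, 0), 3), ((1, 1), 4)]

# flatMap step: copy each value to its own position and its neighbours.
mapped = [(p, v) for (pos, v) in data for p in surrounding_pairs(pos)]

# reduceByKey(add) step: sum the values collected at each position.
sums = defaultdict(int)
for pos, v in mapped:
    sums[pos] += v

# (0, 0) ends up with 1 + 2 + 3 = 6, exactly as the question requested.
```

The final result is {(0,0): 6, (0,1): 7, (1,0): 8, (1,1): 9}.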