Confused about the behavior of the reduce function in MapReduce


I'm having problems with the following map-reduce exercise in Spark with Python. My map function returns the following RDD:

rdd = [(3, ({0: [2], 1: [5], 3: [1]}, set([2]))),
       (3, ({0: [4], 1: [3], 3: [5]}, set([1]))),
       (1, ({0: [4, 5], 1: [2]}, set([3])))]

I wrote a reducer function that is supposed to do some computations on tuples with the same key (in the example above, the first two have key = 3 and the last one has key = 1):

def Reducer(k, v):
    # k and v are expected to be two whole RDD rows that share the same key
    cluster = k[0]
    rows = [k[1], v[1]]
    g_p = {}
    I_p = set()
    for g, I in rows:
        g_p = CombineStatistics(g_p, g)  # my own helper that merges the dictionaries
        I_p = I_p.union(I)
    return (cluster, [g_p, I_p])

The problem is that I expected k and v to always have the same key (i.e. k[0] == v[0]), but that is not the case with this code.
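For reference, this is roughly how I'm calling it (assuming sc is the SparkContext; the exact call in my notebook may differ slightly):

result = sc.parallelize(rdd).reduce(Reducer)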

I'm working on the Databricks platform, and honestly it is a nightmare not being able to debug; sometimes not even print works. It's really frustrating to work in this environment.


1 Answer

Mariusz (accepted answer)

If you want to reduce an RDD based on keys, you should use reduceByKey instead of reduce. After switching to reduceByKey, keep in mind that the arguments passed to your function are the values only (k[1] and v[1] in your case), not whole RDD rows.
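A rough sketch of how the reducer could look after the switch (this assumes CombineStatistics can merge any two of your dictionaries, i.e. it is associative; the name merge_values is just for illustration):

def merge_values(v1, v2):
    # v1 and v2 are two values for the same key, each a (dict, set) pair
    g1, I1 = v1
    g2, I2 = v2
    return (CombineStatistics(g1, g2), I1.union(I2))

reduced = rdd.reduceByKey(merge_values)
# reduced now has one (key, (combined_dict, combined_set)) row per key

Returning a tuple keeps the merged value in the same shape as your map output, which matters because reduceByKey chains merges: the result of one merge can be fed back in as an argument to the next.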

Prints inside the reducer function will not work in a distributed environment on Databricks, because the function is evaluated on executors (inside the Amazon cloud). If you start Spark in local mode, all Python prints will work (but I'm not sure whether local mode is available on Databricks).
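If you just want to inspect a few rows, one option is to pull them back to the driver, where print does work, for example:

for row in reduced.take(5):   # or .collect() for a small RDD
    print(row)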