I'm having problems with the following map-reduce exercise in Spark with Python. My map function returns the following RDD:
rdd = [(3, ({0: [2], 1: [5], 3: [1]}, set([2]))),
       (3, ({0: [4], 1: [3], 3: [5]}, set([1]))),
       (1, ({0: [4, 5], 1: [2]}, set([3])))]
I wrote a reducer function that is supposed to do some computations on tuples with the same key (in the previous example, the first two have key 3 and the last has key 1):
def Reducer(k, v):
    cluster = k[0]
    rows = [k[1], v[1]]
    g_p = {}
    I_p = set()
    for g, I in rows:
        g_p = CombineStatistics(g_p, g)
        I_p = I_p.union(I)
    return (cluster, [g_p, I_p])
The problem is that I expect k and v to always have the same key (i.e. k[0] == v[0]), but that is not the case with this code.
I'm working on the Databricks platform, and honestly it is a nightmare not being able to debug; sometimes not even 'print' works. It's really frustrating to work in this environment.
If you want to reduce an RDD by key, you should use the reduceByKey transformation instead of reduce. After replacing the function name, keep in mind that the arguments passed to your reducer are the values of two rows that share a key (k[1] and v[1] in your case), not whole RDD rows.
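For reference, here is a minimal runnable sketch of that setup; the CombineStatistics body below is an assumption (a simple per-key list merge), since your actual implementation isn't shown:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Assumed stand-in for the asker's CombineStatistics:
# merges the per-key lists of two dictionaries.
def CombineStatistics(g1, g2):
    merged = dict(g1)
    for key, vals in g2.items():
        merged[key] = merged.get(key, []) + vals
    return merged

# With reduceByKey, the reducer receives two *values* that share a key,
# i.e. (dict, set) pairs, never the (key, (dict, set)) rows themselves.
def Reducer(v1, v2):
    g1, I1 = v1
    g2, I2 = v2
    return (CombineStatistics(g1, g2), I1.union(I2))

rdd = sc.parallelize([
    (3, ({0: [2], 1: [5], 3: [1]}, set([2]))),
    (3, ({0: [4], 1: [3], 3: [5]}, set([1]))),
    (1, ({0: [4, 5], 1: [2]}, set([3]))),
])

print(rdd.reduceByKey(Reducer).collect())
# e.g. [(1, ({0: [4, 5], 1: [2]}, {3})),
#       (3, ({0: [2, 4], 1: [5, 3], 3: [1, 5]}, {1, 2}))]  (order may vary)

Note that the reducer must be associative and commutative, because Spark may combine values in any grouping order across partitions. It must also return the same type it receives, so that the result can be fed back in as an argument.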
Prints inside the reducer function will not work in a distributed environment on Databricks, because the function is evaluated on the executors (inside the Amazon cloud). If you start Spark in local mode, all Python prints will work (but I'm not sure whether local mode is available on Databricks).
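If you do control the context (e.g. outside Databricks, for a quick local test), a local-mode sketch like this makes executor-side prints visible in the console; the master URL and app name here are just examples:

from pyspark import SparkContext

# local[*] runs the driver and executors in a single process, so print()
# inside a reducer shows up directly in the console.
sc = SparkContext(master="local[*]", appName="debug-prints")

def noisy_reducer(a, b):
    print("combining", a, b)  # visible in local mode
    return a + b

print(sc.parallelize([(1, 2), (1, 3), (2, 4)]).reduceByKey(noisy_reducer).collect())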