groupByKey not working properly in Spark

I have an RDD of key-value pairs like the following:

(Key1, Val1)
(Key1, Val2)
(Key1, Val3)
(Key2, Val4)
(Key2, Val5)

After groupByKey, I expect to get something like this:

Key1, (Val1, Val2, Val3)
Key2, (Val4, Val5)

However, the same keys still appear more than once after calling groupByKey(). The total number of key-value pairs is certainly reduced, but there are still many duplicate keys. What could be the problem?

The key type is a Java class whose fields are all integers. Could Spark be considering something other than the objects' fields when identifying them?

There is 1 answer

Daniel Darabos (best answer)

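groupByKey and many other methods in Spark rely on object hashes. If two instances of your class do not return the same hashCode, Spark will not consider them equal, even if all their fields are equal. A Java class that does not override these methods inherits the identity-based equals and hashCode from Object, so two separately created instances with identical fields are treated as different keys.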

Make sure you override both equals and hashCode!
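
As a rough illustration of the point above, here is a minimal sketch in plain Java. It does not use Spark: a HashMap stands in for hash-based grouping, and the names BrokenKey, FixedKey, and GroupingDemo are hypothetical, invented for this example. The key in the question is assumed to have two int fields; adapt the field list to the real class.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class with no overrides: it inherits the
// identity-based equals and hashCode from Object.
final class BrokenKey {
    final int a;
    final int b;

    BrokenKey(int a, int b) {
        this.a = a;
        this.b = b;
    }
}

// Same fields, but equality and hashing are defined on the field values.
final class FixedKey {
    final int a;
    final int b;

    FixedKey(int a, int b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof FixedKey)) return false;
        FixedKey other = (FixedKey) o;
        return a == other.a && b == other.b;
    }

    @Override
    public int hashCode() {
        // Must use exactly the fields that equals compares.
        return Objects.hash(a, b);
    }
}

final class GroupingDemo {
    // Groups values by key with a HashMap (a stand-in for Spark's
    // hash-based grouping, not Spark itself).
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, List<V>> groups = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return groups;
    }

    // Three pairs whose keys have identical fields but are distinct objects.
    static int brokenGroupCount() {
        List<Map.Entry<BrokenKey, String>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>(new BrokenKey(1, 2), "Val1"));
        pairs.add(new AbstractMap.SimpleEntry<>(new BrokenKey(1, 2), "Val2"));
        pairs.add(new AbstractMap.SimpleEntry<>(new BrokenKey(1, 2), "Val3"));
        return groupByKey(pairs).size();
    }

    static int fixedGroupCount() {
        List<Map.Entry<FixedKey, String>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>(new FixedKey(1, 2), "Val1"));
        pairs.add(new AbstractMap.SimpleEntry<>(new FixedKey(1, 2), "Val2"));
        pairs.add(new AbstractMap.SimpleEntry<>(new FixedKey(1, 2), "Val3"));
        return groupByKey(pairs).size();
    }

    public static void main(String[] args) {
        System.out.println(brokenGroupCount()); // 3: every instance is its own key
        System.out.println(fixedGroupCount());  // 1: all three values share one key
    }
}
```

With the overrides in place, keys built from the same field values land in one group. The same two methods are what a key class used with groupByKey would need to define.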