I have an RDD containing key-value pairs like the following:
(Key1, Val1)
(Key1, Val2)
(Key1, Val3)
(Key2, Val4)
(Key2, Val5)
After groupByKey(), I expect to get something like this:
Key1, (Val1, Val2, Val3)
Key2, (Val4, Val5)
However, even after calling groupByKey(), the same keys are still repeated. The total number of key-value pairs is certainly reduced, but many duplicate keys remain. What could be the problem?
The key type is a Java class whose fields are all integers. Could it be that Spark is considering something other than those fields when deciding whether two key objects are the same?
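For reference, here is a minimal sketch of the kind of key class I mean (the class and field names are made up for illustration). Note that it does not override equals() or hashCode():

```java
import java.io.Serializable;

// Hypothetical key class with integer fields, as described above.
// It inherits the default equals()/hashCode() from Object, which
// compare object identity, not field values.
public class CompositeKey implements Serializable {
    public final int userId;
    public final int sessionId;

    public CompositeKey(int userId, int sessionId) {
        this.userId = userId;
        this.sessionId = sessionId;
    }
}
```

Two instances built from the same field values are therefore not equal to each other under the default implementation.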
groupByKey and many other Spark methods rely on object hashing. If two instances of your class do not return the same hashCode, Spark will not consider them equal, even if all of their fields are equal. Make sure you override both equals and hashCode!
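As a sketch (class and field names are hypothetical), here is such a key class with both methods overridden. A plain HashMap uses the same hashCode/equals contract that Spark's groupByKey relies on when grouping keys, so it demonstrates the effect without a Spark cluster:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class; both equals() and hashCode() are
// overridden to depend only on the integer fields.
class IntPairKey implements Serializable {
    final int a;
    final int b;

    IntPairKey(int a, int b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IntPairKey)) return false;
        IntPairKey k = (IntPairKey) o;
        return a == k.a && b == k.b;
    }

    @Override
    public int hashCode() {
        return Objects.hash(a, b);
    }
}

public class GroupDemo {
    public static void main(String[] args) {
        // HashMap groups entries by hashCode/equals, the same contract
        // Spark uses when shuffling keys into partitions.
        Map<IntPairKey, List<String>> groups = new HashMap<>();
        groups.computeIfAbsent(new IntPairKey(1, 1), k -> new ArrayList<>()).add("Val1");
        groups.computeIfAbsent(new IntPairKey(1, 1), k -> new ArrayList<>()).add("Val2");
        groups.computeIfAbsent(new IntPairKey(2, 2), k -> new ArrayList<>()).add("Val3");
        // The two (1, 1) keys collapse into one group, so there are
        // 2 distinct keys instead of 3.
        System.out.println(groups.size()); // prints 2
    }
}
```

Without the two overrides, each new IntPairKey instance would hash to a different bucket and all three entries would end up under separate keys, which is exactly the duplicate-key symptom described in the question.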