I have an RDD containing key-value pairs like the following:
(Key1, Val1)
(Key1, Val2)
(Key1, Val3)
(Key2, Val4)
(Key2, Val5)
After groupByKey(), I expect to get something like this:
Key1, (Val1, Val2, Val3)
Key2, (Val4, Val5)
However, even after calling groupByKey(), the same keys are still repeated. The total number of key-value pairs is certainly reduced, but many duplicate keys remain. What could be the problem?
The key type is a Java class whose fields are all integers. Could it be that Spark is considering something other than those fields when deciding whether two key objects are the same?
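For reference, here is a minimal sketch of the kind of key class I mean (the class and field names are made up for illustration). Note that it does not override equals() or hashCode():

```java
import java.io.Serializable;

// Hypothetical key class with integer fields, as described above.
// It inherits the default equals()/hashCode() from Object, which
// compare object identity, not field values.
public class CompositeKey implements Serializable {
    public final int userId;
    public final int sessionId;

    public CompositeKey(int userId, int sessionId) {
        this.userId = userId;
        this.sessionId = sessionId;
    }
}
```

Two instances built from the same field values are therefore not equal to each other under the default implementation.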
groupByKey and many other Spark methods rely on object hashing. If two instances of your class do not return the same hashCode, Spark will not consider them equal, even if all of their fields are equal. Make sure you override both equals and hashCode!
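As a sketch (class and field names are hypothetical), here is such a key class with both methods overridden. A plain HashMap uses the same hashCode/equals contract that Spark's groupByKey relies on when grouping keys, so it demonstrates the effect without a Spark cluster:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical key class; both equals() and hashCode() are
// overridden to depend only on the integer fields.
class IntPairKey implements Serializable {
    final int a;
    final int b;

    IntPairKey(int a, int b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof IntPairKey)) return false;
        IntPairKey k = (IntPairKey) o;
        return a == k.a && b == k.b;
    }

    @Override
    public int hashCode() {
        return Objects.hash(a, b);
    }
}

public class GroupDemo {
    public static void main(String[] args) {
        // HashMap groups entries by hashCode/equals, the same contract
        // Spark uses when shuffling keys into partitions.
        Map<IntPairKey, List<String>> groups = new HashMap<>();
        groups.computeIfAbsent(new IntPairKey(1, 1), k -> new ArrayList<>()).add("Val1");
        groups.computeIfAbsent(new IntPairKey(1, 1), k -> new ArrayList<>()).add("Val2");
        groups.computeIfAbsent(new IntPairKey(2, 2), k -> new ArrayList<>()).add("Val3");
        // The two (1, 1) keys collapse into one group, so there are
        // 2 distinct keys instead of 3.
        System.out.println(groups.size()); // prints 2
    }
}
```

Without the two overrides, each new IntPairKey instance would hash to a different bucket and all three entries would end up under separate keys, which is exactly the duplicate-key symptom described in the question.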